Delay first retry in Transient Error Handling with Azure SQL #27826
Comments
You can change the delay between retries by overriding GetNextDelay. In most cases a retriable exception puts the connection in a Broken state, so it will be closed and reopened. The main reason for not retrying timeouts is that they could be caused by a query that uses too many resources, and retrying by default would hide this issue.
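As a rough illustration of the GetNextDelay override mentioned above, here is a minimal sketch. The class name `DelayedFirstRetryExecutionStrategy`, the five-second floor, and the retry count and maximum delay are all assumptions chosen for the example, not values taken from EF Core or from this thread.

```csharp
using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;

// Hypothetical class name; not part of EF Core.
public class DelayedFirstRetryExecutionStrategy : SqlServerRetryingExecutionStrategy
{
    // Assumed floor for every retry delay; tune for your workload.
    private static readonly TimeSpan MinimumDelay = TimeSpan.FromSeconds(5);

    public DelayedFirstRetryExecutionStrategy(ExecutionStrategyDependencies dependencies)
        : base(dependencies,
               maxRetryCount: 5,                        // assumed value
               maxRetryDelay: TimeSpan.FromSeconds(30), // assumed value
               errorNumbersToAdd: null)
    {
    }

    protected override TimeSpan? GetNextDelay(Exception lastException)
    {
        // The base implementation returns null once retries are exhausted.
        var delay = base.GetNextDelay(lastException);
        if (delay is null)
        {
            return null;
        }

        // Never wait less than the floor, so the first retry is delayed too.
        return delay.Value < MinimumDelay ? MinimumDelay : delay;
    }
}
```

A strategy like this would typically be registered through the provider options, for example `options.UseSqlServer(connectionString, o => o.ExecutionStrategy(deps => new DelayedFirstRetryExecutionStrategy(deps)))`.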
@AndriySvyryd Thanks for the reply. Since my original post, I've implemented a custom
Can you point me to the code that sets the connection state to Broken when a retriable exception occurs? I dug through the EFCore repo and couldn't find anything obvious. I'd like to confirm exactly what happens on an exception and modify my code accordingly.
That's handled by SqlClient
Note for triage: we should check the latest recommendations as part of the 7.0 process, since these have changed since the feature was implemented.
Keep in mind that different error types may need different timeouts. For example, while availability-related errors may justify a longer timeout (to allow the service to become available again), a deadlock or concurrency-related error should probably be retried immediately (possibly even adding a bit of random jitter to increase the chances it isn't reproduced).
Note from triage: also consider information here: #28200
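The idea of varying the delay by error type could be sketched as follows. This is only an illustration under stated assumptions: the class name `ErrorAwareExecutionStrategy` is hypothetical, the 0 to 100 ms jitter window is arbitrary, and `Random.Shared` requires .NET 6 or later. Error number 1205 is SQL Server's deadlock-victim error.

```csharp
using System;
using Microsoft.Data.SqlClient;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;

// Hypothetical class name; not part of EF Core.
public class ErrorAwareExecutionStrategy : SqlServerRetryingExecutionStrategy
{
    private const int DeadlockVictim = 1205; // SQL Server deadlock error number

    public ErrorAwareExecutionStrategy(ExecutionStrategyDependencies dependencies)
        : base(dependencies,
               maxRetryCount: 5,                        // assumed value
               maxRetryDelay: TimeSpan.FromSeconds(30), // assumed value
               errorNumbersToAdd: new[] { DeadlockVictim })
    {
    }

    protected override TimeSpan? GetNextDelay(Exception lastException)
    {
        // Let the base class decide whether any retries remain.
        var delay = base.GetNextDelay(lastException);
        if (delay is null)
        {
            return null;
        }

        // The SqlException may arrive wrapped (for example by DbUpdateException),
        // so check the inner exception as well.
        var sqlException = lastException as SqlException
            ?? lastException.InnerException as SqlException;

        if (sqlException?.Number == DeadlockVictim)
        {
            // Deadlocks: retry almost immediately, with a little random jitter
            // (assumed 0-100 ms window) so competing sessions do not collide again.
            return TimeSpan.FromMilliseconds(Random.Shared.Next(0, 100));
        }

        // Everything else keeps the base class's backoff.
        return delay;
    }
}
```

Availability-related errors fall through to the base backoff here; a longer, error-specific delay for them could be added in the same override.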
@martyt could you please share your custom |
@MNF this is what we're using now. I removed all the extra logic to close and reopen the connection since it turned out to be unnecessary. The only thing this strategy does differently from the built-in one is to use the
Note for triage: proposing we punt on this for 7.0, since I'm not sure changing delays so close to the release is a great idea. We should instead make this change early in 8.0 so we have a chance to react to any incoming changes in reliability or performance characteristics.
A few questions... @ajcvickers is this still on the radar for 8.0 given the last comment above? The advice highlighted in this issue seems to be specifically about Azure SQL; is there an existing mechanism EF can use to detect when it's connecting to Azure SQL and not on-premise SQL Server? I'm wondering if this issue is more of an enhancement than a bug. Since handling of -2 has also come up a few times in this issue, I have some questions/comments on that too. The code has this comment: efcore/src/EFCore.SqlServer/Storage/Internal/SqlServerTransientExceptionDetector.cs Lines 199 to 202 in 16c1380
@AndriySvyryd also said above that the reason not to retry is because the query might be using too many resources. This article says timeout exceptions can be due to connection or query issues and provides a way to differentiate. Perhaps EF needs a mechanism for determining if -2 should be retried. We get a fair few of these with Azure SQL and on investigation they seem to be connection related so could probably have been retried, but we don't want to just add -2 to the list in case it's the query kind. Then there's also the comment in the code about it possibly being successful, which adds further doubt around the matter.
Not currently. It is recommended to use different Execution strategies depending on the target database.
Adding this would be out of scope for this issue. I've opened #30023 to track it.
@AndriySvyryd could you help me understand what a bug fix for this current issue might look like?
@stevendarby We'll probably add an Azure-specific execution strategy that the user needs to choose explicitly
Detect Azure SQL based on connection string. Add more transient errors. Fixes #27826
Detect Azure SQL server based on connection string. Detect more transient errors. Fixes #27826
Increase the retry delay for throttling errors on Azure SQL. Detect Azure SQL server based on connection string. Detect more transient errors. Fixes #27826
@AndriySvyryd can I just check my reading of the changes around -2 is correct: it hasn't been added to the default list of codes to retry; however, if the user has added it as an additional code, it will be treated as a throttling error and different delays will apply? All good if so. My first read had me slightly concerned that -2 was added, so I just wanted to check.
That's correct, -2 is still not retried by default.
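For readers who do want timeouts retried, opting in means passing -2 as an additional error number. The sketch below shows one way that might look; the `AppDbContext` class, the retry count, and the maximum delay are assumptions for illustration. As noted earlier in the thread, retrying timeouts can hide a query that uses too many resources, so this is a deliberate trade-off rather than a recommended default.

```csharp
using System;
using Microsoft.EntityFrameworkCore;

// Hypothetical context class used only for this example.
public class AppDbContext : DbContext
{
    private readonly string _connectionString;

    public AppDbContext(string connectionString)
        => _connectionString = connectionString;

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlServer(
            _connectionString,
            sql => sql.EnableRetryOnFailure(
                maxRetryCount: 5,                        // assumed value
                maxRetryDelay: TimeSpan.FromSeconds(30), // assumed value
                // -2 is the timeout error number; it is not retried unless added here.
                errorNumbersToAdd: new[] { -2 }));
}
```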
The published advice on transient error handling for Azure SQL recommends
I've poked around in `SqlServerRetryingExecutionStrategy` and related classes, and can't find any evidence that either of those two recommendations is followed in EF Core, nor have I been able to figure out how I might implement those recommendations in a custom execution strategy.
Additionally, I've found that an `Execution Timeout Expired.` exception (error number -2) is explicitly not considered transient, yet it is the single most frequently occurring exception we encounter in our non-EF database code. The retry strategy we've implemented for that non-EF code closes and re-opens the connection before retrying the query and has completely eliminated failures due to timeout exceptions. I've had to add error number -2 to the `errorNumbersToAdd` list for EF Core, but, because the connection isn't closed and re-opened, I have zero expectation that retries for those errors will be successful.
Is there a plan to support the recommended transient error handling when targeting Azure SQL?
Is there a way I can implement a custom execution strategy that will close and re-open the database connection?