Delay first retry in Transient Error Handling with Azure SQL #27826


Closed
martyt opened this issue Apr 15, 2022 · 15 comments · Fixed by #31612
Labels: area-sqlserver, customer-reported, priority-bug, type-bug
Milestone: 8.0.0

Comments

@martyt

martyt commented Apr 15, 2022

The published advice on transient error handling for Azure SQL recommends

  • Delaying "several" seconds before the first retry
  • Closing and opening the connection prior to each retry

I've poked around in SqlServerRetryingExecutionStrategy and related classes, and can't find any evidence that either of those two recommendations is followed in EF Core, nor have I been able to figure out how I might implement those recommendations in a custom execution strategy.

Additionally, I've found that an "Execution Timeout Expired" exception (error number -2) is explicitly not considered transient, yet it is the single most frequently occurring exception we encounter in our non-EF database code. The retry strategy we've implemented for that non-EF code closes and re-opens the connection before retrying the query and has completely eliminated failures due to timeout exceptions. I've had to add error number -2 to the errorNumbersToAdd list for EF Core, but, because the connection isn't closed and re-opened, I have zero expectation that retries for those errors will be successful.

Is there a plan to support the recommended transient error handling when targeting Azure SQL?
Is there a way I can implement a custom execution strategy that will close and re-open the database connection?
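
For context, adding -2 as an extra retriable error number is typically done through EnableRetryOnFailure; a minimal sketch (the context class, connection string, and retry settings here are placeholders, not the actual configuration from this report):

using System;
using Microsoft.EntityFrameworkCore;

public class SampleDbContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlServer(
            "Server=...;Database=...;",   // placeholder connection string
            sqlOptions => sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 6,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: new[] { -2 }));   // treat timeout (-2) as retriable
}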

@AndriySvyryd
Member

You can change the delay between retries by overriding GetNextDelay

In most cases a retriable exception puts the connection in a Broken state, so it will be closed and reopened

The main reason for not retrying timeouts is that they could be caused by a query that uses too many resources, and retrying by default would hide this issue.
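
For illustration, a minimal sketch of delaying the first retry by overriding GetNextDelay as suggested above (the 5-second floor and the class name are arbitrary choices, not a recommendation):

using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;

public class DelayedFirstRetryExecutionStrategy : SqlServerRetryingExecutionStrategy
{
    public DelayedFirstRetryExecutionStrategy(ExecutionStrategyDependencies dependencies)
        : base(dependencies)
    {
    }

    protected override TimeSpan? GetNextDelay(Exception lastException)
    {
        var baseDelay = base.GetNextDelay(lastException);
        if (baseDelay == null)
            return null;   // retry limit reached

        // ExceptionsEncountered has exactly one entry before the first retry.
        return ExceptionsEncountered.Count == 1 && baseDelay.Value < TimeSpan.FromSeconds(5)
            ? TimeSpan.FromSeconds(5)
            : baseDelay;
    }
}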

@martyt
Author

martyt commented Apr 20, 2022

@AndriySvyryd Thanks for the reply. Since my original post, I've implemented a custom ExecutionStrategy that explicitly closes the connection, waits 5 seconds, and then re-opens it (via an override of OnRetry()) - it seems to be working fine, but if I don't need to go to all that trouble, I'm definitely interested in changing my implementation.

In most cases a retriable exception puts the connection in a Broken state, so it will be closed and reopened

Can you point me to the code that sets the connection state to Broken when a retriable exception occurs? I dug through the EFCore repo and couldn't find anything obvious. I'd like to confirm exactly what happens on an exception and modify my code accordingly.
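
(For illustration only, not the actual implementation mentioned above: an OnRetry override that closes the connection, waits, and reopens it might look roughly like this, assuming the current DbContext is reachable through the strategy's dependencies.)

using System;
using System.Threading;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;

public class CloseReopenOnRetryExecutionStrategy : SqlServerRetryingExecutionStrategy
{
    public CloseReopenOnRetryExecutionStrategy(ExecutionStrategyDependencies dependencies)
        : base(dependencies)
    {
    }

    protected override void OnRetry()
    {
        base.OnRetry();

        // Close, pause, and reopen the connection before the next attempt.
        var database = Dependencies.CurrentContext.Context.Database;
        database.CloseConnection();
        Thread.Sleep(TimeSpan.FromSeconds(5));
        database.OpenConnection();
    }
}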

@AndriySvyryd
Member

Can you point me to the code that sets the connection state to Broken when a retriable exception occurs?

That's handled by SqlClient

@ajcvickers
Contributor

Note for triage: we should check the latest recommendations as part of the 7.0 process, since these have changed since the feature was implemented.

@roji
Member

roji commented Apr 24, 2022

Keep in mind that different error types may need different timeouts. For example, while availability-related errors may justify a longer timeout (to allow the service to become available again), a deadlock or concurrency-related error should probably be retried immediately (possibly even adding a bit of random jitter to increase the chances it isn't reproduced).
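
(As a rough sketch of the jitter idea, nothing more: a near-immediate retry can be offset by a small random amount so two competing transactions are less likely to collide again.)

using System;

static class RetryJitter
{
    // Purely illustrative: a near-immediate retry delay with a small random offset.
    public static TimeSpan NextDelay()
        => TimeSpan.FromMilliseconds(Random.Shared.Next(0, 100));
}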

@ajcvickers
Contributor

Note from triage: also consider information here: #28200

@MNF

MNF commented Jul 7, 2022

Since my original post, I've implemented a custom ExecutionStrategy that explicitly closes the connection, waits 5 seconds, and then re-opens it (via an override of OnRetry()) - it seems to be working fine, but if I don't need to go to all that trouble, I'm definitely interested in changing my implementation.

@martyt could you please share your custom ExecutionStrategy implementation?

@martyt
Author

martyt commented Jul 7, 2022

@MNF this is what we're using now - I removed all the extra logic to close and reopen the connection since it turned out to be unnecessary. The only things this strategy does differently from the built-in one are to use the DecorrelatedJitterBackoffV2 method (from the Polly.Contrib.WaitAndRetry package) and to treat SQL timeout (-2) errors, IOException, and SocketException as transient errors that should be retried.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Sockets;
using Microsoft.Data.SqlClient;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;
using Polly.Contrib.WaitAndRetry; // provides Backoff.DecorrelatedJitterBackoffV2

/// <summary>
/// A custom SQL execution strategy for EF Core that uses the DecorrelatedJitterBackoffV2 method from
/// Polly.Contrib.WaitAndRetry to calculate the delay before the next retry.
/// </summary>
public class CustomEfCoreSqlExecutionStrategy : SqlServerRetryingExecutionStrategy
{
    // Materialized so every retry indexes into the same pre-computed sequence of jittered delays.
    private readonly IReadOnlyList<TimeSpan> _backoffDelays;

    public CustomEfCoreSqlExecutionStrategy(
        ExecutionStrategyDependencies dependencies,
        int maxRetryCount,
        TimeSpan maxRetryDelay,
        ICollection<int> errorNumbersToAdd
    ) : base(dependencies, maxRetryCount, maxRetryDelay, errorNumbersToAdd)
    {
        // Median first delay of 5 seconds; one delay per allowed retry.
        _backoffDelays = Backoff.DecorrelatedJitterBackoffV2(TimeSpan.FromSeconds(5), maxRetryCount).ToArray();
    }

    /// <summary>
    /// Gets the delay before the next retry using Polly's DecorrelatedJitterBackoffV2 sequence.
    /// </summary>
    protected override TimeSpan? GetNextDelay(Exception lastException)
    {
        // ExceptionsEncountered already includes the exception that triggered this call,
        // so the zero-based retry index is Count - 1.
        var currentRetryCount = ExceptionsEncountered.Count - 1;
        if (currentRetryCount < MaxRetryCount)
            return _backoffDelays[currentRetryCount];

        return null;
    }

    /// <summary>
    /// Experience has shown that we frequently encounter SqlExceptions that have an unknown error number but an
    /// IOException and/or SocketException as the inner exception, so we treat those as transient.
    ///
    /// Also, the default EF Core retry strategy explicitly excludes SQL timeout errors (-2), but those are the most
    /// frequent errors we see in the wild.
    /// </summary>
    protected override bool ShouldRetryOn(Exception exception)
    {
        // All I/O exceptions are considered transient.
        // Error number -2 (timeout) is also treated as transient, even though the base implementation says otherwise.
        var shouldRetry =
            exception is SqlException { InnerException: IOException or SocketException }
                or SqlException { Number: -2 } || base.ShouldRetryOn(exception);

        return shouldRetry;
    }
}
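
One way to plug a strategy like this in is via the ExecutionStrategy option on UseSqlServer; a rough sketch (the context class, connection string, and retry settings are placeholders):

using System;
using Microsoft.EntityFrameworkCore;

public class SampleSqlDbContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlServer(
            "Server=...;Database=...;",   // placeholder connection string
            sqlOptions => sqlOptions.ExecutionStrategy(dependencies =>
                new CustomEfCoreSqlExecutionStrategy(
                    dependencies,
                    maxRetryCount: 6,
                    maxRetryDelay: TimeSpan.FromSeconds(30),
                    errorNumbersToAdd: new[] { -2 })));
}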

AndriySvyryd changed the title from "Transient Error Handling with Azure SQL" to "Delay first retry in Transient Error Handling with Azure SQL" Aug 10, 2022
ajcvickers self-assigned this Sep 2, 2022
ajcvickers removed this from the 7.0.0 milestone Sep 6, 2022
@ajcvickers
Contributor

Note for triage: proposing we punt on this for 7.0, since I'm not sure changing delays so close to the release is a great idea. We should instead make this change early in 8.0 so we have a chance to react to any incoming changes in reliability or performance characteristics.

ajcvickers added the priority-bug label Sep 7, 2022
ajcvickers added this to the Backlog milestone Sep 7, 2022
@stevendarby
Contributor

A few questions...

@ajcvickers is this still on the radar for 8.0 given the last comment above?

The advice highlighted in this issue seems to be specifically about Azure SQL; is there an existing mechanism EF can use to detect when it's connecting to Azure SQL and not on-premise SQL Server? I'm wondering if this issue is more of an enhancement than a bug.

Since handling of -2 has also come up a few times in this issue, I have some questions/comments on that too.

The code has this comment:

// This exception can be thrown even if the operation completed successfully, so it's safer to let the application fail.
// DBNETLIB Error Code: -2
// Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated.
//case -2:

@AndriySvyryd also said above that the reason not to retry is because the query might be using too many resources. This article says timeout exceptions can be due to connection or query issues and provides a way to differentiate. Perhaps EF needs a mechanism for determining if -2 should be retried. We get a fair few of these with Azure SQL and on investigation they seem to be connection related so could probably have been retried, but we don't want to just add -2 to the list in case it's the query kind. Then there's also the comment in the code about it possibly being successful which adds further doubt around the matter.

ajcvickers removed this from the Backlog milestone Jan 10, 2023
@AndriySvyryd
Member

The advice highlighted in this issue seems to be specifically about Azure SQL; is there an existing mechanism EF can use to detect when it's connecting to Azure SQL and not on-premise SQL Server?

Not currently. It is recommended to use different Execution strategies depending on the target database.

This article says timeout exceptions can be due to connection or query issues and provides a way to differentiate. Perhaps EF needs a mechanism for determining if -2 should be retried.

Adding this would be out of scope for this issue. I've opened #30023 to track it.
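
As a rough sketch of choosing a strategy per target database (the Azure detection here is a naive connection-string check, purely for illustration; names and settings are placeholders):

using System;
using Microsoft.EntityFrameworkCore;

public class SampleMultiTargetDbContext : DbContext
{
    private const string ConnectionString = "Server=...;Database=...;";   // placeholder

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlServer(
            ConnectionString,
            sqlOptions => sqlOptions.ExecutionStrategy(dependencies =>
                ConnectionString.Contains("database.windows.net", StringComparison.OrdinalIgnoreCase)
                    // Azure SQL: longer delays and extra transient error numbers (e.g. -2).
                    ? new CustomEfCoreSqlExecutionStrategy(dependencies, 6, TimeSpan.FromSeconds(30), new[] { -2 })
                    // On-premises SQL Server: the stock retrying strategy.
                    : new SqlServerRetryingExecutionStrategy(dependencies)));
}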

@stevendarby
Contributor

stevendarby commented Jan 16, 2023

@AndriySvyryd could you help me understand what a bug fix for this current issue might look like?

@AndriySvyryd
Member

AndriySvyryd commented Jan 18, 2023

@stevendarby We'll probably add an Azure-specific execution strategy that the user needs to choose explicitly

ajcvickers removed their assignment Feb 4, 2023
AndriySvyryd added a commit that referenced this issue Sep 2, 2023
Detect Azure SQL based on connection string.
Add more transient errors.

Fixes #27826
AndriySvyryd added a commit that referenced this issue Sep 2, 2023
Detect Azure SQL server based on connection string.
Detect more transient errors.

Fixes #27826
AndriySvyryd added a commit that referenced this issue Sep 5, 2023
Increase the retry delay for throttling errors on Azure SQL.
Detect Azure SQL server based on connection string.
Detect more transient errors.

Fixes #27826
@stevendarby
Contributor

@AndriySvyryd can I just check that my reading of the changes around -2 is correct: it hasn't been added to the default list of error numbers to retry; however, if the user has added it as an additional error number, the strategy will treat it as a throttling error and apply different delays? All good if so. My first read had me slightly concerned that -2 had been added by default, so I just wanted to check.

@AndriySvyryd
Member

@AndriySvyryd can I just check that my reading of the changes around -2 is correct: it hasn't been added to the default list of error numbers to retry; however, if the user has added it as an additional error number, the strategy will treat it as a throttling error and apply different delays?

That's correct, -2 is still not retried by default.

ajcvickers modified the milestones: 8.0.0, 8.0.0-rc2 Sep 6, 2023
ajcvickers modified the milestones: 8.0.0-rc2, 8.0.0 Nov 14, 2023