Polly: Honouring retries from 429 response

I am attempting to capture a 429 response, along with the advised retry interval from the HTTP response, as outlined in the blog post here

I am using .Net Core, with version 5.1 (current latest) of Polly.

I have started trying to get a simple HTTPResponse captured, but ultimately I want to use Polly to handle retries for the .Net Core DocumentDB client.

The example code in the blog post is as follows:

var honoursRetryAfter = Policy<HttpResponseMessage>  
  .Handle<HttpException>
  .OrResult(r => statusCodesToRetry.Contains(r.StatusCode)) // detail elided
  .RetryAsync(retryCount: /* your preferred retry count */, 
         sleepDurationProvider: (retryAttempt, context) => 
             context.ContainsKey("RetryAfter") && context["RetryAfter"] != null
             ? (TimeSpan)context["RetryAfter"] 
             : defaultSleepProvider(retryAttempt)  // detail elided
  );

Unfortunately, this particular item seems to be more pseudocode than a usable example - there is no such named parameter as sleepDurationProvider for RetryAsync. So, I have tried WaitAndRetryAsync as follows:

            var honoursRetryAfterAsync = Policy<HttpResponseMessage>
                .Handle<Exception>()
                .OrResult(r => r.StatusCode.ToString().Contains("429"))
                .WaitAndRetryAsync(                    
                    retryCount: 3,
                    sleepDurationProvider: (retryAttempt, context) => context.ContainsKey("RetryAfter") && context["RetryAfter"] != null
                                                                          ? (TimeSpan)context["RetryAfter"]
                                                                          : TimeSpan.FromSeconds(0.5),
                    onRetryAsync: async (message, timespan, retryAttempt, context) => await DoSomething(context["httpClient"]));

However, it is becoming confusing, as there are so many overloads with different signatures, their availability depends on the full signature of the Policy being built, and some of the examples in the documentation do not compile, e.g.:

Policy
  .HandleResult<HttpResponseMessage>(r => r.StatusCode == 500) // Error - cannot use equality between r.StatusCode and int
  .OrResult<HttpResponseMessage>(r => r.StatusCode == 502) // Error - should not specify <HttpResponseMessage> again
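
For what it's worth, the nearest version of that snippet I can get to compile checks against the HttpStatusCode enum (or an int cast) and only specifies the result type once:

Policy
  .HandleResult<HttpResponseMessage>(r => r.StatusCode == HttpStatusCode.InternalServerError) // 500
  .OrResult(r => r.StatusCode == HttpStatusCode.BadGateway) // 502
  .RetryAsync(retryCount: 3);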

So, I am feeling as if I am just hacking things together in an inappropriate way 😉

I would really appreciate it if someone can point me in the right direction to get this feature working, as intended.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 32 (17 by maintainers)

Most upvoted comments

Hi,

Thanks @reisenberger.

For the next person looking for sample code:

            private readonly RetryPolicy _pollyRetryPolicy = Policy
                // Handle only throttling exceptions that carry a server-advised retry interval
                .Handle<DocumentClientException>(e => e.RetryAfter > TimeSpan.Zero)
                .WaitAndRetryAsync(
                    retryCount: 3,
                    // Sleep for the interval advised by DocumentDB (available from Polly v5.6.0,
                    // where the sleepDurationProvider receives the handled exception)
                    sleepDurationProvider: (i, e, ctx) =>
                    {
                        var dce = (DocumentClientException)e;
                        return dce.RetryAfter;
                    },
                    onRetryAsync: (e, ts, i, ctx) => Task.CompletedTask
                );

Let me know if you see something to improve 😃

Thanks @reisenberger for clarity! I had missed that point. Just wanna also say thanks for the quality of your responses throughout this thread, helped me learn better how Polly works.

What I learned from studying this thread, and from a call to Microsoft Support, was that 30 secs of retry is fine for our use case. Any more than that and we need to just kick down and pay for more RU’s 😃

@mcquiggd Polly v5.6.0 (uploading to nuget in next day or so) will include the ability for WaitAndRetry policies to define a sleepDurationProvider which can take the handled fault (exception; or returned result) as an input parameter to the sleepDurationProvider delegate.

This should make it even easier to use WaitAndRetry policies to wait for a ‘retry after’ duration specified by the called system; eg as CosmosDB does with HTTP response code 429 and its response headers.
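
For example (sketch only - the header parsing and the fallback back-off are illustrative), a result-handling policy could then honour a Retry-After header directly in the sleepDurationProvider:

var honourRetryAfter = Policy
    .HandleResult<HttpResponseMessage>(r => (int)r.StatusCode == 429)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: (attempt, outcome, ctx) =>
            outcome.Result?.Headers.RetryAfter?.Delta          // server-advised delay, if present
            ?? TimeSpan.FromSeconds(Math.Pow(2, attempt)),     // otherwise a default exponential back-off
        onRetryAsync: (outcome, timespan, attempt, ctx) => Task.CompletedTask);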

@mcquiggd

B1/B2 Handled results: When a policy is a strongly-typed generic Policy<TResult>, executions through the policy can only return results of type TResult. Given that, we should be able to pass context.LastHandledResult as TResult.

It would mean creating a new generic-typed Context<TResult> : Context. We would need to check for breaking changes, but given Context<TResult> : Context, my initial (shallow) assessment is that breaking changes would be avoided.

(thoughts on other categories coming separately)

@mcquiggd More detailed proposal for handling InnerExceptions of AggregateException, now under #256 : comment welcome!

@reisenberger

I should have been more precise in my previous question. Q in my mind was whether any one execution through DocumentDb might fail with either a DocumentClientException or AggregateException (as opposed to DocumentDb expressing that variety if called in different ways)

Yep, got it - that’s why I wanted to look at the Transient Fault Handler code and Document Client code to see under what circumstances (and why) there are Aggregate exceptions. I included the background for others reading the thread 😃

It would be a perfectly reasonable theory that the Async methods on DocumentClient throw AggregateExceptions, and the non-async methods throw DocumentClientExceptions. But as I say, now its time to code / examine code and see how things work 😉

@mcquiggd

B1/B2 Handling multiple exception types: Yes, a single policy can handle multiple unrelated exception types (say, TException and VException). And this power has some relatively common and useful Use Cases; eg using HttpClient w/ async, it’s common to want to handle both HttpRequestException and TaskCanceledException (for timeouts).
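
In code, that combination looks something like this (sketch):

var httpRetry = Policy
    .Handle<HttpRequestException>()
    .Or<TaskCanceledException>() // HttpClient surfaces timeouts as TaskCanceledException
    .RetryAsync(3);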

That power does make providing a strongly-typed TException to eg onRetry not a straight fit.

Thinking it around: the only option (which would be a major API change) would be moving the onRetry-style delegates of Polly into the .Handle<>(...) clause. But that carries other disadvantages. It weakens the meaning of the delegates (it would dissociate them from their current, nicely-stated onBreak, onRetry, onFallback roles). And it would introduce code duplication for the relatively common Use Case where you simply want to log something around the exception - that it eg caused the retry or circuit breakage. Not altogether satisfactory, in different ways.


At this stage, I’m left thinking that the need (sometimes) to disambiguate which exception you have received is a necessary corollary of the power to handle multiple exception types, encapsulated nicely in a single Policy. Open to thoughts, though, from anyone, if we’re missing some clever alternative?

(This discussion excludes AggregateException - comment following)

EDIT: Another option that does exist, with existing Polly, is to specify separate policies for each exception type. You can use the same policy type more than once in a wrap. Where the onRetry handling was known to want to differ for different exception types, this provides a mechanism. For example, this pattern is commonly used to set up one retry policy for re-authentication, and another for transient faults.
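
A rough sketch of that pattern (the exception types and the RefreshAuthToken() helper are purely illustrative):

var reauthRetry = Policy
    .Handle<AuthenticationException>()
    .RetryAsync(1, onRetry: (ex, attempt) => RefreshAuthToken()); // hypothetical re-authentication step

var transientRetry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var combined = Policy.WrapAsync(reauthRetry, transientRetry); // same policy type used twice in one wrap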

So, you’d have access to eg, context.LastHandledException and context.LastHandledResult - would this do the trick?

That would do very nicely, and I was going to suggest that - the existing number of overloads was a little intimidating at first, and of course changing the execution behaviour and breaking backwards compatibility is a non-starter.

@mcquiggd

A5. So, we could do this (add sleepDurationProvider overloads taking the exception / handled result), but I’m favouring adding the Exception / handled result to Context instead, to avoid proliferating further overloads (and it opens up the potential for other Use Cases to profit from the same info).

So, you’d have access to eg, context.LastHandledException and context.LastHandledResult - would this do the trick?

( Great to have this conversation and see the DocumentDb use case in detail: thanks for sharing! )

@mcquiggd Great to see this all coming together!

👍 using the Context to pass things (operation name; Id) into the call - exactly what it’s intended for.

A1. All Polly policies are fully thread-safe, so you’re absolutely fine re-using an instance across multiple calls (if relevant to any decisions around this).

A2. Whether you want continueOnCapturedContext or not should be driven simply by what is better for your Use Case / usage with DocumentDB (the usual considerations around how it influences async context marshalling). Polly doesn’t care either way. continueOnCapturedContext == true is usually only necessary if you specifically need execution after the ExecuteAsync(...) to continue on the same context as before (some actor frameworks; or needing to preserve a UI execution context).

A3/A4. Your analysis is exactly correct. Sleep duration is calculated before onRetry/Async is invoked - even though only used after - because it’s an input parameter to onRetry/Async. I’ve updated the Retry wiki to make this more accurate.

Foresaw your A3 problem as soon as you posted this, btw, but it was a busy day this end and you beat me to the follow-up 😀 and figured it out, so great 👍. This exact issue is why my example the day before was more circuitous!

( A5. comments coming in a moment )

@reisenberger

Made some changes this morning, including storing the id of the Document to be inserted via Context when ExecuteAsync is called, and that functions as expected, with data subsequently being available within the Policy.

A1. I changed the Policy declaration to be simply private - i.e. not static, not read only - I observed no change in behaviour.

A2. I have also set the parameter continueOnCapturedContext on ExecuteAsync to true - this does not appear to have any impact on results.

A3. I have verified that on retry attempt 1, there is no RetryAfter being returned in the Context, although it is returned from the server.

A4. On investigation, this is due to the sleepDurationProvider being called before the onRetryAsync delegate, which is not what I expected from the Retry lifecycle :

If another try is permitted, the policy: raises the onRetry delegate (if configured); in the case of a wait-and-retry policy, calculates the duration to wait from the IEnumerable<TimeSpan> or Func<int, TimeSpan> configured, and waits.

(screenshot: pollysettingretryinterval4)

A5. I don’t see an overload of the sleepDurationProvider that provides access to the Exception / handled object - which would allow simplified logic in this instance.

Any thoughts?

@reisenberger

Well, I have made a little progress.

Firstly, I have taken the DocumentDB Benchmark project (available from the DocumentDB GitHub repo, under samples), copied it and converted it to .Net Core, so we have both .Net 4.x and .Net Core 1.x versions in the same solution.

I am using this as a testbed for my ‘experiments’, with the latest DocumentDB emulator set to use a single, non-partitioned collection of 400 Request Unit capacity, the ‘slowest’.

I deliberately set the values for threads, and amount of documents to insert, to cause throttling.

With the DocumentDB Client retry, throttling was handled automatically.

Next, I disabled the DocumentDB Client retries; of course exceptions occurred. Bear in mind that these settings are ‘global’ if you follow the sound advice of using a single Client instance, which brings a lot of performance benefits but is also a little inflexible:


private static readonly ConnectionPolicy ConnectionPolicy = new ConnectionPolicy 
{ 
    ConnectionMode = ConnectionMode.Direct, 
    ConnectionProtocol = Protocol.Tcp, 
    RequestTimeout = new TimeSpan(1, 0, 0), 
    MaxConnectionLimit = 1000, 
    RetryOptions = new RetryOptions 
    { 
        MaxRetryAttemptsOnThrottledRequests = 0, // was 10
        MaxRetryWaitTimeInSeconds = 60
    } 
};

Then, I created a Polly Retry Policy, as follows:

private static readonly RetryPolicy PollyRetryPolicy = Policy
        .Handle<DocumentClientException>(de => de.RetryAfter > TimeSpan.Zero)
        .Or<AggregateException>(ae => ((ae.InnerException as DocumentClientException)?.RetryAfter ?? TimeSpan.Zero) > TimeSpan.Zero)
        .WaitAndRetryAsync(retryCount: 3,
                sleepDurationProvider: (retryAttempt, context) =>
                {
                    // Fall back to a nominal delay when no RetryAfter has been captured yet
                    var retryAfter = TimeSpan.FromMilliseconds(10); // Should be 0?
                    if (context == null || !context.ContainsKey("RetryAfter")) return retryAfter;
                    return (TimeSpan)context["RetryAfter"];
                },
                onRetryAsync: (exception, timespan, retryAttempt, context) =>
                {
                    // Stash the server-advised interval in Context; the sleepDurationProvider picks it up on the next attempt
                    context["RetryAfter"] = (exception as DocumentClientException)?.RetryAfter
                                            ?? (exception.InnerException as DocumentClientException)?.RetryAfter
                                            ?? TimeSpan.FromMilliseconds(1); // Just an arbitrarily chosen value, never seems to be used
                    return Task.CompletedTask;
                }
        );

So, basically I am using Context to pass the retry value from the onRetryAsync to the sleepDurationProvider, and using purely Polly to handle the waits etc. And now we can create different policies / settings for different ‘operations’, using the same Client instance, potentially passing the ‘operation name’ in Context from the call to Execute the Policy - if my understanding of the Retry life-cycle is correct. I have noticed that retry attempt 1 does not seem to include a server retry-after value, but I had limited time today to verify.
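
Passing that data might look something like this (a sketch - the contextData keys and the documentId local are illustrative):

var contextData = new Dictionary<string, object>
{
    ["Operation"] = "CreateDocument",
    ["DocumentId"] = documentId // the id of the document about to be inserted
};

var response = await PollyRetryPolicy.ExecuteAsync(
    ctx => client.CreateDocumentAsync(
               UriFactory.CreateDocumentCollectionUri(_appSettings.DatabaseName, _appSettings.CollectionName),
               newDictionary,
               new RequestOptions()),
    contextData);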

This Policy is then called in the multi-threaded benchmarking app as below:

var response  = await PollyRetryPolicy.ExecuteAsync(async () => await client.CreateDocumentAsync(
                                                                    UriFactory.CreateDocumentCollectionUri(_appSettings.DatabaseName, _appSettings.CollectionName),
                                                                    newDictionary,
                                                                    new RequestOptions()));

The result - Polly handles all the retries, observing the returned retry interval from the server:

(screenshot: pollysettingretryinterval)

I’ll continue to work on this (it can definitely be tidied up, I have a lot of distractions around at the moment), and upload the Solution to Github in case others wish to contribute.

Perhaps you can spot any obvious mistakes - e.g. is it valid to create the Policy as static readonly in a multithreaded environment?

I want to create some more advanced examples for the different scenarios available with Polly, along the lines of your last post - will pop onto Slack at some point.


Thanks for the thoughtful commentary @mcquiggd ^^ (two up). Agree on all counts!

Re your point 3 - wrapping a bunch of policies around a call as a small abstraction - check out Polly’s PolicyWrap (if not already seen it), designed for exactly this purpose. We also provide guidelines on ordering the policies within a wrap, although PolicyWrap is intentionally fully flexible, so any ordering can be used.
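
For reference, a minimal PolicyWrap sketch (policies and values purely illustrative), ordered roughly per those guidelines - retry outside the circuit-breaker, with a per-try timeout innermost:

var retry = Policy
    .Handle<DocumentClientException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var breaker = Policy
    .Handle<DocumentClientException>()
    .CircuitBreakerAsync(handledEventsAllowedBeforeBreaking: 4, durationOfBreak: TimeSpan.FromSeconds(30));

var timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(10));

var resilience = Policy.WrapAsync(retry, breaker, timeout); // outermost -> innermost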

+1 to your idea of:

architect a common set of Policies and patterns for handling errors - be they Blob Storage, Azure Search, DocumentDB, DynamoDB, ElasticSearch.

We would love to hear more about this later, if grows and you would like to share. We would love to eg blog/guest-blog about it, or share samples perhaps (duly credited) via a Polly.Contrib. (Contact me on Polly slack later if u want to discuss.)

One issue with the in-built retries for the various Azure services is that they are all different APIs - good in the sense that each is probably a good fit/strategy for its particular service, but a slightly fragmented API experience overall. Another issue with combining Polly policies with Azure’s in-built retries is that, with Polly-wraps-Azure-retry, you can effectively only have the retry innermost (out of the set of policies). However, you might want to have certain policy types inside a retry in some PolicyWraps.

Re Topaz and the Enterprise Application Blocks, yes, feels like it is well on the way out. The Microsoft patterns and practices team are now moving away from recommending Topaz, in favour of Polly (I’ve provided input on the Polly samples).

24 hours is a long time in technology - now we have Azure Cosmos DB đŸ„‡

From the documentation:

If you have more than one client cumulatively operating above the request rate, the default retry behavior may not suffice, and the client will throw a DocumentClientException with status code 429 to the application. In cases such as this, you may consider handling retry behavior and logic in your application’s error handling routines or increasing the reserved throughput for the container.

Having looked at the DocumentClient source code, I am currently testing an approach that disables the built-in retries, and uses Polly instead. Will report back if I make progress.

@mcquiggd Thank you also for pointing out the slips in the readme doco. Now fixed in #248 .

@mcquiggd. Yes, that was pseudo-code-ish (have re-labelled). Thanks for identifying the issues! Let’s see if we can sort it out. I’d also been meaning to ask @jmansar (who originally requested the feature, #177) or @tcsatheesh (who was also interested in using it with DocumentDb) if they wanted to expand on my blogged (pseudo-) code: both, feel free to chip in!

More coming shortly.