Polly: Decorrelated jitter does not jitter well
Summary: Jitter the retry delays to achieve low concurrency
I’m trying to use the suggested “decorrelated jitter” strategy, but in my experiments it does not achieve what it is supposed to.
First, `new Random()` is not quite random when it comes to reducing concurrency: you can get multiple instances with the same seed if many retries are issued simultaneously.
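For example (a minimal illustration; this assumes .NET Framework, where `new Random()` is seeded from the clock, so instances created within the same timer tick share a seed):

```csharp
using System;

// Two Random instances created back-to-back land in the same timer tick,
// share the same time-based seed, and so emit identical "random" sequences.
var a = new Random();
var b = new Random();
Console.WriteLine(a.Next() == b.Next()); // frequently True on .NET Framework
```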
The following code achieves much lower concurrency and a better distribution. In the attached screenshot you can see that the ‘advanced’ graph has a much lower maximum and a smoother distribution, which is highly desirable to avoid concurrency spikes.
```csharp
IEnumerable<TimeSpan> DecorrelatedExponent()
{
    // Seed from a GUID so simultaneous callers do not share a time-based seed.
    var r = new Random(Guid.NewGuid().GetHashCode());

    // The exponent itself advances by a random step in [0, 2) per attempt,
    // so both the growth rate and the delay are jittered.
    for (var softAttemptCount = 0.0; softAttemptCount < 4; )
    {
        softAttemptCount += r.NextDouble() * 2;
        yield return TimeSpan.FromSeconds(
            Math.Pow(2, softAttemptCount) * r.NextDouble());
    }
}
```
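For reference, a sketch of how such a sequence could be consumed (assuming Polly’s standard `WaitAndRetry` overload that accepts an `IEnumerable<TimeSpan>` of sleep durations):

```csharp
using Polly;
using System.Net.Http;

var policy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetry(DecorrelatedExponent()); // delays drawn from the jittered sequence
```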
About this issue
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 41 (21 by maintainers)
@reisenberger The algorithm by AWS also has a flaw. On my graphs it looks smoother, but it still has spikes, because it’s not truly decorrelated.
@grant-d The problem with clamping using min/max is that it ruins randomness, flooding the distribution with constants. As for the API guarantees, you can use math to prove the MAX and MIN. For example, the max recovery interval for a retry sequence is the sum of a geometric progression, no matter how you randomize inside, as long as the limited-range random variable does not participate in any asymptotic function such as 1/x.
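To make that concrete (a sketch of the bound, assuming a base delay of $d$ seconds, growth factor 2, and per-attempt jitter factors $u_k \in [0, 1]$):

$$\sum_{k=0}^{n-1} d \, 2^k u_k \;\le\; d \sum_{k=0}^{n-1} 2^k \;=\; d\,(2^n - 1)$$

so the MAX guarantee holds regardless of the randomization, precisely because $u_k$ enters only as a bounded multiplicative factor.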
My graphing framework turned out to be very useful for analyzing the problem. I will share it as promised once I clean up the code.
I’m currently busy preparing for my DotNext talk this week; I will follow up next week. I’ve put substantial effort into exploring this, and I will make sure we use solid math and have good visual proof of what’s happening in our formulas.
@george-polevoy Are you happy if we include your ‘soft exp, soft delay X’ algorithm in the Polly.Contrib.WaitAndRetry repo? (We can do all the necessary work to move the code there.)
@grant-d Yes, I mean immediately.
Instead of a minimum delay, there could be a half-decay period parameter, which is more meaningful for the exponential function. This would be the point in time at which the probability drops to 1/2. The exponential property guarantees that the number of requests arriving before the half-decay point equals the number arriving after it (the green and purple regions on the graph have the same area).
https://www.desmos.com/calculator/qmqaapanvc
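A sketch of how such a parameter might be sampled (assuming an exponentially distributed delay whose half-life is `halfDecay`, so that by construction half the retries land before that point and half after):

```csharp
using System;

// Inverse-transform sampling of an exponential distribution whose
// median (half-life) is halfDecay: P(delay <= halfDecay) = 1/2.
static TimeSpan SampleDelay(Random r, TimeSpan halfDecay)
{
    double u = 1.0 - r.NextDouble();            // u in (0, 1], avoids Log(0)
    double factor = -Math.Log(u) / Math.Log(2); // this factor has median 1
    return TimeSpan.FromTicks((long)(halfDecay.Ticks * factor));
}
```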
This revisits my previous comment on implementations of `System.Random`, to provide the promised references and clarify for the sake of accuracy and future readers. References: post by Stephen Toub; Jon Skeet 1; Jon Skeet 2.
The latest .NET Framework and .NET Core implementations do in fact differ, because the .NET Framework 4.7.2 implementation still (for backward compatibility, presumably) uses `Environment.TickCount` as the seed for `new Random()`, whereas .NET Core has moved on to a chained-RNGs approach. (The discussions on changing the algorithm were around the Knuth algorithm versus others, rather than the seed.)
The recommendation I linked pertains to .NET Framework, not .NET Core. However, in targeting .NET Standard, Polly can of course still (as is well known) be consumed by builds against .NET Framework 4.5 and up. So we should still take account of the .NET Framework recommendation (EDIT: as PR #536 does) if for jitter we are targeting maximum randomness.
Broadly, it is essential to make `Random` thread-safe, and these discussions embrace two different approaches to that, which exhibit different trade-offs: locking around a single shared `Random`; or one `Random` per thread via `[ThreadStatic]` or `ThreadLocal<T>`.
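For illustration, a minimal sketch of the per-thread approach (the seeding scheme is an assumption, chosen so threads never share a time-based seed):

```csharp
using System;
using System.Threading;

static class JitterRandom
{
    // Seeds for per-thread instances are drawn from one locked global
    // Random, so no two threads start from the same time-based seed.
    private static readonly Random Global = new Random();

    private static readonly ThreadLocal<Random> Local =
        new ThreadLocal<Random>(() =>
        {
            lock (Global) return new Random(Global.Next());
        });

    public static double NextDouble() => Local.Value.NextDouble();
}
```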
Alternatively, @grant-d, @george-polevoy, we could eliminate the upper-end bunching by modifying the latest implementation along these lines:
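(A sketch for illustration only; the exact snippet is not reproduced here, and the `seed`/`maxDelay` parameters are assumptions based on the surrounding discussion.)

```csharp
using System;
using System.Collections.Generic;

IEnumerable<TimeSpan> DecorrelatedJitter(TimeSpan seed, TimeSpan maxDelay)
{
    var random = new Random(Guid.NewGuid().GetHashCode());
    double ms = seed.TotalMilliseconds;
    while (true) // infinite sequence; callers take as many delays as needed
    {
        // The random factor scales the whole capped range, rather than
        // clamping with max(seed, ...), so no value is over-represented.
        ms = random.NextDouble() * Math.Min(maxDelay.TotalMilliseconds, ms * 3);
        yield return TimeSpan.FromMilliseconds(ms);
    }
}
```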
Is this worth modelling (to confirm smoothness) in George’s graphing system? Can anyone see disadvantages? One implication is that when `ms * 3 > MaxDelay.TotalMilliseconds`, there is no longer any in-built increase to the backoff. It speaks to the same problem previously discussed: users do need to choose `MaxDelay` significantly larger than `seed`. However, that lack of increasing backoff may be preferable to the upper-end bunching.
I see the problem with your code.
In addition to the min(cap, …) that the AWS team suggested in their blog, you’ve also added max(seed, …).
Say we have 100 requests failing at time t0, with seed = 1s and max = 3s.
What is the probability of scheduling a request at time t0 + 1s? It’s 1/3, because all of the values lower than t0 + 1s clamp. So you’ve got approx. 33 requests at the exact instant t0 + 1s. All of the other requests will be distributed uniformly. The next clamp will appear at t0 + 2s, where concurrency will be lower (1/3^2 = 1/9, approx. 11 requests) at the exact instant t0 + 2s.
That completely explains the first two spikes on the graph, which are clearly visible.
I’ve examined the code on the AWS blog, and they do something slightly different. They use random_between, which is not max(something, random.NextDouble()). Random-between should be a linear mapping, not limiting with max. Replace it with RandomBetween(x, y) = x + r.NextDouble() * (y - x) and the problem is gone.
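A minimal sketch of the difference (with illustrative bounds `x` and `y`):

```csharp
using System;

// Clamping: every draw below x collapses onto the constant x, creating a
// probability mass (spike) of size x/y at exactly x.
double Clamped(Random r, double x, double y) => Math.Max(x, r.NextDouble() * y);

// Linear mapping: the same draw is spread uniformly across [x, y],
// so no single value is over-represented.
double RandomBetween(Random r, double x, double y) => x + r.NextDouble() * (y - x);
```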
I will follow up with the modelling code I’m using.
@grant-d Randomness is not an issue here; I will share code so you can verify. My computations are now single-threaded and deterministic, using a fixed random seed and a shared Random instance, not using Polly. Also, system clock issues can be ruled out, because this is purely mathematical integration based on a deterministic pseudorandom sequence. The entire computation does not depend on .NET runtime issues; it’s computed offline. It’s just a math routine: I’m computing the delay-sequence function itself over virtual, regular time intervals.
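In the meantime, a minimal sketch of that kind of offline integration (the client count, bucket width, seed, and delay formula here are illustrative placeholders):

```csharp
using System;
using System.Linq;

// Deterministic, single-threaded model: every client "fails" at t0 = 0,
// draws one jittered delay, and we histogram how many land in each bucket.
const int clients = 100_000;
const double bucketSeconds = 0.1;
var r = new Random(12345); // fixed seed: the run is exactly reproducible
var histogram = new int[100];

for (int i = 0; i < clients; i++)
{
    double delay = 1.0 + r.NextDouble() * 2.0; // candidate formula goes here
    int bucket = (int)(delay / bucketSeconds);
    if (bucket < histogram.Length) histogram[bucket]++;
}

// The peak bucket count approximates the worst-case concurrency spike.
Console.WriteLine($"max concurrency per bucket: {histogram.Max()}");
```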
Circling back to our use case: given that we are calculating randomised delays of a far greater order of magnitude than any contention delay due to fine-grained locking, I guess we can safely follow the advice and not worry about the impact of locking.
@reisenberger Dylan, sure, there was a mistake: it was concurrency that is improved, not latency. I’ve updated the initial issue.
The vertical axis represents the number of experiments. A virtual mass-retry event of 100,000 simultaneous requests could result in a concurrency spike, which is what is illustrated. If I try to graph fewer experiments, the graph is just too noisy. I understand that the number of experiments is irrelevant, so it would be useful to factor it out and display a normalized probability distribution instead.
I will try to come up with complete benchmark code, so others can explore and experiment.
George, would you mind trying your experiment with the `Random` taken out of the method and instead declared as a `private static readonly Random` on the class? That way we share it across concurrent retries, so we won’t get the same seed (`DateTime.Now.Ticks`). It would be interesting to see how the graph compares. Also, see my abstraction in #526, if you are perhaps willing to use it in your test harness? Note that it too should be instantiated as a shared singleton, with each retry calling the factory `Create` method on it. See the sample code.
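For reference, a minimal sketch of that shared-instance shape (with a lock added, since `Random` itself is not thread-safe; the class name and delay formula are illustrative, not the #526 abstraction):

```csharp
using System;

class RetryDelayProvider
{
    // Shared across all concurrent retries, so every caller advances the
    // same pseudorandom sequence instead of re-seeding from the clock.
    private static readonly Random Shared = new Random();

    public TimeSpan NextDelay(TimeSpan seed)
    {
        double sample;
        lock (Shared) sample = Shared.NextDouble(); // Random is not thread-safe
        return TimeSpan.FromMilliseconds(seed.TotalMilliseconds * (1 + sample));
    }
}
```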