Polly: Decorrelated jitter does not jitter well

Summary: Jitter the retry delays to achieve low concurrency



I’m trying to use the suggested “decorrelated jitter” strategy (see the docs on jittering).

In my experiment, the suggested decorrelated jitter does not achieve what it is supposed to.

First, new Random() is not quite random when it comes to reducing concurrency: you can get multiple instances with the same seed if many retries are issued simultaneously.
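As a minimal illustration of the seeding problem (this assumes .NET Framework, where the default seed is time-based):

    // Two Random instances created within the same clock tick share a seed
    // and therefore produce identical sequences.
    var a = new Random();
    var b = new Random();
    Console.WriteLine(a.Next() == b.Next()); // frequently "True" on .NET Framework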

The following code achieves much lower concurrency and a better distribution. In the attached screenshot you can see that the ‘advanced’ graph has a much lower maximum and a smoother distribution, which is highly desirable for avoiding concurrency spikes.

IEnumerable<TimeSpan> DecorrelatedExponent()
{
    // Seed from a GUID so that concurrent calls do not share a time-based seed.
    var r = new Random(Guid.NewGuid().GetHashCode());
    for (var softAttemptCount = 0.0; softAttemptCount < 4; )
    {
        // Advance the exponent by a random amount in [0, 2) each attempt,
        // then scale the resulting power of two by another random factor.
        softAttemptCount += r.NextDouble() * 2;
        yield return TimeSpan.FromSeconds(
            Math.Pow(2, softAttemptCount) * r.NextDouble());
    }
}

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 41 (21 by maintainers)

Most upvoted comments

@reisenberger The algorithm by AWS also has a flaw. On my graphs it looks smoother, but it still has spikes, because it’s not truly decorrelated.

@grant-d The problem with clamping using Min/Max is that it ruins randomness, flooding the distribution with constants. As for the API guarantees, you can use math to prove the MAX and MIN. For example, the maximum recovery interval for a retry sequence is the sum of a geometric progression, no matter how you randomize inside, as long as the limited-range random variable does not participate in any asymptotic function such as 1/x.
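As a hedged worked example of that bound (the symbols here are illustrative): with a seed delay $s$, growth factor $2$, $n$ retries, and random factors $r_k \in [0, 1]$, the worst-case total recovery interval is a geometric sum,

$$\sum_{k=0}^{n-1} r_k \, s \, 2^k \;\le\; s \sum_{k=0}^{n-1} 2^k \;=\; s\,(2^n - 1),$$

so the MAX is provable regardless of what randomisation happens inside the terms.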

My graphing framework turned out to be very useful for analyzing the problem. I will share it, as promised, once I clean up the code.

I’m currently busy preparing for my DotNext talk this week, so I will follow up next week. I’ve put substantial effort into exploring this, and I will make sure we use strong math and have good visual proof of what’s happening in our formulas.

@george-polevoy Are you happy if we include your ‘soft exp, soft delay X’ algorithm in the Polly.Contrib.WaitAndRetry repo? (We can do all the necessary work to move the code there.)

@grant-d Yes, I mean immediately.

Instead of a minimum delay, there could be a half-decay period parameter, which is more meaningful for the exponential function. This would be the point in time at which the probability drops to 1/2. The exponential property guarantees that the number of requests arriving before the half-decay point equals the number arriving after it (the green and purple regions on the graph have the same area).

https://www.desmos.com/calculator/qmqaapanvc

[image: delay distribution graph; the green and purple regions have equal area]
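A hedged sketch of the math behind that parameter: if delays follow an exponential distribution with rate $\lambda$, the half-decay period is simply the median,

$$P(T \le t_{1/2}) = 1 - e^{-\lambda t_{1/2}} = \tfrac{1}{2} \quad\Longrightarrow\quad t_{1/2} = \frac{\ln 2}{\lambda},$$

so in expectation as many requests fire before $t_{1/2}$ as after it, matching the equal-area observation above.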

This revisits my previous comment on implementations of System.Random, to provide the promised references and to clarify for the sake of accuracy and future readers:

the current .NET Core implementation seems to use a static global instance of Random to provide random seeds for [ThreadStatic] instances of Random, which in turn are used to fulfil new Random() called on any thread. This approach was intended (according to various posts by Jon Skeet and Stephen Toub; see the references below) to improve randomisation behaviour for new Random() called on any thread.

References: post by Stephen Toub; Jon Skeet 1; Jon Skeet 2.

There are discussions on changing the algorithm for .NET Core/Standard, but no changes seem to have landed yet, afaics.

The latest .NET Framework and .NET Core implementations do in fact differ, because the .NET Framework 4.7.2 implementation still (presumably for backward compatibility) uses Environment.TickCount as the seed for new Random(), whereas .NET Core has moved on to a chained-RNGs approach. (The discussions on changing the algorithm were about the Knuth algorithm vs others, rather than the seed.)
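A minimal sketch of the chained-RNGs pattern described above (illustrative only, not the actual BCL source): a global Random, guarded by a lock, hands out seeds for one Random per thread.

    private static readonly Random GlobalSeedSource = new Random();
    [ThreadStatic] private static Random threadRandom;

    public static Random InstanceForThisThread()
    {
        // Lazily create one Random per thread, seeded from the shared source.
        if (threadRandom == null)
        {
            int seed;
            lock (GlobalSeedSource) { seed = GlobalSeedSource.Next(); }
            threadRandom = new Random(seed);
        }
        return threadRandom;
    }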

However, Jon Skeet seems to have rowed back from the chained-RNG approach, and the latest .NET Framework recommendation is to use a single instance application-wide (for maximum randomness), but with fine-grained locking or similar (because a single instance is not thread-safe for simultaneous access from multiple threads).

The recommendation I linked pertains to .NET Framework, not .NET Core. However, by targeting .NET Standard, Polly can of course still (as is well known) be consumed by builds against .NET Framework 4.5 and up. So we should still take account of the .NET Framework recommendation (EDIT: as PR #536 does) if, for jitter, we are targeting maximum randomness.


Broadly, it is essential to make Random thread-safe, and these discussions embrace two different approaches to that, which exhibit different trade-offs (a sketch of the locking option follows the list):

  • a fine-grained-lock on a single static instance of Random
    • emphasizes improved randomisation but introduces fine-grained locking (per acquisition of each random number), which could affect perf in high-concurrency scenarios.
    • recommend for Polly’s jitter use case: improved randomisation is important, to avoid spikes; delay due to lock contention is insignificant in the context of jitter delays applied to executions which have already failed anyway.
  • a chained-RNGs approach with thread-safety achieved by [ThreadStatic] or ThreadLocal<T>
    • doubts over its randomisation credentials being as strong as the preceding approach, but improved perf in high-concurrency scenarios due to no locking.
    • recommend for Polly’s chaos-injection use case; we would want randomised chaos-injection to have absolutely minimal drag on executions which weren’t randomly selected for chaos injection, so avoidance of locking is preferred; that consideration is more important than how randomly the chaos-impacted executions are selected.
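As referenced above, a minimal sketch of the fine-grained-lock option (the naming is illustrative): one application-wide Random, with each acquisition of a random number taken under a lock.

    public static class LockedRandom
    {
        private static readonly Random Shared = new Random();

        public static double NextDouble()
        {
            // Lock per acquisition: maximum randomness, at the cost of contention.
            lock (Shared) { return Shared.NextDouble(); }
        }
    }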

Alternatively, @grant-d, @george-polevoy, we could eliminate the upper-end bunching by modifying the latest implementation like this:

        double ms = MinDelay.TotalMilliseconds;
        for (int i = 0; i < retryCount; i++)
        {
            // Cap the ceiling at MaxDelay so a delay can never exceed the configured maximum.
            double ceiling = Math.Min(MaxDelay.TotalMilliseconds, ms * 3);
            ms = _random.Uniform(MinDelay.TotalMilliseconds, ceiling);

            yield return TimeSpan.FromMilliseconds(ms);
        }

Is this worth modelling (to confirm smoothness) in George’s graphing system? Can anyone see disadvantages? One implication is that when ms * 3 > MaxDelay.TotalMilliseconds, there is no longer any in-built increase to the backoff. It speaks to the same problem previously discussed: users do need to choose MaxDelay proportionally much larger than the seed. However, that lack of increasing backoff may be preferable to the upper-end bunching.

I see the problem with your code.

In addition to the min(cap, …) that the AWS team suggested in their blog, you’ve also added max(seed, …).

Say we have 100 requests that failed at time t0, with seed = 1s and max = 3s.

What is the probability of scheduling a request at time t0 + 1s? It’s 1/3, because all of the values lower than t0 + 1s clamp. So you get approx. 33 requests at the exact timing t0 + 1s. All of the other requests will be distributed uniformly. The next clamp will appear at t0 + 2s, where the concurrency will be lower (1/3^2 = 1/9, i.e. approx. 11 requests at the exact timing t0 + 2s).

That completely explains the first two spikes on the graph, which are clearly visible.

I’ve examined the code on the AWS blog, and they do something slightly different. They use random_between, which is not max(something, random.NextDouble()). Random-between should be a linear mapping, not a limit via max. Replace it with RandomBetween(x, y) = x + r.NextDouble() * (y - x) and the problem is gone.
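A hedged sketch of the difference (the method names are illustrative):

    // Clamping piles probability mass exactly at the minimum:
    // with min = 1 and max = 3, P(result == min) = 1/3, as computed above.
    static double Clamped(Random r, double min, double max)
        => Math.Max(min, r.NextDouble() * max);

    // A linear mapping keeps the distribution uniform over [min, max):
    static double RandomBetween(Random r, double min, double max)
        => min + r.NextDouble() * (max - min);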

I will follow up with the modelling code I’m using.

@grant-d Randomness is not an issue here; I will share code so you can verify. My computations are now single-threaded and deterministic, using a fixed random seed and a shared Random instance, and not using Polly. System clock issues can also be ruled out, because this is purely mathematical integration based on a deterministic pseudorandom sequence. The entire computation does not depend on runtime issues of .NET; it’s computed offline. It’s just a math routine: I’m computing the delay sequence function itself over virtual, regular time intervals.
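For illustration, a hypothetical sketch of that kind of offline simulation (the bucket sizes and counts are made up): a fixed seed and a shared Random make the run fully deterministic, with no clock and no Polly involved.

    var random = new Random(42);             // fixed seed => reproducible run
    var histogram = new int[40];             // 0.25 s buckets covering 10 s of virtual time
    for (int experiment = 0; experiment < 100_000; experiment++)
    {
        double t = 0.0;
        for (var soft = 0.0; soft < 4.0; )
        {
            soft += random.NextDouble() * 2;
            t += Math.Pow(2, soft) * random.NextDouble();   // delay in virtual seconds
            int bucket = (int)(t / 0.25);
            if (bucket < histogram.Length) histogram[bucket]++;
        }
    }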

Circling back to our use case: given we are calculating randomised delays of a far greater order of magnitude than any delay due to lock contention, I guess we can safely follow that advice and not worry about the impact of locking.

@reisenberger Dylan, sure, there was a mistake: it is concurrency that is improved, not latency. I’ve updated the initial issue.

The vertical axis represents the number of experiments. A virtual massive retry event of 100,000 simultaneous requests can result in a concurrency splash, which is what is illustrated. If I try to graph fewer experiments, the graph is just too noisy. I understand that the absolute number of experiments is irrelevant, so it would be useful to factor it out and display a normalized probability distribution instead.

I will try to come up with complete benchmark code, so others can explore and experiment.

George, would you mind trying your experiment with the Random taken out of the method and instead declared as a private static readonly Random on the class? That way we share it across concurrent retries, so we won’t get the same seed (DateTime.Now.Ticks). It would be interesting to see how the graph compares. Also, see my abstraction in #526; would you perhaps be willing to use it in your test harness? Note that it too should be instantiated as a shared singleton, with each retry calling the factory Create method on it. See the sample code.
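A hedged sketch of the suggested change (the class name is hypothetical): the Random moves out of the iterator and is shared across retries. Note that, per the thread-safety discussion above, simultaneous multi-threaded access to a shared Random would additionally need locking.

    public class JitterExperiment
    {
        // Shared across concurrent retries, so no two retries get the same time-based seed.
        private static readonly Random SharedRandom = new Random();

        public IEnumerable<TimeSpan> DecorrelatedExponent()
        {
            for (var softAttemptCount = 0.0; softAttemptCount < 4.0; )
            {
                softAttemptCount += SharedRandom.NextDouble() * 2;
                yield return TimeSpan.FromSeconds(
                    Math.Pow(2, softAttemptCount) * SharedRandom.NextDouble());
            }
        }
    }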