IdGen: Debugging InvalidSystemClockExceptions

We’re running into sporadic InvalidSystemClockExceptions that end up requiring restarts, and we’re running out of ideas trying to find where it goes wrong.

one error example is: IdGen.InvalidSystemClockException: Clock moved backwards or wrapped around. Refusing to generate id for 4294446821 ticks from reading the code, this implies the new timestamp is behind the last generated one by 50 days, which seems too big of a difference even if there was a clock sync issue (we’ve checked and our k8s nodes are running chrony and the recorded sync errors are nowhere near that much of a difference).

The issue seems to happen once a month or so, and forces us to manually restart pods as it blocks all writes, and waiting the ticks for it to restart would take days.

Our timesource implementation

public class ClockTimeSource : ITimeSource
{
    private long previousTimestamp = 0;

    public DateTimeOffset Epoch { get; }
    public TimeSpan TickDuration => TimeSpan.FromMilliseconds(1);

    public ClockTimeSource() { }
    public ClockTimeSource(DateTimeOffset epoch) => Epoch = epoch;

    public long GetTicks()
    {
        var timestamp = (long)(DateTimeOffset.UtcNow - Epoch).TotalMilliseconds;
        if (timestamp > previousTimestamp)
        {
            previousTimestamp = timestamp;
        }
        return previousTimestamp;
    }
}

Would appreciate any pointers on where to start debugging this.

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Yes, but is IdGen the only thing accessing your ClockTimeSource? Or are you using the ClockTimeSource for other purposes as well?

@hugh-maaskant great spot! 👍

Hi, just chiming in here …

If GetTicks() is called from multiple threads, I believe there also are two problems in its implementation:

  1. The check timestamp > previousTimestamp and the assignment of the new timestamp value need to be atomic. In the really unlikely case that the thread checking is scheduled out just after the check and then it takes a long time to get rescheduled (more than the tick time - so as I said unlikely) than it could reassign a lower timestamp value back to the previousTimestamp that was meanwhile update twice by other threads
  2. The assignment of a long value itself is not atomic on all processor architectures, so (again unlikely) it could get the upper half of one value and the lower half of another value (from another thread).

Both of these are easily (and fairly cheaply) avoided by using a lock() statement: https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/statements/lock.

Good luck 😃

If you find out the problem, please let me know? I’m curious to what it was.

Also out of curiosity: your epoch is 2022-10-22 23:04:27 +00:00. This is intentional? (Just checking because it’s a bit of a weird date/time to be an epoch…).