nats-server: JetStream first sequence mismatch error keeps occurring on every restart/deployment

Defect


*Note: trace logging is not enabled, since reproducing this issue requires constant messaging (both adds and deletes), which spams the log output. Debug logging is included.

Versions of nats-server and affected client libraries used:

  • nats-server 2.9.2, running nats:2.9.2-alpine
  • nats-box, running natsio/nats-box:0.13.0 (no changes made to the Helm chart)

OS/Container environment:

  • AWS EKS cluster
  • NATS instances running on separate c6i.xlarge machines, clustered with 3 nodes
  • nats-box on a separate machine (in our setup, an r5.large)
  • Using Helm with https://github.com/nats-io/k8s, with minimal config:
cluster:
  enabled: true
  replicas: 3

nats:
  image: nats:2.9.2-alpine

  logging:
    debug: true

  jetstream:
    enabled: true

    memStorage:
      enabled: true
      size: 5Gi

    fileStorage:
      enabled: true
      size: 30Gi
      storageClassName: gp3-retained

  resources:
    requests:
      cpu: 3
      memory: 6Gi
    limits:
      # cpu: 4
      memory: 8Gi

  # Sets GOMEMLIMIT environment variable which makes the Go GC be aware of memory limits
  # from the container. Recommended to be set to about 90% of the resource memory limits.
  gomemlimit: 7GiB

*Note: gp3-retained is a Kubernetes StorageClass using gp3 disks and the Retain reclaim policy.

Steps or code to reproduce the issue:

  • Set up a file-based R3 stream with a TTL (a 1-minute TTL is used in this example so there is no need to wait long; our real setup uses a 1-hour TTL); an equivalent Go setup is sketched after this list:
nats str add benchstream --storage=file --replicas=3 --retention=limits --max-age=1m --max-bytes=-1 --max-consumers=-1 --max-msg-size=-1 --max-msgs=-1 --max-msgs-per-subject=-1 --discard=old --dupe-window=1m --subjects=js.bench --allow-rollup --deny-delete --no-deny-purge
  • Start the benchmark:
nats bench js.bench --pub=1 --msgs=10000000 --pubsleep=0.00001s --size=1KB --pubbatch=1000
  • The benchstream stream persists messages for 1 minute, after which it also starts removing them; the benchmark keeps running for a long time. After at least a minute has passed, remove one instance (it is unclear whether it matters if the instance is the leader; this was tested by restarting a non-leader instance).
  • While the instance is restarting, the following messages are displayed:
[7] 2022/10/03 13:43:30.411232 [WRN] Error applying entries to '$G > benchstream': first sequence mismatch
[7] 2022/10/03 13:43:30.414851 [WRN] Resetting stream cluster state for '$G > benchstream'
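
For reference, here is a rough Go equivalent of the stream setup and constant publish load above, as a minimal sketch using the nats.go JetStream client; the connection URL and publish pacing are assumptions mirroring the CLI flags:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Assumption: the cluster is reachable on localhost (e.g. via a port-forward).
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Mirrors the `nats str add benchstream ...` flags above.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:              "benchstream",
		Subjects:          []string{"js.bench"},
		Storage:           nats.FileStorage,
		Replicas:          3,
		Retention:         nats.LimitsPolicy,
		MaxAge:            time.Minute, // 1-minute TTL for this repro; 1 hour in the real setup
		MaxBytes:          -1,
		MaxConsumers:      -1,
		MaxMsgSize:        -1,
		MaxMsgs:           -1,
		MaxMsgsPerSubject: -1,
		Discard:           nats.DiscardOld,
		Duplicates:        time.Minute,
		AllowRollup:       true,
		DenyDelete:        true,
		DenyPurge:         false,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Keep a constant publish load going, roughly like `nats bench js.bench --pub=1 ...`.
	payload := make([]byte, 1024)
	for {
		if _, err := js.Publish("js.bench", payload); err != nil {
			log.Printf("publish error: %v", err)
		}
		time.Sleep(10 * time.Microsecond)
	}
}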

Expected result:

The restarted instance should only need to catch up with the messages that it missed during its downtime.

In this example a 1-minute TTL is used; our setup uses a 1-hour TTL, with the same result.

Actual result:

The stream state is lost on this replica and needs to be reconstructed, which makes the restart/deployment take longer.

Files:

  • Before restart: nats_logs_before.txt
  • After restart: nats_logs_after.txt

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (19 by maintainers)

Most upvoted comments

Since upgrading to 2.9.5 it has all been working perfectly, thanks again! (We’ve already upgraded to 2.9.6, and it’s still working like a charm.)

I can confirm that we are seeing this issue as well. Thanks for reporting it @MauriceVanVeen and for fixing it @kozlovic.

Ok, I have a server test case that reproduces the issue. I will discuss with @derekcollison next week to see what the next steps are. Thank you for your patience.

FYI: I was able to reproduce the “first sequence mismatch” that results in resetting the cluster state using manual tests. With debugging, I think I understand what is going on and will later try to write a unit test that reproduces it. I am not sure there is much we can do to prevent it, but we will discuss internally with @derekcollison to see if there is a way to mitigate it.

I am trying to write a server test that would reproduce the issue. This error means that the snapshot the node receives has a first sequence that is lower than the first sequence in its store, so the node needs to clear the state. Suppose leader has a snapshot that says first sequence is 100 and last is 1000, but the follower that receives this snapshot has its first sequence being 150, then it assumes that it is missing messages 100 to 149 and due to file store being a message log (append-only), then it needs to wipe it states and sync with the leader. My suspicion is because of message TTL, there is a race where this snapshot is actually no longer valid in that the follower may have already discarded “old” messages? So far I have not been able to write a “simple” server test that demonstrates the issue. I will update when I am able to do so.