nats-server: JetStream first sequence mismatch error keeps occurring on every restart/deployment

Defect


*Note: trace logging is not enabled, since reproducing this issue requires constant messaging (both adds and deletes), which spams the log output. Debug logging is included.

Versions of nats-server and affected client libraries used:

  • nats-server 2.9.2, running nats:2.9.2-alpine
  • nats-box, running natsio/nats-box:0.13.0 (no changes made to the Helm chart)

OS/Container environment:

  • AWS EKS cluster
  • NATS instances running on separate c6i.xlarge machines, clustered with 3 nodes
  • nats-box on a separate machine (in our setup, an r5.large)
  • Using Helm with https://github.com/nats-io/k8s, with minimal config:
cluster:
  enabled: true
  replicas: 3

nats:
  image: nats:2.9.2-alpine

  logging:
    debug: true

  jetstream:
    enabled: true

    memStorage:
      enabled: true
      size: 5Gi

    fileStorage:
      enabled: true
      size: 30Gi
      storageClassName: gp3-retained

  resources:
    requests:
      cpu: 3
      memory: 6Gi
    limits:
      # cpu: 4
      memory: 8Gi

  # Sets GOMEMLIMIT environment variable which makes the Go GC be aware of memory limits
  # from the container. Recommended to be set to about 90% of the resource memory limits.
  gomemlimit: 7GiB

*Note: gp3-retained is a Kubernetes StorageClass using gp3 disks and the Retain reclaim policy.

Steps or code to reproduce the issue:

  • Set up a file-based R3 stream with a TTL (a 1-minute TTL is used in this example so there is no need to wait long; our real setup uses a 1-hour TTL); an equivalent Go setup is sketched after this list:
nats str add benchstream --storage=file --replicas=3 --retention=limits --max-age=1m --max-bytes=-1 --max-consumers=-1 --max-msg-size=-1 --max-msgs=-1 --max-msgs-per-subject=-1 --discard=old --dupe-window=1m --subjects=js.bench --allow-rollup --deny-delete --no-deny-purge
  • Start the benchmark:
nats bench js.bench --pub=1 --msgs=10000000 --pubsleep=0.00001s --size=1KB --pubbatch=1000
  • The benchstream stream persists messages for 1 minute, after which it also starts removing them; the benchmark keeps running for a long time. After at least a minute has passed, remove one instance (it is unclear whether it matters if the instance is the leader; this was tested by restarting a non-leader instance).
  • While the instance is restarting, the following messages are displayed:
[7] 2022/10/03 13:43:30.411232 [WRN] Error applying entries to '$G > benchstream': first sequence mismatch
[7] 2022/10/03 13:43:30.414851 [WRN] Resetting stream cluster state for '$G > benchstream'
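
For reference, here is a rough Go equivalent of the stream setup and constant publish load above, as a minimal sketch using the nats.go JetStream client; the connection URL and publish pacing are assumptions mirroring the CLI flags:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Assumption: the cluster is reachable on localhost (e.g. via a port-forward).
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Mirrors the `nats str add benchstream ...` flags above.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:              "benchstream",
		Subjects:          []string{"js.bench"},
		Storage:           nats.FileStorage,
		Replicas:          3,
		Retention:         nats.LimitsPolicy,
		MaxAge:            time.Minute, // 1-minute TTL for this repro; 1 hour in the real setup
		MaxBytes:          -1,
		MaxConsumers:      -1,
		MaxMsgSize:        -1,
		MaxMsgs:           -1,
		MaxMsgsPerSubject: -1,
		Discard:           nats.DiscardOld,
		Duplicates:        time.Minute,
		AllowRollup:       true,
		DenyDelete:        true,
		DenyPurge:         false,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Keep a constant publish load going, roughly like `nats bench js.bench --pub=1 ...`.
	payload := make([]byte, 1024)
	for {
		if _, err := js.Publish("js.bench", payload); err != nil {
			log.Printf("publish error: %v", err)
		}
		time.Sleep(10 * time.Microsecond)
	}
}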

Expected result:

The restarted instance should only need to catch up with the messages that it missed during its downtime.

In this example a 1-minute TTL is used; our setup uses a 1-hour TTL, with the same result.

Actual result:

The stream state is lost on this replica and needs to be reconstructed, which makes the restart/deployment take longer.

Files:

  • Before restart: nats_logs_before.txt
  • After restart: nats_logs_after.txt

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (19 by maintainers)

Most upvoted comments

Since upgrading to 2.9.5 it has all been working perfectly, thanks again! (We’ve already upgraded to 2.9.6, and it’s still working like a charm.)

I can confirm that we are seeing this issue as well. Thanks for reporting it @MauriceVanVeen and for fixing it @kozlovic.

Ok, I have a server test case that reproduces the issue. I will discuss with @derekcollison next week to see what the next steps are. Thank you for your patience.

FYI: I was able to reproduce the “first sequence mismatch” that results in resetting the cluster state using manual tests. With debugging, I think I understand what is going on and will later try to write a unit test that reproduces it. I am not sure there is much we can do to prevent it, but we will discuss internally with @derekcollison to see if there is a way to mitigate it.

I am trying to write a server test that would reproduce the issue. This error means that the snapshot the node receives has a first sequence that is lower than the first sequence in its store, so the node needs to clear the state. Suppose leader has a snapshot that says first sequence is 100 and last is 1000, but the follower that receives this snapshot has its first sequence being 150, then it assumes that it is missing messages 100 to 149 and due to file store being a message log (append-only), then it needs to wipe it states and sync with the leader. My suspicion is because of message TTL, there is a race where this snapshot is actually no longer valid in that the follower may have already discarded “old” messages? So far I have not been able to write a “simple” server test that demonstrates the issue. I will update when I am able to do so.