nats-server: JetStream first sequence mismatch error keeps occurring on every restart/deployment
Defect
Make sure that these boxes are checked before submitting your issue – thank you!
- Included `nats-server -DV` output
- Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

*Note: trace logging is not enabled, since reproducing this issue requires constant messaging (both adds and deletes), which spams the logged output. Debug logging is included.*
Versions of `nats-server` and affected client libraries used:
- nats-server 2.9.2, running `nats:2.9.2-alpine`
- nats-box, running `natsio/nats-box:0.13.0` (no changes made to the Helm chart)
OS/Container environment:
- AWS EKS cluster
- NATS instances running on separate `c6i.xlarge` machines, clustered with 3 nodes
- nats-box on a separate machine (in our setup on an `r5.large`)
- Using Helm with https://github.com/nats-io/k8s, with minimal config:
cluster:
  enabled: true
  replicas: 3

nats:
  image: nats:2.9.2-alpine
  logging:
    debug: true
  jetstream:
    enabled: true
    memStorage:
      enabled: true
      size: 5Gi
    fileStorage:
      enabled: true
      size: 30Gi
      storageClassName: gp3-retained
  resources:
    requests:
      cpu: 3
      memory: 6Gi
    limits:
      # cpu: 4
      memory: 8Gi
  # Sets GOMEMLIMIT environment variable which makes the Go GC be aware of memory limits
  # from the container. Recommended to be set to about 90% of the resource memory limits.
  gomemlimit: 7GiB
*Note: where `gp3-retained` is a Kubernetes StorageClass using `gp3` disks and the `Retain` reclaim policy.*
Steps or code to reproduce the issue:
- Setup a file-based R3 stream with a TTL (a TTL of 1 minute in this example, just so there is no need to wait for a long time; in our real setup it's a 1 hour TTL). A rough client-side equivalent is sketched after the log output below:
nats str add benchstream --storage=file --replicas=3 --retention=limits --max-age=1m --max-bytes=-1 --max-consumers=-1 --max-msg-size=-1 --max-msgs=-1 --max-msgs-per-subject=-1 --discard=old --dupe-window=1m --subjects=js.bench --allow-rollup --deny-delete --no-deny-purge
- Start the benchmark:
nats bench js.bench --pub=1 --msgs=10000000 --pubsleep=0.00001s --size=1KB --pubbatch=1000
- The `benchstream` stream persists messages for 1 minute, after which it also starts removing messages. The benchmark should keep running for a long time. After at least a minute has passed, remove 1 instance (not sure if it makes a difference whether it is the leader or not; this was tested by restarting a non-leader instance).
- While the instance is restarting, the following messages will be displayed:
[7] 2022/10/03 13:43:30.411232 [WRN] Error applying entries to '$G > benchstream': first sequence mismatch
[7] 2022/10/03 13:43:30.414851 [WRN] Resetting stream cluster state for '$G > benchstream'
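For reference, this is roughly what the stream setup and the constant publishing look like through the Go client (nats.go). It is a minimal sketch of an equivalent setup, not part of the original reproduction (which used the `nats` CLI commands above); the connection URL and error handling are assumptions.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Assumes a locally reachable NATS cluster; adjust the URL for your deployment.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// File-based R3 stream with a 1 minute TTL, mirroring the CLI flags above.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:              "benchstream",
		Subjects:          []string{"js.bench"},
		Storage:           nats.FileStorage,
		Replicas:          3,
		Retention:         nats.LimitsPolicy,
		Discard:           nats.DiscardOld,
		MaxAge:            time.Minute,
		Duplicates:        time.Minute,
		MaxBytes:          -1,
		MaxConsumers:      -1,
		MaxMsgSize:        -1,
		MaxMsgs:           -1,
		MaxMsgsPerSubject: -1,
		AllowRollup:       true,
		DenyDelete:        true,
	}); err != nil {
		log.Fatal(err)
	}

	// Keep publishing 1KB messages so that messages are constantly being added
	// and, once the TTL kicks in, constantly being expired as well.
	payload := make([]byte, 1024)
	for {
		if _, err := js.Publish("js.bench", payload); err != nil {
			log.Printf("publish error: %v", err)
		}
	}
}
```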
Expected result:
The restarted instance should only need to catch up on the messages that it missed during its downtime.
In this example a TTL of 1 minute is used; in our setup it's a 1 hour TTL, but with the same result.
Actual result:
The stream state is lost for this replica and needs to be reconstructed, which makes the restart/deployment take more time.
Files:
- Before restart: nats_logs_before.txt
- After restart: nats_logs_after.txt
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (19 by maintainers)
Commits related to this issue
- [FIXED] JetStream: unresponsiveness while creating raft group Originally, createRaftGroup() would not hold the jetstream's lock for the whole duration. But some race reports made us change this funct... — committed to nats-io/nats-server by kozlovic 2 years ago
- [FIXED] JetStream: "first sequence mismatch" error on catchup with message expiration When a server was restarted and expired messages, but the leader had a snapshot that still had the old messages w... — committed to nats-io/nats-server by kozlovic 2 years ago
Since upgrading to 2.9.5 it all started working perfectly, thanks again! (we’ve upgraded to 2.9.6 already, still working like a charm)
I can confirm that we are seeing this issue as well. Thanks for reporting it @MauriceVanVeen and for fixing it @kozlovic.
Ok, I have a server test case that reproduces the issue. I will discuss with @derekcollison next week to see what the next steps are. Thank you for your patience.
FYI: I was able to reproduce the “first sequence mismatch” (which results in resetting the cluster state) using manual tests. With debugging, I think I understand what is going on and will later try to write a unit test that reproduces it. I am not sure there is much we can do to prevent it, but we will discuss internally with @derekcollison to see if there is a way to mitigate it.
I am trying to write a server test that would reproduce the issue. This error means that the snapshot the node receives has a first sequence that is lower than the first sequence in its store, so the node needs to clear its state. Suppose the leader has a snapshot that says the first sequence is 100 and the last is 1000, but the follower that receives this snapshot has its first sequence at 150; the follower then assumes it is missing messages 100 to 149, and since the file store is an append-only message log, it needs to wipe its state and sync with the leader. My suspicion is that, because of the message TTL, there is a race where this snapshot is actually no longer valid, in that the follower may have already discarded “old” messages. So far I have not been able to write a “simple” server test that demonstrates the issue. I will update when I am able to do so.
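To make the sequence comparison above concrete, here is a hypothetical sketch of the decision the follower makes when it processes the leader's snapshot. The function and variable names are illustrative only; this is not the actual nats-server code.

```go
// needsReset illustrates the catch-up decision described above (hypothetical,
// not actual nats-server internals). The leader's snapshot advertises the
// range [snapFirstSeq, snapLastSeq]; the follower compares snapFirstSeq
// against the first sequence still present in its own file store.
func needsReset(snapFirstSeq, storeFirstSeq uint64) bool {
	// If the snapshot starts below the follower's own first sequence
	// (e.g. the snapshot says 100..1000 but the follower's store starts at 150
	// because it already expired 100..149 via the stream's MaxAge), the
	// follower believes it is missing messages it can no longer append in
	// order, since the file store is an append-only log. It then wipes its
	// state and resyncs from the leader ("Resetting stream cluster state").
	return snapFirstSeq < storeFirstSeq
}
```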