nats-server: JetStream Cluster becomes inconsistent: catchup for stream stalled
Defect
Make sure that these boxes are checked before submitting your issue – thank you!
- Included nats-server -DV output
- Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)
Versions of nats-server and affected client libraries used:
nats:2.6.2-alpine
OS/Container environment:
k8s
Context
- I don’t know how this problem arises, but it happens every few hours, or a few days at most.
- The production system this was captured on has a message rate of about 1 msg/s, with occasional bursts of maybe 5 msg/s every few hours (at least according to Grafana).
- The cluster consists of three nodes, and the stream has 3 replicas.
- After hours or days, one of the nodes starts emitting: Catchup for stream '$G > corejobs' stalled (see the inspection commands after this list).
- All consumers for which this node is the leader stop receiving messages.
- I dumped the JetStream data directory so you can take a look at the current state.
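For context, the replica state can be inspected while the warning is firing, roughly like this (a sketch; corejobs is the stream name from the logs above, credentials/context flags are omitted, and nats server report jetstream requires a system-account context):

    # Show the stream's current leader and the state of each replica (current vs. lagging)
    nats stream info corejobs

    # Cluster-wide JetStream overview: leaders, replicas and lag per server
    nats server report jetstream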
Steps or code to reproduce the issue:
- Start the cluster with the data and config provided here.
- Make sure nats-cluster-2 is not the leader (use nats stream cluster step-down if necessary).
- Send a message to the stream: nats pub core.jobs.notifications.SendPush foobar
- Observe the logs of nats-cluster-2, which now emits the stall warnings; the full command sequence is sketched below.
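Consolidated, the steps above look roughly like this (a sketch; the pod names are assumed to match the node names, and context/credentials flags are omitted):

    # Move stream leadership away from nats-cluster-2 if it currently leads
    nats stream cluster step-down corejobs

    # Publish a test message into the stream's subject space
    nats pub core.jobs.notifications.SendPush foobar

    # Watch the stalled node's logs for the catchup warning (add -c nats if the pod has multiple containers)
    kubectl logs -f nats-cluster-2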
Additionally: check the message count in the stream; when nats-cluster-2 is the leader, it differs from the count reported when nats-cluster-0 or nats-cluster-1 is the leader (see the check sketched below).
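One way to compare the counts is to read the stream state from the current leader, force a step-down, and read it again (a sketch):

    # Message count as reported by the current leader
    nats stream info corejobs | grep -i messages

    # Force a new leader election, then re-check the count
    nats stream cluster step-down corejobs
    nats stream info corejobs | grep -i messages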
Expected result:
Either:
- All three nodes have the same data.
- The node which can’t catch up is marked as faulty and steps down from all leader positions (a manual step-down is sketched below).
Also: JetStream should always be able to recover from inconsistent states (especially if there is still a majority of healthy nodes around).
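As a stopgap, leadership can be forced off the stalled node by hand (a sketch, assuming a recent natscli; <consumer> is a placeholder for an affected consumer, and the step-down may need repeating since the caller does not choose the new leader):

    # Move stream leadership to another replica
    nats stream cluster step-down corejobs

    # Move leadership of an affected consumer as well
    nats consumer cluster step-down corejobs <consumer>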
Actual result:
- The cluster thinks it’s healthy.
- One node doesn’t receive any data anymore.
Additionally: depending on which node becomes leader, the reported number of messages varies (unsurprisingly, since the stalled node is no longer syncing): Node0: 310 messages, Node1: 310 messages, Node2: 81 messages.
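The differing per-node counts can also be seen without moving leadership, by querying each server's monitoring endpoint for its local JetStream state (a sketch; assumes monitoring is enabled on port 8222 and busybox wget is available in the container):

    # Each server's own view of its streams, including local message counts
    for pod in nats-cluster-0 nats-cluster-1 nats-cluster-2; do
      echo "== $pod =="
      kubectl exec "$pod" -- wget -qO- 'http://localhost:8222/jsz?streams=true'
    done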
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (10 by maintainers)
Commits related to this issue
- Fix for #2642 There was a bug that would erase the sync subject for upper level catchup for streams. Raft layer repair was ok but if that was compacted it gets kicked up to the upper layers which wou... — committed to nats-io/nats-server by derekcollison 3 years ago
- Fix for #2642 There was a bug that would erase the sync subject for upper level catchup for streams. Raft layer repair was ok but if that was compacted it gets kicked up to the upper layers which wou... — committed to nats-io/nats-server by derekcollison 3 years ago
- Merge pull request #2648 from nats-io/issue-2642 [FIXED] #2642 — committed to nats-io/nats-server by derekcollison 3 years ago
I am happy with the results: the dataset you sent us, even with expiration of msgs disabled, could be successfully repaired. And of course the original bug is fixed as well.
Should merge today and we will cut a release.
/cc @kozlovic
2.6.3 has been released with this fix. Just upgrade the server and the system will repair and stabilize; no other action items needed from you.
Thanks again.
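For reference, with the official Helm chart the upgrade is roughly the following (a sketch; the release name nats, the nats.image values key, and the StatefulSet name are assumptions that depend on the chart version):

    # Point the release at the fixed image and roll the pods
    helm repo update
    helm upgrade nats nats/nats --reuse-values --set nats.image=nats:2.6.3-alpine

    # Wait for the rolling restart to finish
    kubectl rollout status statefulset/nats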
OK, I found the source of the bug. Will fix that, but also try to make it so the cluster can recover from it if possible.
Meaning the bug led to a bad state and then additional corruption. I fixed the source of the bug but will also make sure we can recover your corrupt state properly.
I am digging into this today, will keep everyone updated.
In our cluster I use the official Helm chart. So a rollout is a normal rolling shutdown / wait / start / wait-for-ready / continue cycle.
The state I attached to the ticket didn’t involve a restart / update though.