nats-server: JetStream Cluster becomes inconsistent: catchup for stream stalled

Defect

Versions of nats-server and affected client libraries used:

nats:2.6.2-alpine

OS/Container environment:

k8s

Context

  • I don’t know what triggers this problem, but it happens every few hours, or at most every few days.
  • The production system this was captured on has a message rate of about 1/s, with occasional bursts of maybe 5/s every few hours (at least according to Grafana).
  • The cluster consists of three nodes, and the stream has 3 replicas.
  • After hours or days, one of the nodes starts emitting Catchup for stream '$G > corejobs' stalled.
  • All consumers for which this node is the leader stop receiving messages.
  • I dumped the JetStream data directory so you can take a look at the current state (a quick inspection sketch follows this list).
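
For reference, a minimal sketch of how the divergence can be inspected with the nats CLI (the stream name corejobs comes from the warning above; exact output columns vary by CLI version):

```
# Per-server JetStream overview: shows the meta leader and what each node reports
nats server report jetstream

# Per-stream overview: replica list and reported lag for 'corejobs'
nats stream report
```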

Steps or code to reproduce the issue:

  • Start the cluster with the data and config provided here.
  • Make sure nats-cluster-2 is not the stream leader (use nats stream cluster step-down if necessary).
  • Send a message to the stream: nats pub core.jobs.notifications.SendPush foobar
  • Observe the logs of nats-cluster-2, which now emits the catchup warnings.

Additionally: check the message count in the stream; when nats-cluster-2 is the leader, it differs from the count seen when nats-cluster-0 or nats-cluster-1 is the leader.
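
A sketch of that comparison with the nats CLI, assuming the node names above (standard commands, not re-verified against this exact setup):

```
# Note the "Messages" count and the current cluster leader
nats stream info corejobs

# Move leadership to another replica, then compare the count again
nats stream cluster step-down corejobs
nats stream info corejobs
```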

Expected result:

Either:

  1. All three nodes have the same data.
  2. The node which cannot catch up is marked as faulty and steps down from all leader positions.

Also: JetStream should always be able to recover from inconsistent states (especially if there is still a majority of healthy nodes around).

Actual result:

  • The cluster thinks it’s healthy.
  • One node doesn’t receive any data anymore.

Additionally: depending on which node becomes leader, the reported number of messages varies (obviously, since the stalled node is no longer syncing): Node0: 310 messages, Node1: 310 messages, Node2: 81 messages.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 17 (10 by maintainers)

Most upvoted comments

I am happy with the results: the dataset you sent us could be successfully repaired, even with expiration of msgs disabled. And of course the original bug is fixed as well.

Should merge today and we will cut a release.

/cc @kozlovic

2.6.3 has been released with this fix. Just upgrade the server and the system will repair and stabilize; no other action items are needed from you.
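
A quick way to confirm the rollout took effect, as a sketch (output format depends on the nats CLI version in use):

```
# List the servers in the cluster with their reported versions
nats server list

# The stream should report all three replicas as current again
nats stream info corejobs
```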

Thanks again.

OK, I found the source of the bug. I will fix that, but also try to make it so the cluster can recover from this state if possible.

Meaning the bug led to a bad state and then additional corruption. I fixed the source of the bug, but will also make sure we can recover your corrupted state properly.

I am digging into this today, will keep everyone updated.

In our cluster I use the official Helm chart. So a rollout is a normal rolling shutdown / wait / start / wait-for-ready / continue cycle.

The state I attached to the ticket didn’t involve a restart / update though.
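
For context, a minimal sketch of the rolling upgrade described above using the official Helm chart; the release name nats and the nats.image value are assumptions that depend on the chart version and how it was installed:

```
# Hypothetical release/value names; adjust to your installation
helm repo update
helm upgrade nats nats/nats --set nats.image=nats:2.6.3-alpine

# The chart deploys a StatefulSet, so pods are replaced one at a time
kubectl rollout status statefulset/nats
```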