nats-server: JetStream Cluster becomes inconsistent: catchup for stream stalled

Defect

Versions of nats-server and affected client libraries used:

nats:2.6.2-alpine

OS/Container environment:

k8s

Context

  • I don’t know what triggers this problem, but it happens every few hours, or at most every few days.
  • The production system this was captured on has a message rate of about 1/s, with occasional bursts of maybe 5/s every few hours (at least according to Grafana).
  • The cluster consists of three nodes, and the stream has 3 replicas.
  • After hours or days, one of the nodes starts emitting Catchup for stream '$G > corejobs' stalled.
  • All consumers for which this node is the leader stop receiving messages.
  • I dumped the JetStream data directory so you can take a look at the current state (a quick inspection sketch follows this list).
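
For reference, a minimal sketch of how the divergence can be inspected with the nats CLI (the stream name corejobs comes from the warning above; exact output columns vary by CLI version):

```
# Per-server JetStream overview: shows the meta leader and what each node reports
nats server report jetstream

# Per-stream overview: replica list and reported lag for 'corejobs'
nats stream report
```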

Steps or code to reproduce the issue:

  • Start the cluster with the data and config provided here.
  • Make sure nats-cluster-2 is not the stream leader (use nats stream cluster step-down if necessary).
  • Send a message to the stream: nats pub core.jobs.notifications.SendPush foobar
  • Observe the logs of nats-cluster-2, which now emits the catchup warnings.

Additionally: check the message count in the stream; when nats-cluster-2 is the leader, it differs from the count seen when nats-cluster-0 or nats-cluster-1 is the leader.
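
A sketch of that comparison with the nats CLI, assuming the node names above (standard commands, not re-verified against this exact setup):

```
# Note the "Messages" count and the current cluster leader
nats stream info corejobs

# Move leadership to another replica, then compare the count again
nats stream cluster step-down corejobs
nats stream info corejobs
```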

Expected result:

Either:

  1. All three nodes have the same data.
  2. The node which cannot catch up is marked as faulty and steps down from all leader positions.

Also: JetStream should always be able to recover from inconsistent states (especially if there is still a majority of healthy nodes around).

Actual result:

  • The cluster thinks it’s healthy.
  • One node doesn’t receive any data anymore.

Additionally: depending on which node becomes leader, the reported number of messages varies (obviously, since the stalled node is no longer syncing): Node0: 310 messages, Node1: 310 messages, Node2: 81 messages.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 17 (10 by maintainers)

Most upvoted comments

I am happy with the results: the dataset you sent us could be successfully repaired, even with expiration of msgs disabled. And of course the original bug is fixed as well.

Should merge today and we will cut a release.

/cc @kozlovic

2.6.3 has been released with this fix. Just upgrade the server and the system will repair and stabilize; no other action items are needed from you.
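
A quick way to confirm the rollout took effect, as a sketch (output format depends on the nats CLI version in use):

```
# List the servers in the cluster with their reported versions
nats server list

# The stream should report all three replicas as current again
nats stream info corejobs
```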

Thanks again.

OK, I found the source of the bug. I will fix that, but also try to make it so the cluster can recover from this state if possible.

Meaning the bug led to a bad state and then additional corruption. I fixed the source of the bug, but will also make sure we can recover your corrupted state properly.

I am digging into this today, will keep everyone updated.

In our cluster I use the official Helm chart. So a rollout is a normal rolling shutdown / wait / start / wait-for-ready / continue cycle.

The state I attached to the ticket didn’t involve a restart / update though.
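
For context, a minimal sketch of the rolling upgrade described above using the official Helm chart; the release name nats and the nats.image value are assumptions that depend on the chart version and how it was installed:

```
# Hypothetical release/value names; adjust to your installation
helm repo update
helm upgrade nats nats/nats --set nats.image=nats:2.6.3-alpine

# The chart deploys a StatefulSet, so pods are replaced one at a time
kubectl rollout status statefulset/nats
```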