nats-server: [JetStream] Stream and consumer went out of sync after rolling restart of NATS servers
Observed behavior
In a Kubernetes StatefulSet deployment of a NATS cluster, I have a simple 3-replica interest-based stream with a single 3-replica consumer (roughly as sketched below).
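For context, the stream and consumer are roughly equivalent to this jnats sketch (the stream/consumer names, subject, and ack settings here are placeholders, not my exact production configuration):

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.AckPolicy;
import io.nats.client.api.ConsumerConfiguration;
import io.nats.client.api.RetentionPolicy;
import io.nats.client.api.StorageType;
import io.nats.client.api.StreamConfiguration;

public class Setup {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://nats:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // 3-replica interest-based stream (placeholder name and subject)
            jsm.addStream(StreamConfiguration.builder()
                    .name("EVENTS")
                    .subjects("events.>")
                    .retentionPolicy(RetentionPolicy.Interest)
                    .storageType(StorageType.File)
                    .replicas(3)
                    .build());

            // single durable consumer, also R3
            jsm.addOrUpdateConsumer("EVENTS", ConsumerConfiguration.builder()
                    .durable("events-worker")
                    .ackPolicy(AckPolicy.Explicit)
                    .numReplicas(3)
                    .build());
        }
    }
}
```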
After a rolling update deployment that upgrades the NATS cluster from v2.10.6 to v2.10.7, the stream and consumer went into an unrecoverable bad state:
- messages published to the subject (stream) are accepted but immediately dropped, as if the stream doesn't have any consumer.
- the stream sequence numbers went out of sync between the 3 nodes. `nats-0`, which was the stream leader at the time, got its stream seq numbers reset to 0, while `nats-1` and `nats-2` kept the previous stream seq numbers (~23K).
  - when new messages come in, only `nats-0` sees its seq numbers increasing. `nats-1` and `nats-2` have seq numbers that stay stuck.
- the seq numbers at the consumer remained at ~23K (see the inspection sketch after this list).
  - when new messages come in, the consumer seq numbers aren't moving.
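For reference, a minimal jnats sketch showing where these stream and consumer sequence numbers come from (using the placeholder names from the sketch above):

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.ConsumerInfo;
import io.nats.client.api.Replica;
import io.nats.client.api.StreamInfo;

public class Inspect {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://nats:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // Stream state as reported by the current stream leader
            StreamInfo si = jsm.getStreamInfo("EVENTS");
            System.out.println("stream last_seq = " + si.getStreamState().getLastSequence());
            System.out.println("stream leader   = " + si.getClusterInfo().getLeader());
            for (Replica r : si.getClusterInfo().getReplicas()) {
                // non-current replicas or a large lag indicate the peers are out of sync
                System.out.printf("  replica %s current=%s lag=%d%n",
                        r.getName(), r.isCurrent(), r.getLag());
            }

            // Consumer delivery/ack position relative to the stream sequence
            ConsumerInfo ci = jsm.getConsumerInfo("EVENTS", "events-worker");
            System.out.println("delivered stream_seq = " + ci.getDelivered().getStreamSequence());
            System.out.println("ack floor stream_seq = " + ci.getAckFloor().getStreamSequence());
            System.out.println("num pending          = " + ci.getNumPending());
        }
    }
}
```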
I've made several attempts to fix this:
- Rolling restart the NATS servers
  - The restart brought the stream seq numbers back in sync (all servers dropped back to single digits, no more servers at ~23K), but the consumer seq numbers were still stuck at ~23K. Messages were still dropped by the stream and never delivered to the consumer.
- Rolling restart my consumer application
  - no effect
- Delete and recreate the consumer
  - the consumer seq numbers were finally reset and now move as messages come in. My consumer application was finally able to receive messages (a sketch of this workaround follows this list).
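The delete-and-recreate step corresponds roughly to this jnats sketch (same placeholder names as above):

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.AckPolicy;
import io.nats.client.api.ConsumerConfiguration;

public class RecreateConsumer {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://nats:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // Drop the stuck durable consumer...
            jsm.deleteConsumer("EVENTS", "events-worker");

            // ...and recreate it with the same configuration; its sequence
            // numbers start over and delivery resumes.
            jsm.addOrUpdateConsumer("EVENTS", ConsumerConfiguration.builder()
                    .durable("events-worker")
                    .ackPolicy(AckPolicy.Explicit)
                    .numReplicas(3)
                    .build());
        }
    }
}
```

Note this is only a workaround: the messages that were dropped while the stream and consumer were out of sync are not recovered.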
Expected behavior
Throughout the rolling update and version upgrade:
- Stream sequence number is in sync between replicas
- Consumer does not go out of sync with Stream
- Messages published to Stream are delivered to Consumers
Server and client version
Server: upgraded from v2.10.6 to v2.10.7
Client: jnats (java) 2.17.1
Host environment
- Kubernetes deployment with official helm chart
- 3 Replicas
- Ephemeral storage (`emptyDir`)
Steps to reproduce
No response
This sounds very similar to the problem we have: https://github.com/nats-io/nats-server/issues/4351
I think the impact of this issue is concerning, and it would be great if it could be addressed. I'm happy to provide all the deployment setup for troubleshooting. I'm not confident this issue can be reproduced, though, as I feel it's a race condition: something like the leader being elected on a new node before it's in sync with the cluster.
Please let me know, thanks.