nats-server: [JetStream] Stream and consumer went out of sync after rolling restart of NATS servers
Observed behavior
In a Kubernetes StatefulSet deployment of a NATS cluster, I have a simple 3-replica interest-based stream with a single 3-replica consumer (roughly as sketched below).
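For context, the stream and consumer are roughly equivalent to this jnats sketch (the stream/consumer names, subject, and ack settings here are placeholders, not my exact production configuration):

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.AckPolicy;
import io.nats.client.api.ConsumerConfiguration;
import io.nats.client.api.RetentionPolicy;
import io.nats.client.api.StorageType;
import io.nats.client.api.StreamConfiguration;

public class Setup {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://nats:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // 3-replica interest-based stream (placeholder name and subject)
            jsm.addStream(StreamConfiguration.builder()
                    .name("EVENTS")
                    .subjects("events.>")
                    .retentionPolicy(RetentionPolicy.Interest)
                    .storageType(StorageType.File)
                    .replicas(3)
                    .build());

            // single durable consumer, also R3
            jsm.addOrUpdateConsumer("EVENTS", ConsumerConfiguration.builder()
                    .durable("events-worker")
                    .ackPolicy(AckPolicy.Explicit)
                    .numReplicas(3)
                    .build());
        }
    }
}
```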
After a rolling update deployment that upgrades the NATS cluster from v2.10.6 to v2.10.7, the stream and consumer went into an unrecoverable bad state:
- messages published to the subject (stream) are accepted but immediately dropped, as if the stream doesn't have any consumer.
- the stream sequence numbers went out of sync between the 3 nodes. `nats-0`, which was the stream leader at the time, got its stream seq numbers reset to 0, while `nats-1` and `nats-2` kept the previous stream seq numbers (~23K).
  - when new messages come in, only `nats-0` sees its seq numbers increasing. `nats-1` and `nats-2` have seq numbers that stay stuck.
- the seq numbers at the consumer remained at ~23K (see the inspection sketch after this list).
  - when new messages come in, the consumer seq numbers aren't moving.
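For reference, a minimal jnats sketch showing where these stream and consumer sequence numbers come from (using the placeholder names from the sketch above):

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.ConsumerInfo;
import io.nats.client.api.Replica;
import io.nats.client.api.StreamInfo;

public class Inspect {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://nats:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // Stream state as reported by the current stream leader
            StreamInfo si = jsm.getStreamInfo("EVENTS");
            System.out.println("stream last_seq = " + si.getStreamState().getLastSequence());
            System.out.println("stream leader   = " + si.getClusterInfo().getLeader());
            for (Replica r : si.getClusterInfo().getReplicas()) {
                // non-current replicas or a large lag indicate the peers are out of sync
                System.out.printf("  replica %s current=%s lag=%d%n",
                        r.getName(), r.isCurrent(), r.getLag());
            }

            // Consumer delivery/ack position relative to the stream sequence
            ConsumerInfo ci = jsm.getConsumerInfo("EVENTS", "events-worker");
            System.out.println("delivered stream_seq = " + ci.getDelivered().getStreamSequence());
            System.out.println("ack floor stream_seq = " + ci.getAckFloor().getStreamSequence());
            System.out.println("num pending          = " + ci.getNumPending());
        }
    }
}
```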
I've made several attempts to fix this:
- Rolling restart the NATS servers
  - The restart brought the stream seq numbers back in sync (all servers dropped back to single digits, no more servers at ~23K), but the consumer seq numbers were still stuck at ~23K. Messages were still dropped by the stream and never delivered to the consumer.
- Rolling restart my consumer application
  - no effect
- Delete and recreate the consumer
  - the consumer seq numbers were finally reset and now move as messages come in. My consumer application was finally able to receive messages (a sketch of this workaround follows this list).
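The delete-and-recreate step corresponds roughly to this jnats sketch (same placeholder names as above):

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.AckPolicy;
import io.nats.client.api.ConsumerConfiguration;

public class RecreateConsumer {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://nats:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // Drop the stuck durable consumer...
            jsm.deleteConsumer("EVENTS", "events-worker");

            // ...and recreate it with the same configuration; its sequence
            // numbers start over and delivery resumes.
            jsm.addOrUpdateConsumer("EVENTS", ConsumerConfiguration.builder()
                    .durable("events-worker")
                    .ackPolicy(AckPolicy.Explicit)
                    .numReplicas(3)
                    .build());
        }
    }
}
```

Note this is only a workaround: the messages that were dropped while the stream and consumer were out of sync are not recovered.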
Expected behavior
Throughout the rolling update and version upgrade:
- Stream sequence number is in sync between replicas
- Consumer does not go out of sync with Stream
- Messages published to Stream are delivered to Consumers
Server and client version
Server: upgraded from v2.10.6 to v2.10.7
Client: jnats (java) 2.17.1
Host environment
- Kubernetes deployment with official helm chart
- 3 Replicas
- Ephemeral storage (`emptyDir`)
Steps to reproduce
No response
This sounds very similar to the problem we have: https://github.com/nats-io/nats-server/issues/4351
I think the impact of this issue is concerning, and it would be great if it could be addressed. I'm happy to provide all the deployment setup for troubleshooting. I'm not confident this issue can be reproduced, though, as I feel it's a race condition: something like the leader being elected on a new node before it's in sync with the cluster.
Please let me know, thanks.