nats-server: Resetting WAL state errors
Defect
NATS server logs are filled with warnings such as the one below, and I’m unable to find any documentation about what causes it:
[37] [WRN] RAFT [cnrtt3eg - C-R3F-7al3bC6S] Resetting WAL state
Versions of nats-server and affected client libraries used:
- NATS server version: 2.9.15
- NATS client: Golang library github.com/nats-io/nats.go v1.19.0
OS/Container environment:
Linux containers running on a K8s cluster using a 3-replica StatefulSet (pods named nats-0, nats-1 & nats-2). Each replica has its own PVC.
NATS servers are running in clustered mode with JetStream enabled. The configuration can be found below:
server_name: $NATS_SERVER_NAME
listen: 0.0.0.0:4222
http: 0.0.0.0:8222
# Logging options
debug: false
trace: false
logtime: false
# Some system overrides
max_connections: 10000
max_payload: 65536
# Clustering definition
cluster {
  name: "nats"
  listen: 0.0.0.0:6222
  routes = [
    nats://nats:6222
  ]
}
# JetStream configuration
jetstream: enabled
jetstream {
  store_dir: /data/jetstream
  max_memory_store: 3G
  max_file_store: 20G
}
There are 4 different streams configured in the cluster with ~50 subjects on each stream. Streams configuration:
Replicas: 3
Storage: File
Options:
  Retention: WorkQueue
  Acknowledgements: true
  Discard Policy: New
  Duplicate Window: 12h0m0s
  Allows Msg Delete: true
  Allows Purge: true
  Allows Rollups: false
Limits:
  Maximum Messages: unlimited
  Maximum Per Subject: unlimited
  Maximum Bytes: unlimited
  Maximum Age: 12h0m0s
  Maximum Message Size: 1.0 KiB
  Maximum Consumers: unlimited
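For reference, a stream with these settings could be created from the Go client roughly as follows; this is only a sketch, and the stream name WORK and subject filter work.> are placeholders rather than the actual streams in this cluster:

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Stream name and subjects are placeholders; the limits mirror the
	// settings listed above.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:       "WORK",
		Subjects:   []string{"work.>"},
		Retention:  nats.WorkQueuePolicy,
		Storage:    nats.FileStorage,
		Replicas:   3,
		Discard:    nats.DiscardNew,
		MaxAge:     12 * time.Hour,
		Duplicates: 12 * time.Hour,
		MaxMsgSize: 1024, // 1.0 KiB
	})
	if err != nil {
		log.Fatal(err)
	}
}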
Steps or code to reproduce the issue:
The issue seems to start after one of the NATS servers gets restarted and, once it happens, it doesn’t stop (I can see ~20K log lines like this one in the last 12 hours, for instance).
Expected result:
The system should tolerate the loss of one NATS server, according to the JetStream documentation, given we’re using a replication factor of 3.
Actual result:
Some streams are totally unusable when this happens (publishers can’t add new messages & subscribers don’t receive new messages), while other streams seem to work as expected.
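One way to surface the publish stall from the Go client is to bound the wait for the stream’s publish acknowledgement with a context; a sketch, assuming a placeholder subject on one of the affected streams:

package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Bound how long we wait for the stream's publish ack; when the stream
	// has no functional raft leader, the ack never arrives and the context expires.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if _, err := js.Publish("work.task", []byte("payload"), nats.Context(ctx)); err != nil {
		log.Printf("publish failed (stream possibly leaderless): %v", err)
	}
}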
About this issue
- State: open
- Created a year ago
- Reactions: 1
- Comments: 15 (8 by maintainers)
Feel free to test the RC.7 candidate for 2.9.16; it is under Synadia’s Docker Hub for nats-server.
In general, for rolling updates we suggest putting a server into lame duck mode; once it has shut down, restart it and wait for /healthz to return ok, then move to the next server. We have made many improvements in this area in the upcoming 2.9.16 release. It should be released next week.
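A minimal Go sketch of the “wait for /healthz” step; the hostname is hypothetical and the monitoring port 8222 comes from the configuration in the issue:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// waitHealthy polls the monitoring endpoint until it returns HTTP 200,
// e.g. http://nats-0.nats:8222/healthz after the pod restarts.
func waitHealthy(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("%s not healthy after %s", url, timeout)
}

func main() {
	// Hostname is an assumption based on the StatefulSet described in the issue.
	if err := waitHealthy("http://nats-0.nats:8222/healthz", 5*time.Minute); err != nil {
		log.Fatal(err)
	}
	log.Println("server healthy, safe to restart the next replica")
}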
In the meantime, you have two general approaches to repairing the raft layer. One is to inspect /jsz?acc=YOUR_ACCOUNT&consumers=true&raft=true and then issue nats consumer cluster step-down <stream> <consumer>. The other is to scale the stream down to a single replica with nats stream update --replicas=1 <stream>.
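The replica scale-down can also be done programmatically with the Go client’s UpdateStream; a sketch with a placeholder stream name, scaling back up to 3 afterwards (the comment above only shows the scale-down command):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the current config and drop to a single replica
	// (mirrors `nats stream update --replicas=1 <stream>`).
	info, err := js.StreamInfo("WORK")
	if err != nil {
		log.Fatal(err)
	}
	cfg := info.Config
	cfg.Replicas = 1
	if _, err := js.UpdateStream(&cfg); err != nil {
		log.Fatal(err)
	}

	// Once the stream is healthy again, scale back to the original replica count.
	cfg.Replicas = 3
	if _, err := js.UpdateStream(&cfg); err != nil {
		log.Fatal(err)
	}
}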