nats-server: Resetting WAL state errors

Defect

The NATS server logs are filled with warnings such as the one below, and I'm unable to find any documentation about what causes this:

[37] [WRN] RAFT [cnrtt3eg - C-R3F-7al3bC6S] Resetting WAL state

Versions of nats-server and affected client libraries used:

  • NATS server version: 2.9.15
  • NATS client (Golang library: github.com/nats-io/nats.go v1.19.0)

OS/Container environment:

Linux containers running on a K8s cluster as a 3-replica StatefulSet (pods named nats-0, nats-1 and nats-2). Each replica has its own PVC.

NATS servers running in clustered mode with JetStream enabled. The configuration can be found below:

server_name: $NATS_SERVER_NAME
listen: 0.0.0.0:4222
http: 0.0.0.0:8222

# Logging options
debug: false
trace: false
logtime: false

# Some system overrides
max_connections: 10000
max_payload: 65536

# Clustering definition
cluster {
  name: "nats"
  listen: 0.0.0.0:6222
  routes = [
    nats://nats:6222
  ]
}

# JetStream configuration
jetstream: enabled
jetstream {
  store_dir: /data/jetstream
  max_memory_store: 3G
  max_file_store: 20G
}

There are 4 different streams configured in the cluster, with ~50 subjects on each stream. Stream configuration:

             Replicas: 3
              Storage: File

Options:

            Retention: WorkQueue
     Acknowledgements: true
       Discard Policy: New
     Duplicate Window: 12h0m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 12h0m0s
 Maximum Message Size: 1.0 KiB
    Maximum Consumers: unlimited

Steps or code to reproduce the issue:

The issue seems to start after one of the NATS servers gets restarted and, once it happens, it doesn't stop (I can see 20K log lines like this one in the last 12 hours, for instance).

Expected result:

The system should tolerate the loss of one NATS server, as described in the JetStream documentation, given that we're using a replication factor of 3.

Actual result:

Some streams are totally unusable when this happens (publishers can't add new messages and subscribers don't receive new messages), while other streams seem to be working as expected.
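
To show how the symptom looks from the publisher side, here is a minimal sketch with nats.go that bounds a synchronous JetStream publish with a context, so a stuck stream surfaces as an error instead of hanging; the subject name is a placeholder:

package main

import (
    "context"
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222") // service name assumed
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Bound the synchronous publish so a stream that never acks shows up
    // as an error rather than blocking forever.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    ack, err := js.Publish("orders.created", []byte("payload"), nats.Context(ctx))
    if err != nil {
        // On the affected streams no PubAck comes back and this branch is hit.
        log.Printf("publish failed: %v", err)
        return
    }
    log.Printf("stored in stream %q at seq %d", ack.Stream, ack.Sequence)
}

On the healthy streams this prints the PubAck; on the affected ones the publish call errors out and subscribers see nothing new.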

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Feel free to test the RC.7 candidate for 2.9.16; it is under Synadia's Docker Hub for nats-server.

In general, for rolling updates we suggest lame-ducking a server; once it has shut down, restart it and wait for /healthz to return OK, then move on to the next server.
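
As a rough sketch of that wait step, the loop below polls each server's /healthz monitoring endpoint (port 8222 from the config above) before moving on; the pod DNS names and timeouts are illustrative only:

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// waitHealthy polls a server's /healthz monitoring endpoint until it reports OK,
// so a rolling update only moves on once the restarted node is ready again.
func waitHealthy(monitorURL string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(monitorURL + "/healthz")
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil
            }
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("server at %s not healthy after %s", monitorURL, timeout)
}

func main() {
    // Pod DNS names are illustrative for a statefulset like nats-0/1/2.
    for _, pod := range []string{"nats-0.nats", "nats-1.nats", "nats-2.nats"} {
        // ... lame-duck and restart this pod here, then wait before the next one.
        if err := waitHealthy("http://"+pod+":8222", 5*time.Minute); err != nil {
            log.Fatal(err)
        }
        log.Printf("%s healthy, moving to the next server", pod)
    }
}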

We have made many improvements in the upcoming 2.9.16 release in this area. Should be released next week.

In the meantime, you have two general approaches to repairing the raft layer.

  1. Have the leader of the asset in question step down. You can track down the mapping from the raft-layer name to the named asset using /jsz?acc=YOUR_ACCOUNT&consumers=true&raft=true. Then issue nats consumer cluster step-down <stream> <consumer>.
  2. Scale the owning stream down to 1 replica and back up to the previous replica count: nats stream update --replicas=1 <stream>, then repeat with the original replica count once the scale-down has taken effect. A scripted version of this is sketched after this list.
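
A sketch of approach 2 using the JetStream manager API in nats.go; the stream name is a placeholder and the target replica count of 3 matches the streams in this report:

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    const stream = "ORDERS" // placeholder stream name

    info, err := js.StreamInfo(stream)
    if err != nil {
        log.Fatal(err)
    }

    // Scale down to a single replica...
    cfg := info.Config
    cfg.Replicas = 1
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }

    // ...give the cluster a moment to apply the change (crude placeholder wait),
    // then scale back up so the replicas are rebuilt.
    time.Sleep(5 * time.Second)
    cfg.Replicas = 3
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }
}

This is equivalent to running the two nats stream update commands by hand.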