nats-server: After many restarts due to OOM, nats panics: corrupt state file

Defect

On a nats cluster affected by https://github.com/nats-io/nats-server/issues/3517, after 1115 container restarts the nats server panics on start.

Versions of nats-server and affected client libraries used:

[56] 2022/10/10 11:11:13.961798 [INF] Starting nats-server
[56] 2022/10/10 11:11:13.961853 [INF] Version: 2.9.3-beta.1
[56] 2022/10/10 11:11:13.961859 [INF] Git: [cb086bce]
[56] 2022/10/10 11:11:13.961863 [DBG] Go build: go1.19.1

OS/Container environment:

AWS EKS

Steps or code to reproduce the issue:

Start a 3-node nats cluster with slow disks and create a JetStream stream with file storage and 3 replicas (a sketch of the stream setup is shown below). Push more data than the disks are able to consume, until nats' disk write cache causes the container to OOM. Once the cluster is stuck in an OOM loop, after a while the data on disk become corrupted and nats panics.
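For illustration, here is a minimal sketch of the stream setup described above, using the nats.go client; the cluster URLs and the subject space are assumptions, and the stream name is taken from the panic log further down:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Hypothetical URLs for the 3-node cluster.
	nc, err := nats.Connect("nats://nats-0:4222,nats://nats-1:4222,nats://nats-2:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// File-backed stream replicated across all 3 servers; publishing to it
	// faster than the slow disks can drain is what pushes the servers toward OOM.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "STX_SERVER_DATA",      // stream name as seen in the panic log
		Subjects: []string{"stx.data.>"}, // hypothetical subject space
		Storage:  nats.FileStorage,
		Replicas: 3,
	}); err != nil {
		log.Fatal(err)
	}
}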

Expected result:

nats starts and tries to recover from the last state

Actual result:

nats fails to start:

[142] 2022/10/10 11:00:45.333410 [DBG] Exiting consumer monitor for '$G > STX_SERVER_DATA > DB_SERVICE' [C-R3F-OuTvxCQ0]
panic: corrupt state file

goroutine 51 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyConsumerEntries(0xc00024c000, 0xc000262d80, 0x0?, 0x0)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3992 +0x816
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorConsumer(0xc00024c000, 0xc000262d80, 0xc000251680)
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3871 +0xdc6
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer.func1()
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3585 +0x25
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
    /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/server.go:3077 +0x85

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

@TomasVojacek Was there an error with the following pattern:

JetStream cluster could not decode consumer snapshot for '%s > %s > %s' [%s]

prior to the panic? We could try to get the files for this consumer (the error I mentioned above should give you the name of the consumer that panicked); the files would be under the datastore’s “system” account directory. Supposing the system account is named SYS and the consumer is named C-R3F-0BBfiMRm (as in my environment), this would be the directory to get the data from:

$ ls -lrt ~/dev/datastore/ds1/jetstream/SYS/_js_/C-R3F-0BBfiMRm/msgs/
total 16
-rw-r-----  1 ivan  staff  85 Oct 10 10:01 1.blk
-rw-r-----  1 ivan  staff  33 Oct 10 10:01 1.idx

We could try to see what type of corruption occurred.
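One quick way to check for truncation is to walk the consumer's msgs directory and report file sizes. This is only a rough sketch: the path mirrors the listing above, and the zero-size heuristic is an assumption, not the server's actual validation logic:

package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Path mirrors the listing above; adjust to your datastore layout.
	dir := filepath.Join(os.Getenv("HOME"), "dev/datastore/ds1/jetstream/SYS/_js_/C-R3F-0BBfiMRm/msgs")

	entries, err := os.ReadDir(dir)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			log.Fatal(err)
		}
		note := ""
		if info.Size() == 0 {
			// An empty .blk or .idx file is a strong hint that the OOM kill
			// interrupted a write and left the state truncated.
			note = "  <- empty, possibly truncated"
		}
		fmt.Printf("%-8s %6d bytes%s\n", e.Name(), info.Size(), note)
	}
}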

Assuming it is real corruption (meaning the file is “truncated”), then maybe the server should not panic in those cases and should instead try to reset the cluster in the hope of being able to recover from the remaining servers? (@derekcollison what do you think?)