nats-server: After many restarts due to OOM, nats panics: corrupt state file
Defect
On a nats cluster affected by https://github.com/nats-io/nats-server/issues/3517, after 1115 container restarts the nats server panics on start.
Versions of nats-server and affected client libraries used:
[56] 2022/10/10 11:11:13.961798 [INF] Starting nats-server
[56] 2022/10/10 11:11:13.961853 [INF] Version: 2.9.3-beta.1
[56] 2022/10/10 11:11:13.961859 [INF] Git: [cb086bce]
[56] 2022/10/10 11:11:13.961863 [DBG] Go build: go1.19.1
OS/Container environment:
AWS EKS
Steps or code to reproduce the issue:
Start a 3-node nats cluster with slow disks and create a JetStream stream with file storage and 3 replicas. Push more data than the disks are able to absorb, until the nats write cache causes the container to OOM. From that point the nats cluster is in an OOM loop; after a while the data on disk becomes corrupted and nats panics.
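For reference, a minimal sketch of the client side of this setup, assuming the nats.go client: it creates a file-backed stream with 3 replicas and publishes continuously so the server's write path falls behind on slow disks. The stream name is taken from the panic log below; the subject, payload size, and server URL are illustrative assumptions, not from the original report.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to any node of the 3-node cluster (URL is an assumption).
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// R3 stream with file storage, matching the setup described above.
	// Stream name comes from the panic log; the subject is hypothetical.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "STX_SERVER_DATA",
		Subjects: []string{"stx.data.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,
	}); err != nil {
		log.Fatal(err)
	}

	// Publish faster than slow disks can flush; on a memory-limited container
	// this grows the server's in-memory write buffers until it is OOM-killed.
	payload := make([]byte, 64*1024)
	for {
		if _, err := js.PublishAsync("stx.data.load", payload); err != nil {
			log.Fatal(err)
		}
	}
}
```

Run against a cluster whose containers have tight memory limits and slow volumes; this should eventually put the nodes into the OOM/restart loop described above.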
Expected result:
nats starts and tries to recover from the last known state.
Actual result:
nats fails to start:
[142] 2022/10/10 11:00:45.333410 [DBG] Exiting consumer monitor for '$G > STX_SERVER_DATA > DB_SERVICE' [C-R3F-OuTvxCQ0]
panic: corrupt state file

goroutine 51 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyConsumerEntries(0xc00024c000, 0xc000262d80, 0x0?, 0x0)
	/home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3992 +0x816
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorConsumer(0xc00024c000, 0xc000262d80, 0xc000251680)
	/home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3871 +0xdc6
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer.func1()
	/home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3585 +0x25
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
	/home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/server.go:3077 +0x85
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (14 by maintainers)
@TomasVojacek Was there an error with the following pattern:
prior to the panic? We could try to get the files for this consumer (from the error I mentioned above, you should get the name of the consumer that panicked); the files would be under the datastore's "system" account directory. Suppose the system account is named SYS and a consumer is named C-R3F-0BBfiMRm (in my env); this would be the directory to get the data from:
We could try to see what type of corruption occurred.
Assuming it is real corruption (meaning the file was truncated), then maybe the server should not panic in those cases and instead try to reset the cluster state in the hope of recovering from the remaining servers? (@derekcollison what do you think?)
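To make the suggestion concrete, here is a rough, self-contained sketch of that error-handling pattern in Go. It is not actual nats-server code; the file name, helper functions, and size check are all hypothetical. The idea is simply to treat a truncated/corrupt state file as a recoverable condition: discard the local state and resynchronize from healthy replicas instead of panicking.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errCorruptState = errors.New("corrupt state file")

// loadConsumerState stands in for reading and decoding an on-disk consumer
// state file ("consumer.state" is a hypothetical name).
func loadConsumerState(path string) ([]byte, error) {
	buf, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(buf) < 8 { // a truncated file would fail a minimal size/decode check
		return nil, errCorruptState
	}
	return buf, nil
}

func main() {
	state, err := loadConsumerState("consumer.state")
	if errors.Is(err, errCorruptState) {
		// Current behavior is effectively panic("corrupt state file").
		// Suggested alternative: log it, drop the local replica's state,
		// and let it catch up from the remaining servers (hypothetical step).
		fmt.Println("corrupt consumer state, resetting local replica and resyncing from peers")
		return
	}
	if err != nil {
		fmt.Println("failed to load consumer state:", err)
		return
	}
	fmt.Printf("loaded %d bytes of consumer state\n", len(state))
}
```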