nats-server: JetStream does not survive after "stale NFS file handle" error
Defect
I’m running a NATS cluster on AWS Fargate. It ran fine for about two months (NATS v2.6.6), but two days ago it hit a storage issue with the EFS backend and JetStream became unavailable. I upgraded NATS to a recent version (v2.7.0) hoping it would heal itself automatically, but it did not.
Versions of nats-server and affected client libraries used:
- NATS 2.6.6 or NATS 2.7.0
OS/Container environment:
- AWS Fargate with efs-volume as storage backend
- For the reproduction steps below I’m using a CentOS 7 VM
nats-server -DV:
[20215] 2022/01/16 19:44:28.561325 [INF] Starting nats-server
[20215] 2022/01/16 19:44:28.561508 [INF] Version: 2.7.0
[20215] 2022/01/16 19:44:28.561512 [INF] Git: [not set]
[20215] 2022/01/16 19:44:28.561514 [DBG] Go build: go1.17.6
[20215] 2022/01/16 19:44:28.561518 [INF] Name: NBME2FKEOX72T5L6VUKYUEXKFIWIRUBXAQHDUTJHH3XOBQJUOIQ4CP6Q
[20215] 2022/01/16 19:44:28.561523 [INF] ID: NBME2FKEOX72T5L6VUKYUEXKFIWIRUBXAQHDUTJHH3XOBQJUOIQ4CP6Q
[20215] 2022/01/16 19:44:28.561547 [DBG] Created system account: "$SYS"
[20215] 2022/01/16 19:44:28.561910 [INF] Listening for client connections on 0.0.0.0:4222
[20215] 2022/01/16 19:44:28.561917 [DBG] Get non local IPs for "0.0.0.0"
[20215] 2022/01/16 19:44:28.562071 [DBG] ip=198.61.70.28
[20215] 2022/01/16 19:44:28.562097 [DBG] ip=172.17.0.1
[20215] 2022/01/16 19:44:28.562108 [INF] Server is ready
Steps or code to reproduce the issue:
- Install the NFS server (here and below a CentOS 7 VM is used):
  yum install nfs-utils
- Configure /etc/exports with the following export:
  /tmp/nats localhost(rw,no_root_squash)
  and apply the changes with:
  systemctl restart nfs
- Mount the NFS share:
  mount -t nfs -o hard localhost:/tmp/nats /mnt/nfsnats
- Start the JetStream cluster (run each command in a separate shell):
  nats-server -p 4222 -cluster nats://localhost:4248 --cluster_name test-cluster -routes nats://localhost:4248,nats://localhost:5248,nats://localhost:6248 -js -sd /mnt/nfsnats/natsA --name natsA
  nats-server -p 5222 -cluster nats://localhost:5248 --cluster_name test-cluster -routes nats://localhost:4248,nats://localhost:5248,nats://localhost:6248 -js -sd /mnt/nfsnats/natsB --name natsB
  nats-server -p 6222 -cluster nats://localhost:6248 --cluster_name test-cluster -routes nats://localhost:4248,nats://localhost:5248,nats://localhost:6248 -js -sd /mnt/nfsnats/natsC --name natsC
- Create a stream and start ingesting data:
  nats str add --replicas 3 --subjects data --storage file DATA   # all other options with default values
  i=0; while true; do ((i++)); nats pub data $i; sleep 0.3; done
- Simulate an NFS stale file handle error by “breaking” the mount point of the leader server for the DATA stream:
  cp -r /tmp/nats /tmp/nats.backup
  rm -rf /tmp/nats
  mv /tmp/nats.backup /tmp/nats
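For context on why this step works: the copy/remove/rename dance replaces the exported directory with a new inode at the same path, while NFS clients still hold file handles referring to the old inode. A minimal local sketch of that inode swap (assuming a Linux shell with GNU `stat`; `/tmp/stale-demo` is a hypothetical scratch path standing in for `/tmp/nats`):

```shell
# Hypothetical scratch directory standing in for /tmp/nats
mkdir -p /tmp/stale-demo/nats
before=$(stat -c %i /tmp/stale-demo/nats)

# Same dance as in the reproduction step above
cp -r /tmp/stale-demo/nats /tmp/stale-demo/nats.backup
rm -rf /tmp/stale-demo/nats
mv /tmp/stale-demo/nats.backup /tmp/stale-demo/nats

# The path is unchanged, but the inode (and thus the NFS handle) is new
after=$(stat -c %i /tmp/stale-demo/nats)
echo "inode before=$before after=$after"
```

Any NFS client that resolved `/tmp/nats` before the swap now holds a handle to a deleted inode, which the server answers with ESTALE.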
- Restart NFS:
  systemctl restart nfs
- JetStream shuts down:
[6229] 2022/01/16 20:19:39.416387 [ERR] JetStream failed to store a msg on stream '$G > DATA': write /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: bad file descriptor
[6229] 2022/01/16 20:19:39.737394 [ERR] JetStream failed to store a msg on stream '$G > DATA': error opening msg block file ["/mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk"]: open /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: stale NFS file handle
[6229] 2022/01/16 20:19:40.068785 [ERR] JetStream failed to store a msg on stream '$G > DATA': error opening msg block file ["/mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk"]: open /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: stale NFS file handle
[6229] 2022/01/16 20:19:40.401472 [ERR] JetStream failed to store a msg on stream '$G > DATA': error opening msg block file ["/mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk"]: open /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: stale NFS file handle
[6229] 2022/01/16 20:19:44.316535 [ERR] RAFT [wURfHZ9N - _meta_] Critical write error: open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:44.316564 [WRN] RAFT [wURfHZ9N - _meta_] Error writing term and vote file for "_meta_": open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:44.316846 [ERR] JetStream out of resources, will be DISABLED
[6229] 2022/01/16 20:19:44.317655 [ERR] RAFT [wURfHZ9N - _meta_] Critical write error: open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:44.317667 [WRN] RAFT [wURfHZ9N - _meta_] Error writing term and vote file for "_meta_": open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:45.325688 [INF] Initiating JetStream Shutdown...
[6229] 2022/01/16 20:19:45.325943 [INF] JetStream Shutdown
Every JetStream operation now fails with “JetStream system temporarily unavailable”.
- Fix the mount point:
  umount -f -l /mnt/nfsnats
  mount -t nfs -o hard localhost:/tmp/nats /mnt/nfsnats
- Restarting NATS does not recover JetStream:
$ nats str ls
No Streams defined
All data appears to be lost, yet it is still there in /mnt/nfsdata/…!
Expected result:
JetStream data should be recovered as much as possible.
Actual result:
All JetStream data is no longer available.
About this issue
- State: closed
- Created 2 years ago
- Comments: 30 (14 by maintainers)
If you can, please upgrade to at least 2.7.4. JetStream is improving rapidly and it is best to stay as current as possible.
The upcoming 2.7.5 also has some improvements, so if you are kicking the tires or doing testing you can use our nightly Docker image at synadia/nats-server:nightly.
I’m on version 2.8.1 and experiencing the log flooding as well.
Is there a way to reduce the volume of log warnings? I just blew through 50 GB of logs in less than 24 hours.
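I’m not aware of a flag that suppresses these specific warnings, but log growth can at least be bounded with the server’s log rollover setting. A config sketch (the path and the 256MB limit are arbitrary example values):

```
# nats-server config fragment (illustrative values)
log_file: "/var/log/nats/nats-server.log"
logfile_size_limit: 256MB   # roll the log file over once it reaches this size
```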
NATS JetStream is ready for production, and we are working hard with our customers to have a better experience when running inside of K8S.
The areas we want to watch out for are the following.
We have also put more work into our Helm charts, specifically utilizing the /healthz endpoint of a server when upgrading, before moving on to the next server.
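For reference, the /healthz endpoint mentioned above is served by the HTTP monitoring port, which has to be enabled in the server config. A sketch (8222 is the conventional monitoring port, used here as an example):

```
# nats-server config fragment
http_port: 8222   # exposes /healthz, /varz, etc. on the monitoring endpoint
```

A readiness probe can then poll it, e.g. `curl -fsS http://localhost:8222/healthz`, which should return a small JSON status once the server (and JetStream, if enabled) is healthy.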