aws-efs-csi-driver: Mount hanging indefinitely
/kind bug
What happened?
Processes accessing the NFS mount enter an uninterruptible sleep while trying to access the disk. This happens weekly, on different machines, in an EKS cluster. The only error dmesg reports is "nfs: server not responding"; there are no errors in the efs-csi-node pod logs.
What you expected to happen?
The mounts stay responsive.
How to reproduce it (as minimally and precisely as possible)?
This is an EKS cluster running 1.21 with 13 nodes. Everything works fine for an undetermined amount of time until the mount becomes unresponsive.
Anything else we need to know?:
Possibly related to https://github.com/aws/efs-utils/issues/61
sh-4.2$ nc -v localhost 20261
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:20261.
sh-4.2$ nc -v fs-0fe7dd1e29e866c20.efs.eu-west-2.amazonaws.com 2049
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.73.78:2049.
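Connectivity looks fine, so the useful signal is which tasks are actually wedged on the mount. A generic diagnostic sketch (not part of the original report) that lists processes in uninterruptible sleep (state D) and what they are waiting on:

# List processes currently stuck in uninterruptible sleep (state D).
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'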
dmesg
[2335453.756731] INFO: task redis-server:17623 blocked for more than 1208 seconds.
[2335453.763204] Not tainted 5.4.141-67.229.amzn2.x86_64 #1
[2335453.767340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2335453.774015] redis-server D 0 17623 17602 0x00000084
[2335453.778080] Call Trace:
[2335453.781004] __schedule+0x267/0x6c0
[2335453.838923] ? bit_wait+0x50/0x50
[2335453.842265] schedule+0x2f/0xa0
[2335453.845475] io_schedule+0x12/0x40
[2335453.848708] bit_wait_io+0xd/0x50
[2335453.851921] __wait_on_bit+0x2c/0x90
[2335453.855186] out_of_line_wait_on_bit+0x91/0xb0
[2335453.858814] ? init_wait_var_entry+0x40/0x40
[2335453.862425] nfs_lock_and_join_requests+0x8f/0x520 [nfs]
[2335453.866441] ? xas_load+0x8/0x80
[2335453.869618] ? find_get_entry+0xb4/0x150
[2335453.873039] nfs_updatepage+0x169/0x940 [nfs]
[2335453.876585] nfs_write_end+0x104/0x460 [nfs]
[2335453.880156] ? copyin+0x22/0x30
[2335453.883270] generic_perform_write+0xfc/0x1b0
[2335453.886842] nfs_file_write+0x107/0x2c0 [nfs]
[2335453.890479] new_sync_write+0x11b/0x1b0
[2335453.893804] vfs_write+0xad/0x1a0
[2335453.896988] ksys_write+0xa1/0xe0
[2335453.900191] do_syscall_64+0x48/0xf0
[2335453.903436] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[2335453.907280] RIP: 0033:0x7fd54efbf4a7
[2335453.910625] Code: Bad RIP value.
[2335453.913824] RSP: 002b:00007ffed8afcff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[2335453.920318] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fd54efbf4a7
[2335453.926603] RDX: 0000000000000053 RSI: 00007fd544e43005 RDI: 0000000000000021
[2335453.932929] RBP: 00007fd544e43005 R08: 0000000000000002 R09: 000000003aadeee6
[2335453.939238] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000053
[2335453.945607] R13: 0000000000000021 R14: 0000000000000000 R15: 000055d493f5e520
[2383821.915870] nfs: server 127.0.0.1 not responding, still trying
Environment
- Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: 1.3.5
About this issue
- State: closed
- Created 2 years ago
- Reactions: 12
- Comments: 45 (1 by maintainers)
Commits related to this issue
- eks-log-collector.sh: add timeout to df command df hangs in some scenarios like https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616 — committed to sedflix/amazon-eks-ami by sedflix 2 years ago
- eks-log-collector.sh: add timeout to df command. (#949) df hangs in some scenarios like https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616 — committed to awslabs/amazon-eks-ami by sedflix 2 years ago
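Those commits wrap df in a timeout because df itself blocks on a hung NFS mount; the same pattern is useful when poking at an affected node by hand (a sketch of the idea, not the exact change from the commits):

# df blocks forever on a dead NFS mount; bound it so diagnostics can continue.
timeout --signal=KILL 5 df -h || echo "df timed out, likely a hung NFS mount"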
+1 to setting this with the helm chart. Updating all of the node groups with custom launch templates is quite the chore to configure this one setting.
Subscribing, as we have a similar issue, but killing stunnel still doesn't fix it for us:
Jun 7 01:00:38 [ip] kernel: nfs: server 127.0.0.1 not responding, timed out
Are there other things to check/restart (vs killing the node)? I'd like to understand how it gets into this state and find a workaround to unblock it.
Hi @0nebody, nice job with the workaround daemonset. Just be aware that you may not need to mount /proc since "hostPID" is already set to true; the containers within the pod use the host PID namespace and can already see and interact with processes in the host namespace.
If you need to test the effect of the script, you can temporarily update your EFS security group to remove/block access on port 2049. This would cause the NFS timeouts to eventually appear in kernel logs which would then trigger the script to execute.
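For example, assuming the security group ID and CIDR below are placeholders for your own values, the inbound NFS rule can be dropped and later restored with the AWS CLI:

# Temporarily block NFS (TCP 2049) on the EFS security group to simulate a hang.
aws ec2 revoke-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 2049 --cidr 192.168.0.0/16

# Restore access once the watchdog has been observed to fire.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 2049 --cidr 192.168.0.0/16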
Hi @Cappuccinuo, Stunnel is not included/installed in the EKS optimised AMI/worker nodes; instead it is installed within the EFS CSI Driver image (see here), which runs as a Daemonset in an EKS cluster. This docker image seems to clone the efs-utils GitHub repo and install the amazon-efs-utils package as per this spec file, which specifies stunnel version 4.56. Therefore upgrading it is a bit challenging: it looks like you have to build your own EFS CSI driver image manually to update stunnel. Is it possible for a docker image version/tag of the EFS CSI driver to be available with stunnel upgraded to 5.63?

Adding a workaround here for anyone who needs it: the following script monitors the stunnel processes running on the worker node and kills them if they are unresponsive. The killed processes are then re-created by the EFS CSI driver. The stunnel processes are used to add TLS encryption to NFS traffic; you can read more about it here.
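The script body did not survive the copy here, so below is a minimal sketch of the same idea: walk the 127.0.0.1 NFS mounts created by efs-utils, probe each one with a bounded stat, and kill stunnel if the probe hangs. It assumes the CSI driver respawns the tunnel, as described above; the mount-detection and pkill pattern may need adjusting for your nodes.

#!/usr/bin/env bash
# Watchdog sketch: restart stunnel when an EFS TLS mount stops responding.
set -uo pipefail

# TLS mounts created by efs-utils appear in /proc/mounts as 127.0.0.1:/ nfs4 mounts.
while read -r mountpoint; do
  # If stat on the mount point does not return within 10 seconds, assume the tunnel is dead.
  if ! timeout 10 stat -t "$mountpoint" >/dev/null 2>&1; then
    echo "$(date -Is) $mountpoint unresponsive, killing stunnel" >&2
    # efs-utils runs one stunnel per mount; killing them all is blunt, but the
    # EFS CSI driver re-creates them, per the comment above.
    pkill stunnel || true
    break
  fi
done < <(awk '$1 ~ /^127\.0\.0\.1:/ && $3 == "nfs4" {print $2}' /proc/mounts)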
The above script can be invoked in a regular crontab entry or as a systemd service (see below). Running it as a daemonset is also possible, but you would need to run the container as root, mount /var/lib/kubelet to see NFS mount points created by the EFS CSI Driver, and specify hostPID: true in the pod spec to view and kill stunnel processes.

If you are using EKS managed nodegroups with a launch template specified (see here), you can use EC2 user data to make the above configuration changes as new EKS worker nodes are being provisioned.
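For reference, a minimal cron-based variant that could go into EC2 user data; the install path and the 2-minute interval are assumptions, not part of the original workaround:

# Assumed install location for the watchdog sketch above.
cat <<'EOF' >/etc/cron.d/efs-stunnel-watchdog
*/2 * * * * root /usr/local/bin/efs-stunnel-watchdog.sh >>/var/log/efs-stunnel-watchdog.log 2>&1
EOF
chmod 0644 /etc/cron.d/efs-stunnel-watchdog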
If anyone else wants to try this workaround, feel free to do so but please test before deploying to a production environment.
Same issue. I noticed it because of:
- df -h hangs (strace df hangs too)
- nfs: server 127.0.0.1 not responding, still trying reported by dmesg
EKS version: 1.21.5, driver version: 1.3.5
Edited: I upgraded the driver to 1.3.6 and still hit the same issue. I think this is a critical P1 bug.