aws-efs-csi-driver: Mount hanging indefinitely

/kind bug

What happened?

Processes accessing the NFS mount enter an uninterruptible sleep while trying to access the disk. This happens weekly on different machines in an EKS cluster. The only error dmesg reports is nfs server not responding. There are no errors in the efs-csi-node pod logs.
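
A couple of non-destructive checks that show the symptom (nothing EFS-specific, just standard tooling):

# List tasks stuck in uninterruptible sleep (state "D"), typically the ones blocked on the hung NFS mount
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# Kernel-side confirmation
dmesg | grep -i "nfs: server"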

What you expected to happen?

The mounts stay responsive.

How to reproduce it (as minimally and precisely as possible)?

This is an EKS cluster running 1.21 with 13 nodes. Everything works fine for an indeterminate amount of time until a mount becomes unresponsive.

Anything else we need to know?:

Possibly related to https://github.com/aws/efs-utils/issues/61

sh-4.2$ nc -v localhost 20261
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:20261.

sh-4.2$ nc -v fs-0fe7dd1e29e866c20.efs.eu-west-2.amazonaws.com 2049
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.73.78:2049.

dmesg

[2335453.756731] INFO: task redis-server:17623 blocked for more than 1208 seconds.
[2335453.763204]       Not tainted 5.4.141-67.229.amzn2.x86_64 #1
[2335453.767340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2335453.774015] redis-server    D    0 17623  17602 0x00000084
[2335453.778080] Call Trace:
[2335453.781004]  __schedule+0x267/0x6c0
[2335453.838923]  ? bit_wait+0x50/0x50
[2335453.842265]  schedule+0x2f/0xa0
[2335453.845475]  io_schedule+0x12/0x40
[2335453.848708]  bit_wait_io+0xd/0x50
[2335453.851921]  __wait_on_bit+0x2c/0x90
[2335453.855186]  out_of_line_wait_on_bit+0x91/0xb0
[2335453.858814]  ? init_wait_var_entry+0x40/0x40
[2335453.862425]  nfs_lock_and_join_requests+0x8f/0x520 [nfs]
[2335453.866441]  ? xas_load+0x8/0x80
[2335453.869618]  ? find_get_entry+0xb4/0x150
[2335453.873039]  nfs_updatepage+0x169/0x940 [nfs]
[2335453.876585]  nfs_write_end+0x104/0x460 [nfs]
[2335453.880156]  ? copyin+0x22/0x30
[2335453.883270]  generic_perform_write+0xfc/0x1b0
[2335453.886842]  nfs_file_write+0x107/0x2c0 [nfs]
[2335453.890479]  new_sync_write+0x11b/0x1b0
[2335453.893804]  vfs_write+0xad/0x1a0
[2335453.896988]  ksys_write+0xa1/0xe0
[2335453.900191]  do_syscall_64+0x48/0xf0
[2335453.903436]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[2335453.907280] RIP: 0033:0x7fd54efbf4a7
[2335453.910625] Code: Bad RIP value.
[2335453.913824] RSP: 002b:00007ffed8afcff0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[2335453.920318] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fd54efbf4a7
[2335453.926603] RDX: 0000000000000053 RSI: 00007fd544e43005 RDI: 0000000000000021
[2335453.932929] RBP: 00007fd544e43005 R08: 0000000000000002 R09: 000000003aadeee6
[2335453.939238] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000053
[2335453.945607] R13: 0000000000000021 R14: 0000000000000000 R15: 000055d493f5e520
[2383821.915870] nfs: server 127.0.0.1 not responding, still trying

Environment

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: 1.3.5

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 12
  • Comments: 45 (1 by maintainers)

Most upvoted comments

+1 to setting this via the Helm chart. Updating all of the node groups with custom launch templates is quite a chore just to configure this one setting.

Subscribing, as we have a similar issue, but killing stunnel still doesn't fix it for us.

Jun 7 01:00:38 [ip] kernel: nfs: server 127.0.0.1 not responding, timed out

Are there other things to check/restart (versus killing the node)? I'd like to understand how it gets into this state and find a workaround to unblock it.

Hi @0nebody, nice job with the workaround daemonset. Just be aware that you may not need to mount /proc, since hostPID is already set to true: the containers within the pod use the host PID namespace and can already see and interact with processes in the host namespace.

If you need to test the effect of the script, you can temporarily update your EFS security group to remove/block access on port 2049. This causes NFS timeouts to eventually appear in the kernel logs, which then triggers the script.
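
For example, something along these lines with the AWS CLI (the security group ID and CIDR below are placeholders for your EFS mount target security group and VPC range, and the revoke only works if it matches the existing rule exactly):

# Temporarily block NFS traffic to the EFS mount targets to simulate a hang
aws ec2 revoke-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 2049 \
    --cidr 192.168.0.0/16

# ...wait for "nfs: server 127.0.0.1 not responding" to appear in dmesg, then restore access
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 2049 \
    --cidr 192.168.0.0/16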

Hi @Cappuccinuo, stunnel is not included/installed in the EKS optimised AMI/worker nodes; instead it is installed within the EFS CSI Driver image (see here), which runs as a Daemonset in an EKS cluster. The Docker image clones the efs-utils GitHub repo and installs the amazon-efs-utils package as per this spec file, which pins stunnel version 4.56. Upgrading it is therefore a bit challenging: it looks like you have to build your own EFS CSI driver image to update stunnel. Would it be possible to publish a Docker image version/tag of the EFS CSI driver with stunnel upgraded to 5.63?
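
For reference, a quick way to confirm which stunnel version a node is actually running is to exec into the efs-csi-node pod on that node (the label selector and container name below are assumptions; adjust them to match your deployment):

# Find the efs-csi-node pod running on the affected node
kubectl -n kube-system get pods -l app=efs-csi-node -o wide

# Print the bundled stunnel version from inside that pod
kubectl -n kube-system exec <efs-csi-node-pod> -c efs-plugin -- stunnel -version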

Adding a workaround here for anyone who needs it: the following script monitors the stunnel processes running on the worker node and kills them if they are unresponsive. The killed processes are then re-created by the EFS CSI driver. The stunnel processes are used to add TLS encryption to NFS traffic; you can read more about it here.

#!/bin/bash

WAIT_TIME=60 # Time in seconds to wait before polling stunnel processes
TIMEOUT=2    # Time in seconds to wait for a response from the remote EFS file system connected via stunnel

while true; do
    # Find all NFSv4 mounts that go through the local stunnel endpoint (127.0.0.1)
    STUNNEL_NFS_MOUNTS=$(mount | grep nfs4 | grep 127.0.0.1 | awk '{print $3}')
    for STUNNEL_MOUNT in $STUNNEL_NFS_MOUNTS; do
        # stat the mount point with a timeout; exit code 124 means the mount is hung
        timeout $TIMEOUT stat -t "$STUNNEL_MOUNT" >/dev/null 2>&1
        if [ $? -eq 124 ]; then
            echo "[$(date)] Timeout when connecting to NFS, killing stunnel process"
            # The stunnel command line embeds the mount path with '/' replaced by '.'
            NAME=$(echo "$STUNNEL_MOUNT" | sed -r 's:/:.:g')
            # "[s]tunnel" keeps grep from matching itself in the process list
            PROCESS=$(ps -ef | grep "[s]tunnel" | grep "$NAME")
            PROCESS_NAME=$(echo $PROCESS | awk '{print $8, $9}')
            PROCESS_PID=$(echo $PROCESS | awk '{print $2}')
            kill -9 $PROCESS_PID
            echo "[$(date)] Killed process $PROCESS_PID '$PROCESS_NAME'"
        fi
    done
    sleep $WAIT_TIME
done

The above script can be invoked from a regular crontab entry or run as a systemd service (see below). Running it as a daemonset is also possible, but you would need to run the container as root, mount /var/lib/kubelet to see NFS mount points created by the EFS CSI Driver, and specify hostPID: true in the pod spec to view and kill stunnel processes.

[Unit]
Description=Monitor health of stunnel processes created by the EFS CSI driver

[Service]
ExecStart=<SCRIPT_LOCATION>

[Install]
WantedBy=multi-user.target
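
To wire this up on an existing node, something like the following should work (the script and unit file names are placeholders, not anything shipped by the driver):

# Install the monitor script and systemd unit, then enable and start the service
sudo install -m 0755 efs-stunnel-monitor.sh /usr/local/bin/efs-stunnel-monitor.sh
sudo cp efs-stunnel-monitor.service /etc/systemd/system/efs-stunnel-monitor.service
sudo systemctl daemon-reload
sudo systemctl enable --now efs-stunnel-monitor.service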

If you are using EKS managed node groups with a launch template specified (see here), you can use EC2 user data to apply the above configuration as new worker nodes are provisioned.
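
As a rough, untested sketch, the launch template user data could write the monitor script and unit file and enable the service at boot. Managed node groups with launch templates generally expect MIME multi-part user data; the heredoc contents below stand in for the script and unit from above:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==BOUNDARY=="

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# Write the monitor script and systemd unit, then enable the service at boot
cat > /usr/local/bin/efs-stunnel-monitor.sh <<'EOF'
# ...monitor script from above...
EOF
chmod 0755 /usr/local/bin/efs-stunnel-monitor.sh

cat > /etc/systemd/system/efs-stunnel-monitor.service <<'EOF'
# ...systemd unit from above...
EOF

systemctl daemon-reload
systemctl enable --now efs-stunnel-monitor.service

--==BOUNDARY==--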

If anyone else wants to try this workaround, feel free to do so but please test before deploying to a production environment.

Same issue here. I noticed it because of:

  • High CPU load on the node, but with low CPU usage
  • No response from df -h
  • df hanging on the EFS mount, visible with strace df (see below)
  • dmesg reporting nfs: server 127.0.0.1 not responding, still trying
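
To see exactly which mount point df hangs on, tracing its statfs calls narrows it down; the trace simply stalls at the unresponsive mount:

# df walks the mount table and issues one statfs per mount;
# the output stops at the statfs call for the hung EFS mount
strace -e trace=statfs df -h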

EKS version: 1.21.5, driver version: 1.3.5

Edited: I upgraded the driver to 1.3.6 and still see the same issue. I think this is a critical P1 bug.