aws-efs-csi-driver: mounts hang when efs-csi-node pods are restarted because of empty privateKey.pem
/kind bug
What happened?
The same issue as #178 and #569, still not solved.
After the EFS CSI driver container is replaced (e.g. by terminating the driver process or upgrading the driver to a new image), all existing mounts on that node hang for 1 hour:
Warning FailedMount 22m (x62 over 4h59m) kubelet Unable to attach or mount volumes: unmounted volumes=efs-data
Warning FailedMount 6m54s (x81 over 5h2m) kubelet MountVolume.SetUp failed for volume "xxx-efs" : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock: connect: connection refused
mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-xxxx,tls,noatime fs-xxxxx:/ /var/lib/kubelet/pods/xxxxx/volumes/kubernetes.io~csi/xxxxx/mount
Failed to create certificate signing request (csr), error is: b'unable to load Private Key\xxxx:error:xxx routines:PEM_read_bio:no start line:pem_lib.c:707:Expecting: ANY PRIVATE KEY\
Reason
The privateKey.pem that is persisted on the node can end up as an empty file, but the existence check does not detect this and therefore never recreates the key. Hence the node stays stale for 1 hour until the certificate is purged.
After an efs-csi-node pod restart, note the empty privateKey.pem file while the errors above are being logged:
/ # ls -la /host/var/amazon/efs/
-rw-r--r-- 1 root root 2707 Apr 26 07:08 efs-utils.conf
-rw-r--r-- 1 root root 4789 Apr 26 01:16 efs-utils.crt
-rw-r--r-- 1 root root 0 Apr 26 01:17 privateKey.pem
Workaround: delete the privateKey.pem and restart the pod:
# ls -la /host/var/amazon/efs/
-rw-r--r-- 1 root root 2707 Apr 26 08:46 efs-utils.conf
-rw-r--r-- 1 root root 4789 Apr 26 01:16 efs-utils.crt
-r-------- 1 root root 2484 Apr 26 08:47 privateKey.pem
What you expected to happen?
Nodes to stay healthy
How to reproduce it (as minimally and precisely as possible)?
same as #178
Anything else we need to know?:
Environment
- Kubernetes version (use kubectl version): 1.21
- Driver version: 1.3.7
About this issue
- State: open
- Created 2 years ago
- Reactions: 7
- Comments: 16 (2 by maintainers)
Commits related to this issue
- detect invalid private key The `privateKey.pem` can become an empty file due to several reasons, but this case is falsely detected as an existing valid key. Instead of just assuming an existing file ... — committed to o11n/efs-utils by deleted user 2 years ago
- detect invalid private key The `privateKey.pem` can become an empty file due to several reasons, but this case is falsely detected as an existing valid key. Instead of just assuming an existing file ... — committed to aws/efs-utils by deleted user 2 years ago
- detect invalid private key The `privateKey.pem` can become an empty file due to several reasons, but this case is falsely detected as an existing valid key. Instead of just assuming an existing file ... — committed to lmouhib/efs-utils by deleted user 2 years ago
- Check private key file size to skip generation Nowadays, the private key generation function checks if the private key file exists. However, if the openssl command that generates the private key file... — committed to otorreno/efs-utils by otorreno 10 months ago
- Check private key file size to skip generation Nowadays, the private key generation function checks if the private key file exists. However, if the openssl command that generates the private key file... — committed to aws/efs-utils by otorreno 10 months ago
Workaround for the time being, add an initContainer that deletes the invalid key:
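For illustration, a minimal initContainer along these lines could look as follows; the container name, image, volume name, and mount paths are assumptions and need to be adapted to the efs-csi-node DaemonSet in use (if it already defines a hostPath volume for /var/amazon/efs, reuse that instead of adding a new one):

```yaml
# Illustrative sketch only, not the exact snippet from the original comment.
initContainers:
  - name: remove-empty-efs-private-key
    image: alpine:3   # the efs-csi-node image itself does not ship rm
    command:
      - sh
      - -c
      # remove the key only when it is present but zero bytes, so that
      # efs-utils regenerates it on the next TLS mount
      - '[ -s /host/var/amazon/efs/privateKey.pem ] || rm -f /host/var/amazon/efs/privateKey.pem'
    volumeMounts:
      - name: efs-utils-state          # hostPath volume for /var/amazon/efs (name is an assumption)
        mountPath: /host/var/amazon/efs
volumes:
  - name: efs-utils-state
    hostPath:
      path: /var/amazon/efs
      type: DirectoryOrCreate
```

Guarding the rm with [ -s ] keeps a valid key in place and only removes the zero-byte file that efs-utils otherwise mistakes for a usable key.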
Would be nice if this could be picked up since we are also facing this…
I think this PR to efs-utils should fix this issue https://github.com/aws/efs-utils/pull/174/files. We are still facing the issue even with the initContainer.
v1.4.9 didn’t fix it completely because the fix was only applied to the watchdog folder, and the same code is duplicated in the mount_efs one
Same here, it seems that sometimes the privateKey.pem is empty.
The workaround of @universam1 is working, but I had to switch to another image (alpine) because the rm command is not in the efs-csi-node image itself.
That said, it’s currently not possible to add this workaround directly in the chart (no initContainers here: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/charts/aws-efs-csi-driver/templates/node-daemonset.yaml). @mskanth972 would it be possible to update the chart so we can set initContainers in the daemonset?
According to efs-utils #130, this has been fixed in v1.4.9 of aws-efs-csi-driver.
Somehow we’ve encountered the bug today with the driver in version 1.3.3, after a rollout where we wanted to restart the driver across the whole cluster.
At first it looked like a mount issue (unbound immediate PersistentVolumeClaims). The socket returning “connection refused” made us think of a network issue, but our security groups for the EFS Access Points were OK… Only after a while did the private key message appear, and it was the issue mentioned here.
I suppose the issue is that the driver wants to start an stunnel (because we enforce TLS for the NFS calls to EFS), and for that it needs to create a certificate (thus crafting a CSR). As the private key is corrupted (empty), it fails.
Thanks a lot @universam1 for the analysis and the workaround!