aws-efs-csi-driver: mounts hang when efs-csi-node pods are restarted because of empty privateKey.pem

/kind bug

What happened?

This is the same issue as #178 and #569; it is still not solved.

After the EFS CSI driver container is replaced (e.g. by terminating the driver process or upgrading the driver to a new image), all existing mounts on that node hang for 1 hour:

Warning  FailedMount  22m (x62 over 4h59m)   kubelet  Unable to attach or mount volumes: unmounted volumes=efs-data

Warning  FailedMount  6m54s (x81 over 5h2m)  kubelet  MountVolume.SetUp failed for volume "xxx-efs" : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock: connect: connection refused
mount failed: exit status 1

Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-xxxx,tls,noatime fs-xxxxx:/ /var/lib/kubelet/pods/xxxxx/volumes/kubernetes.io~csi/xxxxx/mount
Failed to create certificate signing request (csr), error is: b'unable to load Private Key\xxxx:error:xxx routines:PEM_read_bio:no start line:pem_lib.c:707:Expecting: ANY PRIVATE KEY\

Reason

The privateKey.pem persisted on the node can end up as an empty file, but the check only tests for the file's existence, so it does not detect this and does not recreate the key. Hence mounts on that node keep failing for 1 hour, until the certificate is purged.
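For illustration, the faulty check versus a size-aware one, sketched in shell (this is not the actual efs-utils code, and generate_private_key is a placeholder):

# existence-only check (the faulty logic): a zero-byte privateKey.pem still
# exists, so the key is never regenerated
[ -f /var/amazon/efs/privateKey.pem ] || generate_private_key

# size-aware check: -s requires the file to exist AND be non-empty,
# so a truncated key is detected and regenerated
[ -s /var/amazon/efs/privateKey.pem ] || generate_private_key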

After an efs-csi-node pod restart, note the empty privateKey.pem while the above errors are being logged:

/ # ls -la /host/var/amazon/efs/
-rw-r--r--    1 root     root          2707 Apr 26 07:08 efs-utils.conf
-rw-r--r--    1 root     root          4789 Apr 26 01:16 efs-utils.crt
-rw-r--r--    1 root     root             0 Apr 26 01:17 privateKey.pem

Workaround: delete privateKey.pem and restart the pod, after which the key is regenerated correctly:

# ls -la /host/var/amazon/efs/
-rw-r--r--    1 root     root          2707 Apr 26 08:46 efs-utils.conf
-rw-r--r--    1 root     root          4789 Apr 26 01:16 efs-utils.crt
-r--------    1 root     root          2484 Apr 26 08:47 privateKey.pem
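In command form, the manual workaround is roughly the following (run from the driver container with the host path mounted; the pod name is illustrative):

# remove the zero-byte key from the host path
rm /host/var/amazon/efs/privateKey.pem
# restart the node driver pod; the DaemonSet recreates it and a fresh key is generated
kubectl -n kube-system delete pod efs-csi-node-xxxxx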

What you expected to happen?

Nodes to stay healthy.

How to reproduce it (as minimally and precisely as possible)?

Same as #178.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): 1.21
  • Driver version: 1.3.7

Most upvoted comments

As a workaround for the time being, add an initContainer that deletes the invalid (empty) key:

      initContainers:
      - command: ["/bin/sh"]
        # delete privateKey.pem when it is missing or zero bytes ("test -s" fails for both)
        args: ["-c", "test -s /var/amazon/efs/privateKey.pem || rm -f /var/amazon/efs/privateKey.pem"]
        image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efs-csi-driver:v1.3.7
        name: purge-invalid-key
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /var/amazon/efs
          name: efs-utils-config

Would be nice if this could be picked up since we are also facing this…

I think this PR to efs-utils should fix this issue: https://github.com/aws/efs-utils/pull/174/files. We are still facing the issue even with the initContainer.

v1.4.9 didn’t fix it completely because the fix was only applied to the watchdog folder; the same code is duplicated in the mount_efs one.

Same here; it seems that sometimes the privateKey.pem is empty.

The workaround from @universam1 works, but I had to switch to another image (alpine) because the rm command is not in the efs-csi-node image itself.

That said, it’s currently not possible to add this workaround directly in the chart (no initContainers here: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/charts/aws-efs-csi-driver/templates/node-daemonset.yaml). @mskanth972 would it be possible to update the chart so we can set initContainers in the daemonset?
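Until the chart supports initContainers, one way to inject the workaround is a strategic merge patch against the DaemonSet; a sketch, assuming the DaemonSet is named efs-csi-node in kube-system and reusing the efs-utils-config volume from the manifest above:

kubectl -n kube-system patch daemonset efs-csi-node --patch '
spec:
  template:
    spec:
      initContainers:
      - name: purge-invalid-key
        image: alpine:3.18
        command: ["/bin/sh", "-c", "test -s /var/amazon/efs/privateKey.pem || rm -f /var/amazon/efs/privateKey.pem"]
        volumeMounts:
        - mountPath: /var/amazon/efs
          name: efs-utils-config
'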

According to efs-utils #130, this has been fixed in v1.4.9 of aws-efs-csi-driver.

Somehow we’ve encountered the bug today with driver version 1.3.3, after a rollout where we wanted to restart the driver across the whole cluster.

At first it looked like a mount issue (unbound immediate PersistentVolumeClaims). The socket returning “connection refused” made us think of a network issue, but our security groups for the EFS Access Points were OK… Only after a while did the private key message appear, and it was the issue mentioned here.

I suppose the issue is that the driver wants to start an stunnel (because we enforce TLS for the NFS calls to EFS), and for that it needs to create a certificate (thus crafting a CSR). Since the private key is corrupted (empty), this fails.
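That matches the logged openssl error, which can be reproduced with an empty key (paths here are illustrative):

# create a zero-byte key, like the one found on the node
: > /tmp/privateKey.pem
# ask openssl to create a CSR from it, roughly what efs-utils does when setting up the TLS tunnel
openssl req -new -key /tmp/privateKey.pem -subj "/CN=test" -out /tmp/test.csr
# -> unable to load Private Key ... PEM routines:PEM_read_bio:no start line ... Expecting: ANY PRIVATE KEY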

Thanks a lot @universam1 for the analysis and the workaround!