aws-efs-csi-driver: Pods are getting stuck in ContainerCreating because of EFS CSI driver error

/kind bug

Hello.

What happened? Sometimes, on a new EC2 instance, the EFS CSI driver cannot handle PVCs from Pods running on that node, which causes the issue described in the subject.

What you expected to happen? The PVC should be handled correctly so that the volume can be mounted and attached to the Pod.

How to reproduce it (as minimally and precisely as possible)? It happens intermittently. Generally, the process is as follows (a short verification sketch follows the manifests below):

  1. Create an EKS cluster and a Node group (the EC2 instance type does not seem to matter - it happened on both t3.medium and m5.large).

  2. Create an EFS file system and a SecurityGroup for it that allows connections to 2049/tcp from the VPC CIDR. Say the EFS ID is fs-33559c06.

  3. Create an EFS AccessPoint for the particular application (e.g. ID fsap-085397e3f5ef75ead).

  4. Create StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
  5. Create a PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-thanos-compactor
  labels:
    'app.kubernetes.io/name': 'thanos-compactor'
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: 'fs-33559c06::fsap-085397e3f5ef75ead'
    volumeAttributes:
      encryptInTransit: 'true'
  6. Create a PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/name: thanos-compactor
  name: thanos-compactor
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: efs-sc
  volumeMode: Filesystem
  selector:
    matchLabels:
      "app.kubernetes.io/name": thanos-compactor
  7. Create the Pod:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    app.kubernetes.io/component: compactor
    app.kubernetes.io/instance: monitoring-thanos
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: thanos
    helm.sh/chart: thanos-3.8.1
    pod-template-hash: 7c8b88b68
  name: thanos-compactor-7c8b88b68-6hflm
  namespace: monitoring

# ...
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: thanos-compactor

# ...
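For reference, a minimal way to verify the state after applying the manifests above (namespace and object names match the manifests; adjust for your setup):

# Check that the PV and PVC are bound
kubectl get pv efs-pv-thanos-compactor
kubectl -n monitoring get pvc thanos-compactor

# Check whether the Pod is stuck in ContainerCreating and inspect its mount events
kubectl -n monitoring get pods -l app.kubernetes.io/name=thanos
kubectl -n monitoring describe pod thanos-compactor-7c8b88b68-6hflm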

Anything else we need to know?:

  1. There was an error message as the last line of the csi-driver-registrar container logs (by the time of writing I had lost the exact timestamp):
I0210 08:47:20.032981       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
Exxxx HH:MM:SS.mmmmmm       1 efs connection.go:129] Lost connection to unix:///csi/csi.sock
  2. After restarting the EFS CSI Pod the error above disappeared, but it didn’t help to resolve the issue - the only workaround is to taint/remove the node from the cluster and create a new EC2 instance.

  3. Meanwhile, the logs from the efs-plugin container before and after the restart are not very helpful (a node-level log check sketch follows this list). It keeps repeating the same sequence every 4 minutes 2 seconds:

I0210 09:43:51.664790       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:43:51.673852       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:43:51.674749       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" > 
I0210 09:43:51.674810       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:43:51.674986       1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:43:51.675089       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)

I0210 09:47:53.762514       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:47:53.769379       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:47:53.770295       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" > 
I0210 09:47:53.770365       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:47:53.770390       1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:47:53.770410       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)
  4. Events (kubectl get events -n monitoring --sort-by=metadata.creationTimestamp):
4m46s       Warning   FailedMount               pod/thanos-compactor-7c8b88b68-6hflm              MountVolume.SetUp failed for volume "efs-pv-thanos-compactor" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
33m         Warning   FailedMount               pod/thanos-compactor-7c8b88b68-6hflm              Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[objstore-config data thanos-token-f7njj aws-iam-token]: timed out waiting for the condition
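As an additional data point, a sketch for checking the mount helper on the affected instance itself; the log path assumes the default efs-utils configuration:

# On the affected EC2 instance (via SSH or SSM):
# look for hung mount.efs / stunnel processes
ps aux | grep -E 'mount.efs|stunnel' | grep -v grep

# efs-utils (the EFS mount helper) writes its own logs here by default (assumed default path)
sudo tail -n 100 /var/log/amazon/efs/mount.log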

Environment

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: Installed using Helm chart aws-efs-csi-driver version 1.1.0 from the repo https://kubernetes-sigs.github.io/aws-efs-csi-driver with default values.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 15
  • Comments: 51 (12 by maintainers)

Most upvoted comments

@V0lantis: You can’t reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/reopen

Seeing the same issue on the latest version of CSI driver

The Helm chart does not specify default resources. Our findings suggest that a high load average on worker nodes can starve the EFS CSI driver of CPU. We have a case where we can replicate this behaviour every time.

Setting resource requests for containers seems to solve the problem for us, e.g.:

[...]
      containers:
        - name: efs-plugin
          image: amazon/aws-efs-csi-driver:v1.3.3
          resources:
            limits:
              memory: "1024Mi"
              cpu: "1000m"
            requests:
              memory: "32Mi"
              cpu: "20m"
[...]
        - name: csi-driver-registrar
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "16Mi"
              cpu: "10m"
[...]
        - name: liveness-probe
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "16Mi"
              cpu: "10m"
[...]

I’m not entirely sure if liveness-probe and csi-driver-registrar resource requests are required, as these containers don’t show spikes in CPU usage.
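For a quick way to try this without re-templating the chart, the requests can also be set directly on the running node DaemonSet; this is only a sketch and assumes the default install, where the DaemonSet is named efs-csi-node in kube-system (a Helm values override is the durable way to keep the change across upgrades):

# Verify the DaemonSet name first (efs-csi-node is assumed from the default chart install)
kubectl -n kube-system get daemonset

# Apply requests/limits to the efs-plugin container using the values from the comment above
kubectl -n kube-system set resources daemonset/efs-csi-node \
  --containers=efs-plugin \
  --requests=cpu=20m,memory=32Mi \
  --limits=cpu=1000m,memory=1024Mi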

If the CSIDriver object does not exist, the default attachRequired value for a driver is true, so that could explain the attach errors. Please try creating the object and let me know if pods still get stuck.

For my part, let me prioritize getting the fix out into the Helm chart.
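In case it helps, the object in question is the cluster-scoped CSIDriver; a minimal manifest that disables the attach step for this driver would look roughly like the following (use the storage.k8s.io API version your cluster supports - v1 is available from Kubernetes 1.18):

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false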

In case it gets stuck again, you can try this tool: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/log_collector.py. The driver plugin may not have any useful logs; this tool also collects logs from efs-utils, i.e. the EFS mount helper.

All reports so far say the issue is random/intermittent, which leads me to believe there’s some kind of race condition when lots of mounts/unmounts are happening at once. So I think that to reproduce it we’ll need to do some stress testing.
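One rough way to generate that kind of churn, as a sketch: repeatedly scale a workload that mounts the same EFS-backed RWX PVC up and down, so that many NodePublishVolume/NodeUnpublishVolume calls land on the same node at once (the deployment name below is assumed from the pod name in the issue):

# Concentrate mount/unmount churn on the EFS volume
for i in $(seq 1 20); do
  kubectl -n monitoring scale deployment/thanos-compactor --replicas=20
  sleep 60
  kubectl -n monitoring scale deployment/thanos-compactor --replicas=0
  sleep 60
done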

Could you send me the kubelet logs? We can continue on Slack to share files if needed.
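A minimal way to pull those from the affected node, assuming a systemd-managed kubelet as on the EKS-optimized AMI:

# On the affected node, capture kubelet logs around the failure window
journalctl -u kubelet --since "1 hour ago" > kubelet.log

# Or filter for the volume in question
journalctl -u kubelet | grep -i efs-pv-thanos-compactor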

I was able to patch the PVC with the following, then delete it:

kubectl -n team1 patch pvc efs-team1 -p '{"metadata":{"finalizers":null}}'

but after reapplying the PVC the volumes are not attaching.

kubectl get volumeattachment

shows ATTACHED = false.

@wongma7 do you have any suggestions? We are unable to use the volume from the cluster at this point. This is the tool AWS recommends for attaching EFS volumes in EKS, but so far it has been very unreliable, and the logs show nothing new even after raising the logging level.

I also noticed that the default logging level produces no helpful logs for this. Are we possibly missing a system log that might show more of what is going on?
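Given the earlier comment about attachRequired, one quick check worth running here, as a sketch:

# Does the CSIDriver object exist, and is attachRequired set to false?
kubectl get csidriver efs.csi.aws.com -o yaml

# If it is missing or true, an attach step is expected but never completes for EFS,
# which would match the VolumeAttachment showing ATTACHED = false above
kubectl get volumeattachment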