aws-efs-csi-driver: Pods are getting stuck in ContainerCreating because of EFS CSI driver error
/kind bug
Hello.
What happened? Sometimes, on a new EC2 instance, the EFS CSI driver can’t handle a PVC from Pods running on that node, which causes the issue described in the subject.
What you expected to happen? The PVC should be handled correctly so the volume can be mounted and attached to the Pod.
How to reproduce it (as minimally and precisely as possible)? It happens intermittently. Generally the process is as follows:
- Create EKS cluster and Node group (it doesn’t matter which EC2 instance type you choose; it happened on `t3.medium` and `m5.large`).
- Create EFS and a SecurityGroup for EFS to allow connections to `2049/tcp` from the VPC CIDR. Say, the EFS ID is `fs-33559c06`.
- Create EFS AccessPoint for the particular application (e.g. ID `fsap-085397e3f5ef75ead`).
- Create StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
- Create PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-thanos-compactor
  labels:
    'app.kubernetes.io/name': 'thanos-compactor'
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: 'fs-33559c06::fsap-085397e3f5ef75ead'
    volumeAttributes:
      encryptInTransit: 'true'
- Create PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/name: thanos-compactor
  name: thanos-compactor
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: efs-sc
  volumeMode: Filesystem
  selector:
    matchLabels:
      "app.kubernetes.io/name": thanos-compactor
- Create Pod:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    app.kubernetes.io/component: compactor
    app.kubernetes.io/instance: monitoring-thanos
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: thanos
    helm.sh/chart: thanos-3.8.1
    pod-template-hash: 7c8b88b68
  name: thanos-compactor-7c8b88b68-6hflm
  namespace: monitoring
  # ...
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: thanos-compactor
  # ...
Anything else we need to know?:
- There was an error message as the last line in the `csi-driver-registrar` container (by the time of writing I had lost the exact timestamp):
I0210 08:47:20.032981 1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
Exxxx HH:MM:SS.mmmmmm 1 efs connection.go:129] Lost connection to unix:///csi/csi.sock
- After restarting the EFS CSI Pod the error above disappeared, but it didn’t help to resolve the issue; the only workaround is to taint/remove the node from the cluster and create a new EC2 instance.
- Meanwhile, logs from the `efs-plugin` container before and after the restart are not very helpful. It keeps repeating the same sequence every 4 minutes 2 seconds:
I0210 09:43:51.664790 1 node.go:279] NodeGetCapabilities: called with args
I0210 09:43:51.673852 1 node.go:279] NodeGetCapabilities: called with args
I0210 09:43:51.674749 1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" >
I0210 09:43:51.674810 1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:43:51.674986 1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:43:51.675089 1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)
I0210 09:47:53.762514 1 node.go:279] NodeGetCapabilities: called with args
I0210 09:47:53.769379 1 node.go:279] NodeGetCapabilities: called with args
I0210 09:47:53.770295 1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" >
I0210 09:47:53.770365 1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:47:53.770390 1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:47:53.770410 1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)
- Events (`kubectl get events -n monitoring --sort-by=metadata.creationTimestamp`):
4m46s Warning FailedMount pod/thanos-compactor-7c8b88b68-6hflm MountVolume.SetUp failed for volume "efs-pv-thanos-compactor" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
33m Warning FailedMount pod/thanos-compactor-7c8b88b68-6hflm Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[objstore-config data thanos-token-f7njj aws-iam-token]: timed out waiting for the condition
Environment
- Kubernetes version (use `kubectl version`):
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: installed using the Helm chart `aws-efs-csi-driver` version `1.1.0` from the repo https://kubernetes-sigs.github.io/aws-efs-csi-driver with default values.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 15
- Comments: 51 (12 by maintainers)
@V0lantis: You can’t reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
Seeing the same issue on the latest version of the CSI driver.
The Helm chart does not specify default resources. Our findings suggest that a high load average on worker nodes may cause CPU starvation of the EFS CSI driver. We have a case where we can replicate this behaviour every time.
Setting resource requests for containers seems to solve the problem for us, e.g.:
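As a minimal sketch of the kind of configuration meant here (assuming the node DaemonSet is named `efs-csi-node` in `kube-system` and its main container is `efs-plugin`; the CPU/memory figures are illustrative, not a recommendation):
```yaml
# Strategic merge patch body adding resource requests to the efs-plugin
# container of the node DaemonSet; apply it with `kubectl patch` or set the
# equivalent values in the Helm chart if your chart version exposes them.
spec:
  template:
    spec:
      containers:
        - name: efs-plugin
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
```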
I’m not entirely sure if `liveness-probe` and `csi-driver-registrar` resource requests are required, as these containers don’t show spikes in CPU usage.
The default `attachRequired` value for a driver when the CSIDriver object does not exist is `true`, so that could explain the attach errors. Please try creating the object and let me know if pods still get stuck. On my part, let me prioritize getting the fix out into the Helm chart.
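For reference, a minimal sketch of the CSIDriver object being suggested, using the driver name `efs.csi.aws.com` from the PV above (on clusters older than 1.18 the API version is `storage.k8s.io/v1beta1`):
```yaml
# CSIDriver object declaring that EFS volumes need no attach
# (ControllerPublishVolume) step; without this object, attachRequired
# is assumed to be true.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false
```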
In case it gets stuck again, you can try this tool: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/log_collector.py. The driver plugin may not have any useful logs; this tool also collects logs from efs-utils, i.e. the EFS mount helper.
All reports so far say the issue is random/intermittent, which leads me to believe there’s some kind of race condition when lots of mounts/unmounts are happening at once. So I think that to reproduce it we’ll need to do some stress testing.
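As a rough sketch of what such a stress test could look like (an illustration, not something from this thread), the Deployment below churns many concurrent mounts of an existing EFS-backed ReadWriteMany PVC, using the `thanos-compactor` claim from the manifests above purely as an example; scaling it up and down repeatedly forces frequent NodePublishVolume/NodeUnpublishVolume calls:
```yaml
# Many pods mounting the same EFS-backed ReadWriteMany PVC at once.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: efs-mount-churn
  namespace: monitoring
spec:
  replicas: 20
  selector:
    matchLabels:
      app: efs-mount-churn
  template:
    metadata:
      labels:
        app: efs-mount-churn
    spec:
      containers:
        - name: writer
          image: busybox:1.35
          # Append a line every few seconds so the mount is actively used.
          command: ["sh", "-c", "while true; do date >> /data/$(hostname).log; sleep 5; done"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: thanos-compactor
```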
Could you send me kubelet logs? We can continue on slack to share files if needed.
I was able to patch the PVC with the following, then delete it:
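A typical merge patch for this (an assumption, not necessarily the exact snippet used) clears the `pvc-protection` finalizer so the deletion can complete:
```yaml
# Hypothetical merge-patch body clearing the PVC finalizers, e.g.:
#   kubectl -n monitoring patch pvc thanos-compactor --type=merge -p '{"metadata":{"finalizers":null}}'
metadata:
  finalizers: null
```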
But after reapplying the PVC, the volumes are not attaching.
The attachment status shows `ATTACHED = false`. @wongma7, do you have any suggestions? We are unable to use the volume from the cluster at this point. This is the recommended tool from AWS for attaching EFS volumes in EKS, but so far it has been very unreliable, and the logs show nothing new after setting the logging level higher.
I also noticed that the default logging level seems to produce no helpful logs for this. Are we possibly missing a system log that might show more of what is going on?