aws-fsx-csi-driver: Random "Unable to attach or mount volumes"
/kind bug
What happened?
Pods get stuck in ContainerCreating and nodes are rendered useless, seemingly at random.
We have been seeing this happen occasionally for a couple of months (possibly longer, since we started using Lustre, but it only seems to appear when there is a lot going on). It is quite disruptive in production systems.
The timeline of the issue is:
- A new pod, podX, gets created; it mounts multiple Lustre volumes, both statically and dynamically provisioned.
- podX is scheduled to a node that has been working fine.
- podX gets stuck in ContainerCreating status. The events in the pod are these (Lustre volume IDs and IPs redacted):
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 42m (x94 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23714 vol-23715 workflow-token-csshf vol-23713]: timed out waiting for the condition
Warning FailedMount 17m (x105 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23715 workflow-token-csshf vol-23713 vol-23714]: timed out waiting for the condition
Warning FailedMount 13m (x71 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal MountVolume.SetUp failed for volume "pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e" : rpc error: code = Internal desc = Could not mount "<lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv" at "/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount": mount failed: exit status 17
Mounting command: mount
Mounting arguments: -t lustre -o flock <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount
Output: mount.lustre: mount <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv at /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount failed: File exists
Warning FailedMount 8m1s (x85 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[workflow-token-csshf vol-23713 vol-23714 vol-23715]: timed out waiting for the condition
Warning FailedMount 77s (x233 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23713 vol-23714 vol-23715 workflow-token-csshf]: timed out waiting for the condition
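On the "exit status 17 / File exists" part: errno 17 is EEXIST, so my reading is that the kernel already has something (perhaps a stale client) mounted at, or conflicting with, the target path, but I have not been able to confirm that on a live node yet. Next time it happens I plan to check with something along these lines (assuming SSH access to the node; the path is just the one from the error above):

# Run on the affected node; TARGET is the kubelet target path from the error above.
TARGET=/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount

# Which Lustre mounts does the kernel currently know about?
grep lustre /proc/mounts

# Is the target path already a mount point, and for which device?
findmnt "$TARGET"

# Does the mount still respond, or does it hang? A statfs that never returns
# would point at a dead/stale Lustre client mount.
timeout 10 stat -f "$TARGET"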
Things to note:
- Once this happens on a node, multiple pods are affected, usually with similar creation times.
- The same issue happens on other nodes, with different Lustre volumes involved (at different times).
- The same Lustre volume mounts perfectly fine on other nodes.
- Not all Lustre volumes on the affected node have the issue; some are mounted properly.
- The node does not recover from this situation: the affected Lustre volumes can no longer be mounted by any pod on that node.
What you expected to happen?
I expect that at some point the pods stuck in ContainerCreating become able to mount the volume, or otherwise to have a way to identify that a node is not healthy “lustre-wise”.
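To illustrate what I mean by “lustre-wise” health, below is a rough sketch of the kind of node-level check I have in mind. The DaemonSet / node-problem-detector wiring is purely my assumption, not something the driver provides today:

#!/bin/sh
# Hypothetical node-level check (my own sketch, e.g. to be run from a
# DaemonSet or a node-problem-detector custom plugin): flag the node if
# any Lustre mount point no longer answers statfs within a timeout.
status=0
for mnt in $(awk '$3 == "lustre" {print $2}' /proc/mounts); do
  if ! timeout 10 stat -f "$mnt" >/dev/null 2>&1; then
    echo "unhealthy lustre mount: $mnt"
    status=1
  fi
done
exit $status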
How to reproduce it (as minimally and precisely as possible)?
I don't have a reproduction recipe; it just happens randomly. There is https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/213, which has a similar error message, but I am not sure it is the same issue, as in my case it does not seem reproducible.
Anything else we need to know?:
For context, here are the manifests of the StorageClass and the PVC, although as I said the issue happens with both statically and dynamically provisioned Lustre volumes. I have also attached the logs of the driver and the kubelet at the time of the issue: fsx-driver-kubelet.csv
StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-scratch-sc
mountOptions:
  - flock
parameters:
  deploymentType: SCRATCH_2
  securityGroupIds: sg-0b7e22c1b83b54b18
  subnetId: subnet-0b2314c14de6a3ea8
provisioner: fsx.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: fsx.csi.aws.com
  labels:
    app-name: executor
  name: executor-483c6798-pvc
  namespace: workflows
  resourceVersion: "444522977"
  uid: abf3d9ca-ab58-4d82-a2a9-d79768cd038e
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 120000Gi
  storageClassName: fsx-scratch-sc
  volumeMode: Filesystem
  volumeName: pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 120000Gi
  phase: Bound
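And, for completeness, the statically provisioned volumes are plain PersistentVolume objects for the same driver. The sketch below is not one of our real manifests (the file system ID, DNS name and mount name are placeholders), just the shape we use, roughly following the driver's static provisioning example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-static-pv                      # placeholder name
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  mountOptions:
    - flock
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0     # placeholder FSx for Lustre file system ID
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.eu-west-1.amazonaws.com
      mountname: abcd1234                  # placeholder; the file system's MountName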
Environment
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: v0.8.1, installed via Helm chart v1.4.1
- Sidecars of the driver:
sidecars:
  livenessProbe:
    image:
      tag: v2.3.0-eks-1-20-11
  nodeDriverRegistrar:
    image:
      tag: v2.2.0-eks-1-20-11
  provisioner:
    image:
      tag: v2.2.2-eks-1-20-11
  resizer:
    image:
      tag: v1.2.0-eks-1-20-11
About this issue
- State: closed
- Created 2 years ago
- Comments: 31 (17 by maintainers)
@kanor1306 Ah I see. I’m going to close this issue since this is no longer an issue for you. @colearendt please feel free to open a new issue with logs/information regarding the specific issue you’re seeing