aws-fsx-csi-driver: Random "Unable to attach or mount volumes"

/kind bug

What happened? Pods get stuck in ContainerCreating and nodes are rendered unusable, seemingly at random.

We have been seeing this issue occasionally for a couple of months (possibly longer, since we started using Lustre, but it only seems to appear when there is a lot of activity). It is quite disruptive in production systems.

The timeline of the issue is:

  • A new pod, podX, is created; it mounts multiple Lustre volumes, both statically and dynamically provisioned.
  • podX is scheduled to a node that has been working fine.
  • podX gets stuck in ContainerCreating status. The events in the pod are these (Lustre volume ID and IPs redacted):
 Type     Reason       Age                  From                                                   Message
  ----     ------       ----                 ----                                                   -------
  Warning  FailedMount  42m (x94 over 19h)   kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23714 vol-23715 workflow-token-csshf vol-23713]: timed out waiting for the condition
  Warning  FailedMount  17m (x105 over 19h)  kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23715 workflow-token-csshf vol-23713 vol-23714]: timed out waiting for the condition
  Warning  FailedMount  13m (x71 over 19h)   kubelet, ip-x-x-x-x.eu-west-1.compute.internal  MountVolume.SetUp failed for volume "pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e" : rpc error: code = Internal desc = Could not mount "<lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv" at "/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount": mount failed: exit status 17
Mounting command: mount
Mounting arguments: -t lustre -o flock <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount
Output: mount.lustre: mount <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv at /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount failed: File exists
  Warning  FailedMount  8m1s (x85 over 19h)  kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[workflow-token-csshf vol-23713 vol-23714 vol-23715]: timed out waiting for the condition
  Warning  FailedMount  77s (x233 over 19h)  kubelet, ip-x-x-x-x.eu-west-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23713 vol-23714 vol-23715 workflow-token-csshf]: timed out waiting for the condition
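
For anyone debugging the same thing: exit status 17 together with "File exists" points to EEXIST from mount.lustre, i.e. something already mounted at, or conflicting with, the kubelet target path. A quick way to look for stale or leftover Lustre mounts on the node is sketched below; these are generic Linux commands, not part of the driver, and the pod UID and PVC name are the ones from the events above.

# Target path taken from the FailedMount event above
TARGET=/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount

# Is something already mounted at the target path itself?
findmnt --mountpoint "$TARGET"
grep "$TARGET" /proc/mounts

# List every Lustre mount on the node; stale entries left behind by
# deleted pods show up here
grep lustre /proc/mounts

# A hung Lustre mount often makes this block or fail
stat "$TARGET"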

Things to note:

  • Once this happens on a node, multiple pods are affected, usually with similar creation times.
  • The same issue happens on other nodes, with different Lustre volumes involved (at different times).
  • The same Lustre volume is mounted perfectly fine on other nodes.
  • Not all Lustre volumes on the affected node have the issue; some are mounted properly.
  • The node does not recover from this situation; the affected Lustre volumes can no longer be mounted by any pod on the node.

What did you expect to happen? I expect the pods stuck in ContainerCreating to eventually be able to mount the volume, or, failing that, a way to identify that a node is not healthy “Lustre-wise”.
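
To illustrate what I mean by “Lustre-wise” health, something along these lines could flag the node. This is only a rough sketch, not something the driver provides today: the filesystem DNS name and mount name below are the redacted values from the events above, and the cordon step assumes the node name matches the host's FQDN.

# Try a throwaway mount of the filesystem the node is failing on.
# If it does not mount, stop scheduling new Lustre workloads on this node.
FSX_DNS="<lustre-fs-id>.fsx.eu-west-1.amazonaws.com"
MOUNT_NAME="rtv3bbmv"
PROBE_DIR=$(mktemp -d /tmp/lustre-probe-XXXXXX)

if mount -t lustre -o flock "${FSX_DNS}@tcp:/${MOUNT_NAME}" "$PROBE_DIR"; then
  umount "$PROBE_DIR"
else
  # Could also taint instead of cordon, e.g. lustre-health=unhealthy:NoSchedule
  # (assumes the node name matches the output of hostname -f)
  kubectl cordon "$(hostname -f)"
fi
rmdir "$PROBE_DIR"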

How to reproduce it (as minimally and precisely as possible)? I don’t have a reproduction recipe; it just happens randomly. There is https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/213, which has a similar error message, but I am not sure it is the same issue, as my case does not seem reproducible.

Anything else we need to know?: For context, the manifests of the StorageClass and the PVC are below, although, as I said, it happens with both statically and dynamically provisioned Lustre volumes. Also attached are the logs of the driver and the kubelet at the time of the issue: fsx-driver-kubelet.csv

StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-scratch-sc
mountOptions:
- flock
parameters:
  deploymentType: SCRATCH_2
  securityGroupIds: sg-0b7e22c1b83b54b18
  subnetId: subnet-0b2314c14de6a3ea8
provisioner: fsx.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: fsx.csi.aws.com
  labels:
    app-name: executor
  name: executor-483c6798-pvc
  namespace: workflows
  resourceVersion: "444522977"
  uid: abf3d9ca-ab58-4d82-a2a9-d79768cd038e
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 120000Gi
  storageClassName: fsx-scratch-sc
  volumeMode: Filesystem
  volumeName: pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 120000Gi
  phase: Bound

Environment

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: v0.8.1 installed via Helm Chart v1.4.1
  • Sidecars of the driver:
  sidecars:
    livenessProbe:
      image:
        tag: v2.3.0-eks-1-20-11
    nodeDriverRegistrar:
      image:
        tag: v2.2.0-eks-1-20-11
    provisioner:
      image:
        tag: v2.2.2-eks-1-20-11
    resizer:
      image:
        tag: v1.2.0-eks-1-20-11

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 31 (17 by maintainers)

Most upvoted comments

@kanor1306 Ah I see. I’m going to close this issue since this is no longer an issue for you. @colearendt please feel free to open a new issue with logs/information regarding the specific issue you’re seeing