aws-fsx-csi-driver: Random "Unable to attach or mount volumes"
/kind bug
What happened?
Pods get stuck in ContainerCreating and nodes are rendered useless, seemingly at random.
We have been seeing this happen occasionally for a couple of months (possibly longer, since we started using Lustre, but it only seems to appear when there is a lot going on). It is quite disruptive in production systems.
The timeline of the issue is:
- A new pod, podX, gets created; it mounts multiple Lustre volumes, both statically and dynamically provisioned.
- podX is scheduled to a node that has been working fine.
- podX gets stuck in ContainerCreating status. The events in the pod are these (Lustre volume IDs and IPs redacted):
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 42m (x94 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23714 vol-23715 workflow-token-csshf vol-23713]: timed out waiting for the condition
Warning FailedMount 17m (x105 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23715 workflow-token-csshf vol-23713 vol-23714]: timed out waiting for the condition
Warning FailedMount 13m (x71 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal MountVolume.SetUp failed for volume "pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e" : rpc error: code = Internal desc = Could not mount "<lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv" at "/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount": mount failed: exit status 17
Mounting command: mount
Mounting arguments: -t lustre -o flock <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount
Output: mount.lustre: mount <lustre-fs-id>.fsx.eu-west-1.amazonaws.com@tcp:/rtv3bbmv at /var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount failed: File exists
Warning FailedMount 8m1s (x85 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[workflow-token-csshf vol-23713 vol-23714 vol-23715]: timed out waiting for the condition
Warning FailedMount 77s (x233 over 19h) kubelet, ip-x-x-x-x.eu-west-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[vol-23713], unattached volumes=[vol-23713 vol-23714 vol-23715 workflow-token-csshf]: timed out waiting for the condition
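On the "exit status 17 / File exists" part: errno 17 is EEXIST, so my reading is that the kernel already has something (perhaps a stale client) mounted at, or conflicting with, the target path, but I have not been able to confirm that on a live node yet. Next time it happens I plan to check with something along these lines (assuming SSH access to the node; the path is just the one from the error above):

# Run on the affected node; TARGET is the kubelet target path from the error above.
TARGET=/var/lib/kubelet/pods/2d0d102e-e368-4b6a-aa6a-1c49a167cd97/volumes/kubernetes.io~csi/pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e/mount

# Which Lustre mounts does the kernel currently know about?
grep lustre /proc/mounts

# Is the target path already a mount point, and for which device?
findmnt "$TARGET"

# Does the mount still respond, or does it hang? A statfs that never returns
# would point at a dead/stale Lustre client mount.
timeout 10 stat -f "$TARGET"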
Things to note:
- Once this happens on a node, multiple pods are affected, usually with similar creation times.
- The same issue happens on other nodes, with different Lustre volumes involved (at different times).
- The same Lustre volume mounts perfectly fine on other nodes.
- Not all Lustre volumes on the affected node have the issue; some are mounted properly.
- The node does not recover from this situation: the affected Lustre volumes can no longer be mounted by any pod on that node.
What you expected to happen?
I expect that at some point the pods stuck in ContainerCreating become able to mount the volume, or otherwise to have a way to identify that a node is not healthy “lustre-wise”.
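To illustrate what I mean by “lustre-wise” health, below is a rough sketch of the kind of node-level check I have in mind. The DaemonSet / node-problem-detector wiring is purely my assumption, not something the driver provides today:

#!/bin/sh
# Hypothetical node-level check (my own sketch, e.g. to be run from a
# DaemonSet or a node-problem-detector custom plugin): flag the node if
# any Lustre mount point no longer answers statfs within a timeout.
status=0
for mnt in $(awk '$3 == "lustre" {print $2}' /proc/mounts); do
  if ! timeout 10 stat -f "$mnt" >/dev/null 2>&1; then
    echo "unhealthy lustre mount: $mnt"
    status=1
  fi
done
exit $status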
How to reproduce it (as minimally and precisely as possible)?
I don't have a reproduction recipe; it just happens randomly. There is https://github.com/kubernetes-sigs/aws-fsx-csi-driver/issues/213, which has a similar error message, but I am not sure it is the same issue, as in my case it does not seem reproducible.
Anything else we need to know?:
For context, here are the manifests of the StorageClass and the PVC, although as I said the issue happens with both statically and dynamically provisioned Lustre volumes. I have also attached the logs of the driver and the kubelet at the time of the issue: fsx-driver-kubelet.csv
StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-scratch-sc
mountOptions:
  - flock
parameters:
  deploymentType: SCRATCH_2
  securityGroupIds: sg-0b7e22c1b83b54b18
  subnetId: subnet-0b2314c14de6a3ea8
provisioner: fsx.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: fsx.csi.aws.com
  labels:
    app-name: executor
  name: executor-483c6798-pvc
  namespace: workflows
  resourceVersion: "444522977"
  uid: abf3d9ca-ab58-4d82-a2a9-d79768cd038e
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 120000Gi
  storageClassName: fsx-scratch-sc
  volumeMode: Filesystem
  volumeName: pvc-abf3d9ca-ab58-4d82-a2a9-d79768cd038e
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 120000Gi
  phase: Bound
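And, for completeness, the statically provisioned volumes are plain PersistentVolume objects for the same driver. The sketch below is not one of our real manifests (the file system ID, DNS name and mount name are placeholders), just the shape we use, roughly following the driver's static provisioning example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-static-pv                      # placeholder name
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  mountOptions:
    - flock
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0     # placeholder FSx for Lustre file system ID
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.eu-west-1.amazonaws.com
      mountname: abcd1234                  # placeholder; the file system's MountName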
Environment
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-19T08:38:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: v0.8.1, installed via Helm chart v1.4.1
- Sidecars of the driver:
sidecars:
  livenessProbe:
    image:
      tag: v2.3.0-eks-1-20-11
  nodeDriverRegistrar:
    image:
      tag: v2.2.0-eks-1-20-11
  provisioner:
    image:
      tag: v2.2.2-eks-1-20-11
  resizer:
    image:
      tag: v1.2.0-eks-1-20-11
About this issue
- State: closed
- Created 2 years ago
- Comments: 31 (17 by maintainers)
@kanor1306 Ah I see. I’m going to close this issue since this is no longer an issue for you. @colearendt please feel free to open a new issue with logs/information regarding the specific issue you’re seeing