aws-ebs-csi-driver: MountVolume.SetUp failed for volume when quickly re-attaching to same node

/kind bug

What happened? One of our setups includes a pod with an EBS volume that we mount using the aws-ebs-csi-driver, installed via helm.
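For reference, the driver was installed with the standard chart, roughly like this (a sketch of the usual helm commands, not our exact values):

# Official chart repository for the driver
$ helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
$ helm repo update
# Install/upgrade the driver into kube-system
$ helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver --namespace kube-system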

At some point we terminate the pod, and sometimes we bring it back up very quickly, using the same volume.

This has a pretty high failure rate, with the following events in the pod:

  Normal   Scheduled    3m59s              default-scheduler  Successfully assigned ci-1408605891-e2e-test-full-raftt-317dae9118d34e6d64f4f960cb789/sandcastle to ip-10-2-8-177.eu-central-1.compute.internal
  Warning  FailedMount  117s               kubelet            Unable to attach or mount volumes: unmounted volumes=[raftt-mount], unattached volumes=[docker-socket kube-api-access-56xmb raftt-mount]: timed out waiting for the condition
  Warning  FailedMount  112s (x9 over 4m)  kubelet            MountVolume.SetUp failed for volume "raftt-pv-vol-0987ddb5694159fb5" : rpc error: code = Internal desc = Could not mount "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount" at "/var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount
Output: mount: /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount does not exist.
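With a bind mount, "special device ... does not exist" means the source path of the bind, i.e. the globalmount staging directory, is missing on the node. That can be checked directly on the node (a diagnostic sketch, assuming shell access to the node, e.g. via SSH/SSM):

# Does the staging directory exist at all?
$ ls -ld /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount
# Is the EBS device actually mounted there (NodeStageVolume should have done this)?
$ findmnt /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount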

Gathering logs from the ebs-csi-node pod:

$ kubectl -n kube-system logs  ebs-csi-node-lsxb2 -c ebs-plugin
...
E1101 16:05:04.722735       1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not mount "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount" at "/var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount
Output: mount: /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount does not exist.
E1101 16:05:36.812865       1 mount_linux.go:179] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount
Output: mount: /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount does not exist.

E1101 16:05:36.812926       1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not mount "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount" at "/var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount
Output: mount: /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount does not exist.
E1101 16:06:40.848729       1 mount_linux.go:179] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount
Output: mount: /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount does not exist.

E1101 16:06:40.848788       1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not mount "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount" at "/var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount
Output: mount: /var/lib/kubelet/pods/7ff84ca5-fe5e-4734-a98f-f594d16533ad/volumes/kubernetes.io~csi/raftt-pv-vol-0987ddb5694159fb5/mount: special device /var/lib/kubelet/plugins/kubernetes.io/csi/pv/raftt-pv-vol-0987ddb5694159fb5/globalmount does not exist.
...

From the AWS console I see that the volume is attached at /dev/xvdbo, and on the node:

sh-4.2# ls -la xvdbo
lrwxrwxrwx 1 root root 8 Nov  1 15:54 xvdbo -> nvme28n1

Looking at lsblk:

sh-4.2# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1       259:0    0  300G  0 disk
|-nvme0n1p1   259:1    0  300G  0 part /
`-nvme0n1p128 259:2    0    1M  0 part
nvme2n1       259:4    0    5G  0 disk /var/lib/kubelet/pods/e2a8aaef-1831-4caf-8e82-b5b1fc5ed3d0
nvme4n1       259:6    0    5G  0 disk /var/lib/kubelet/pods/8c66a6b3-cee6-44c9-b7d0-61c3c83a2e78
nvme5n1       259:7    0    5G  0 disk /var/lib/kubelet/pods/d5830c4f-4a8a-4472-a22e-213225897150
nvme8n1       259:10   0    5G  0 disk /var/lib/kubelet/pods/7a2b5d4d-35b5-4385-a944-73419bb343b0
nvme10n1      259:12   0    5G  0 disk /var/lib/kubelet/pods/2d449b0b-7f11-4d57-a63f-4865409f52e5
nvme12n1      259:14   0    5G  0 disk /var/lib/kubelet/pods/1a2c830f-5228-4a63-bc70-b33c3047c992
nvme15n1      259:17   0    5G  0 disk /var/lib/kubelet/pods/c2bef48a-ba7f-4b8b-b1ae-953f69fa0818
nvme18n1      259:20   0    5G  0 disk /var/lib/kubelet/pods/39744f99-e543-4679-ba31-edcda1f3e4c2
nvme19n1      259:21   0    5G  0 disk /var/lib/kubelet/pods/211d6610-e0e0-4116-a6dc-3b744b3eb80a
nvme21n1      259:23   0    5G  0 disk /var/lib/kubelet/pods/dd610ee4-279a-4133-bf8e-017dc6301d6c
nvme22n1      259:24   0    5G  0 disk /var/lib/kubelet/pods/bb6c154d-2d28-4271-a315-3b7dd9fa5daf
nvme25n1      259:27   0    5G  0 disk /var/lib/kubelet/pods/34d02041-e9b2-41c0-b494-98bf26b475cb
nvme28n1      259:30   0    5G  0 disk /var/lib/kubelet/pods/d3dab24e-5e47-47e9-8348-0c8040680cd1

So /dev/nvme28n1 is still mounted for a different pod, maybe the previous incarnation of this one?
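To double-check which EBS volume is actually behind that device, the volume ID is exposed as the NVMe serial number (a quick check on the node; lsblk's SERIAL column shows the EBS volume ID without the dash):

# Confirm which EBS volume backs the device that holds the stale mount
sh-4.2# lsblk -o NAME,SERIAL,MOUNTPOINT /dev/nvme28n1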

What you expected to happen? I expected the volume to be mounted into the new pod.

How to reproduce it (as minimally and precisely as possible)? I believe a Kubernetes cluster with the ebs-csi-driver and a pod using a specific EBS-backed volume should be enough: delete the pod and immediately re-create it with the same volume.

I will try to create a simple recreation soon.
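In the meantime, this is roughly the loop I have in mind (a sketch with hypothetical names; it assumes an EBS-backed PV/PVC already exists and pod.yaml is a pod spec that mounts that PVC):

# Delete the pod and immediately re-create it with the same volume, repeatedly
$ for i in $(seq 1 20); do
    kubectl delete pod sandcastle --wait=true
    kubectl apply -f pod.yaml   # same spec, same PVC / EBS volume
    kubectl wait --for=condition=Ready pod/sandcastle --timeout=5m || break
  done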

Anything else we need to know?:

Environment

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: Installed using helm:
aws-ebs-csi-driver	kube-system	9       	2021-09-12 19:38:38.386405 +0300 IDT	deployed	aws-ebs-csi-driver-2.1.1	1.2.1

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 15 (4 by maintainers)

Most upvoted comments

@vdhanan I encountered it when updating a StatefulSet, which replaced the Pod on the same node (in that case, the EBS volume was already attached to that node).

Experienced the same with the termination grace period at the default 30s.