kubernetes: NodeUnstageVolume not called because unmounter fails when vol_data.json is deleted

What happened:

The upcoming csi-driver-host-path release v1.7.0 will have a check in DeleteVolume that returns an error when the volume is still attached, staged, or published (https://github.com/kubernetes-csi/csi-driver-host-path/pull/260).
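For context, the guard looks roughly like the following Go sketch (hypothetical in-memory state tracking; this is not the actual code from that PR):

```go
package hostpath

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// volumeState is hypothetical bookkeeping; the real driver tracks
// attachment, staging and publishing differently.
type volumeState struct {
	Attached  bool
	Staged    map[string]bool // staging target path -> staged
	Published map[string]bool // publish target path -> published
}

type hostPathDriver struct {
	volumes map[string]*volumeState
}

// DeleteVolume refuses to delete a volume that is still in use, so a
// missing NodeUnstageVolume surfaces as a controller-side error instead
// of the volume being deleted out from under the node.
func (d *hostPathDriver) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	vol, ok := d.volumes[req.GetVolumeId()]
	if !ok {
		// Idempotent: deleting a volume that no longer exists succeeds.
		return &csi.DeleteVolumeResponse{}, nil
	}
	if vol.Attached || len(vol.Staged) > 0 || len(vol.Published) > 0 {
		return nil, status.Errorf(codes.FailedPrecondition,
			"volume %q is still in use: attached=%v staged=%d published=%d",
			req.GetVolumeId(), vol.Attached, len(vol.Staged), len(vol.Published))
	}
	delete(d.volumes, req.GetVolumeId())
	return &csi.DeleteVolumeResponse{}, nil
}
```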

Some of the CI jobs in the csi-driver-host-path repo that run against Kubernetes 1.21.0 are failing because NodeUnstageVolume is never called, which causes DeleteVolume to fail repeatedly until the test times out.

One example:

volume ID 3f3a132b-b25c-11eb-9aba-a6b5a7ade690 in:

Corresponds to pvc-c0971c32-17c3-427f-a606-1bbf893caf89 in:

There’s one kubelet error that seems relevant:

May 11 13:30:23 csi-prow-worker2 kubelet[247]: E0511 13:30:23.786580     247 reconciler.go:193] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"test-volume\" (UniqueName: \"kubernetes.io/csi/hostpath.csi.k8s.io^ca29f6e4-b25c-11eb-8da9-1e598a3983d2\") pod \"a6441c77-7c03-4d33-8a80-0a37d895f8a9\" (UID: \"a6441c77-7c03-4d33-8a80-0a37d895f8a9\") : UnmountVolume.NewUnmounter failed for volume \"test-volume\" (UniqueName: \"kubernetes.io/csi/hostpath.csi.k8s.io^ca29f6e4-b25c-11eb-8da9-1e598a3983d2\") pod \"a6441c77-7c03-4d33-8a80-0a37d895f8a9\" (UID: \"a6441c77-7c03-4d33-8a80-0a37d895f8a9\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/a6441c77-7c03-4d33-8a80-0a37d895f8a9/volumes/kubernetes.io~csi/pvc-c0971c32-17c3-427f-a606-1bbf893caf89/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/a6441c77-7c03-4d33-8a80-0a37d895f8a9/volumes/kubernetes.io~csi/pvc-c0971c32-17c3-427f-a606-1bbf893caf89/vol_data.json]: open /var/lib/kubelet/pods/a6441c77-7c03-4d33-8a80-0a37d895f8a9/volumes/kubernetes.io~csi/pvc-c0971c32-17c3-427f-a606-1bbf893caf89/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"test-volume\" (UniqueName: \"kubernetes.io/csi/hostpath.csi.k8s.io^ca29f6e4-b25c-11eb-8da9-1e598a3983d2\") pod \"a6441c77-7c03-4d33-8a80-0a37d895f8a9\" (UID: \"a6441c77-7c03-4d33-8a80-0a37d895f8a9\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/a6441c77-7c03-4d33-8a80-0a37d895f8a9/volumes/kubernetes.io~csi/pvc-c0971c32-17c3-427f-a606-1bbf893caf89/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/a6441c77-7c03-4d33-8a80-0a37d895f8a9/volumes/kubernetes.io~csi/pvc-c0971c32-17c3-427f-a606-1bbf893caf89/vol_data.json]: open /var/lib/kubelet/pods/a6441c77-7c03-4d33-8a80-0a37d895f8a9/volumes/kubernetes.io~csi/pvc-c0971c32-17c3-427f-a606-1bbf893caf89/vol_data.json: no such file or directory"
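The error boils down to the kubelet's CSI plugin needing vol_data.json to reconstruct an unmounter for the volume; once that file is gone, no unmounter can be built, so NodeUnpublishVolume/NodeUnstageVolume never get issued. A simplified Go sketch of that dependency (illustrative field and function names, not the actual kubernetes/kubernetes code):

```go
package csivolume

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// volData mirrors the kind of metadata the kubelet persists in
// vol_data.json (field names here are illustrative, not the exact
// upstream schema).
type volData struct {
	VolumeHandle string `json:"volumeHandle"`
	DriverName   string `json:"driverName"`
}

// loadVolumeData is a simplified stand-in for what NewUnmounter does:
// it must read vol_data.json in the pod's volume directory to know which
// CSI driver and volume handle to use for the unmount/unstage calls.
func loadVolumeData(volumeDir string) (*volData, error) {
	dataPath := filepath.Join(volumeDir, "vol_data.json")
	f, err := os.Open(dataPath)
	if err != nil {
		// This is the failure in the kubelet log above: with
		// vol_data.json deleted, the unmounter cannot be constructed
		// and the volume is never unpublished or unstaged.
		return nil, fmt.Errorf("failed to open volume data file [%s]: %w", dataPath, err)
	}
	defer f.Close()

	var d volData
	if err := json.NewDecoder(f).Decode(&d); err != nil {
		return nil, fmt.Errorf("failed to parse volume data file [%s]: %w", dataPath, err)
	}
	return &d, nil
}
```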

What you expected to happen:

NodeUnstageVolume should be called.

How to reproduce it (as minimally and precisely as possible):

Run CSI_PROW_KUBERNETES_VERSION=1.21.0 CSI_PROW_TESTS=parallel CSI_SNAPSHOTTER_VERSION=v4.0.0 ./.prow.sh in the csi-driver-host-path repo and hope that it fails.

Alternatively, retest in https://github.com/kubernetes-csi/csi-driver-host-path/pull/289

Anything else we need to know?:

Random observation: in the two cases that I looked at, the affected volume was published twice for different pods.

Most upvoted comments

So there are two possible fixes to investigate:

  1. Is it possible to not add volumes to the desired state of world (DSW) if the pod is terminating? cc @jingxu97
  2. Get rid of that RemoveAll call in the orphaned volume cleanup (see the sketch below this list).
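For fix 2, a minimal sketch of the direction, assuming a hypothetical cleanup helper (this is not the kubelet's actual orphaned-volume code):

```go
package cleanup

import "os"

// cleanupOrphanedPodVolumeDir sketches the idea: replace the blanket
// os.RemoveAll with os.Remove, which refuses to delete a non-empty
// directory. If vol_data.json (or the mount subdirectory) is still
// present, the directory is left intact, the unmounter can still be
// constructed on the next reconcile, and NodeUnstageVolume gets its
// chance to run.
func cleanupOrphanedPodVolumeDir(volumeDir string) error {
	// os.RemoveAll(volumeDir) // <- the problematic call: wipes vol_data.json
	return os.Remove(volumeDir) // fails with ENOTEMPTY while teardown is pending
}
```

The point of the change is that cleanup then only succeeds once the unmount/unstage path has actually emptied the directory, instead of deleting the metadata that path depends on.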