kubernetes: kubelet fails with `UnmountVolume.NewUnmounter failed for volume` and `vol_data.json: no such file or directory` for CSI volumes and floods the logs

What happened?

I think kubelet is sometimes unable to recognize that a volume has already been unmounted and to clean up its actualStateOfWorld cache. It believes the volume is still mounted and keeps trying to unmount it. This produces a large amount of log output that floods journalctl -> fluentd/fluent-bit -> our third-party logging solution. The flooding causes 2 problems:

  1. Consumes node disk, which reduces the amount of disk available for pods
  2. Increases the ingestion rate for our paid third-party logging solution, thereby costing extra money

1 and 2 can be worked around by rotating the logs more frequently, dropping the logs, or filtering them out from ingestion into the third-party logging solution, but that is not a permanent fix. Also, note that restarting kubelet fixes the issue. Here’s what the error log looks like:
"Feb 13 22:05:35 ip-10-57-71-57 kubelet: E0213 22:05:35.753487   28558 reconciler.go:193] \"operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \\\"drupal-code\\\" (UniqueName: \\\"kubernetes.io/csi/48033fa8-05f1-4c32-b109-4e25f23107ef-drupal-code\\\") pod \\\"48033fa8-05f1-4c32-b109-4e25f23107ef\\\" (UID: \\\"48033fa8-05f1-4c32-b109-4e25f23107ef\\\") : UnmountVolume.NewUnmounter failed for volume \\\"drupal-code\\\" (UniqueName: \\\"kubernetes.io/csi/48033fa8-05f1-4c32-b109-4e25f23107ef-drupal-code\\\") pod \\\"48033fa8-05f1-4c32-b109-4e25f23107ef\\\" (UID: \\\"48033fa8-05f1-4c32-b109-4e25f23107ef\\\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/48033fa8-05f1-4c32-b109-4e25f23107ef/volumes/kubernetes.io~csi/drupal-code/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/48033fa8-05f1-4c32-b109-4e25f23107ef/volumes/kubernetes.io~csi/drupal-code/vol_data.json]: open /var/lib/kubelet/pods/48033fa8-05f1-4c32-b109-4e25f23107ef/volumes/kubernetes.io~csi/drupal-code/vol_data.json: no such file or directory\" err=\"UnmountVolume.NewUnmounter failed for volume \\\"drupal-code\\\" (UniqueName: \\\"kubernetes.io/csi/48033fa8-05f1-4c32-b109-4e25f23107ef-drupal-code\\\") pod \\\"48033fa8-05f1-4c32-b109-4e25f23107ef\\\" (UID: \\\"48033fa8-05f1-4c32-b109-4e25f23107ef\\\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/48033fa8-05f1-4c32-b109-4e25f23107ef/volumes/kubernetes.io~csi/drupal-code/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/48033fa8-05f1-4c32-b109-4e25f23107ef/volumes/kubernetes.io~csi/drupal-code/vol_data.json]: open /var/lib/kubelet/pods/48033fa8-05f1-4c32-b109-4e25f23107ef/volumes/kubernetes.io~csi/drupal-code/vol_data.json: no such file or directory\"

If you look at the code for 1.22, you can see where this log line is output here.
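For context, here is a simplified sketch (in Go, but not the actual kubelet source; helper names and the hard-coded path are illustrative only) of what the CSI plugin roughly does when building an unmounter: it tries to read vol_data.json from the per-pod volume directory, and if that file is already gone, unmounter construction fails and the reconciler retries on every sync loop, producing the flood above.

// Simplified approximation of how the kubelet CSI plugin loads volume
// metadata before unmounting. Illustrative only, not the real source.
package main

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
)

// loadVolumeData reads vol_data.json from the per-pod CSI volume directory.
// If the file was already removed by an earlier (successful) teardown, this
// returns "no such file or directory" and no unmounter can be constructed.
func loadVolumeData(dataDir, fileName string) (map[string]string, error) {
    dataFilePath := filepath.Join(dataDir, fileName)

    f, err := os.Open(dataFilePath)
    if err != nil {
        return nil, fmt.Errorf("failed to open volume data file [%s]: %w", dataFilePath, err)
    }
    defer f.Close()

    data := map[string]string{}
    if err := json.NewDecoder(f).Decode(&data); err != nil {
        return nil, fmt.Errorf("failed to parse volume data file [%s]: %w", dataFilePath, err)
    }
    return data, nil
}

func main() {
    // Kubelet keeps CSI volume metadata at:
    // /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<vol-name>/vol_data.json
    dir := "/var/lib/kubelet/pods/48033fa8-05f1-4c32-b109-4e25f23107ef/volumes/kubernetes.io~csi/drupal-code"
    if _, err := loadVolumeData(dir, "vol_data.json"); err != nil {
        // The reconciler retries the unmount on every iteration, so this
        // error is logged over and over -- hence the log flood.
        fmt.Println("UnmountVolume.NewUnmounter failed:", err)
    }
}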

A couple of things to note here:

  1. We have seen this happen a lot for short-lived pods.
  2. We have already looked at the following issues/PRs but they don’t solve the problem
    1. https://github.com/kubernetes/kubernetes/issues/101378
    2. https://github.com/kubernetes/kubernetes/issues/101911
    3. https://github.com/kubernetes/kubernetes/issues/85280
    4. https://github.com/kubernetes/kubernetes/pull/102576 was backported to 1.21 and 1.22 but does not fix the problem
  3. We have seen this problem with kubelet versions 1.21 and 1.22
  4. The volume is mounted using https://github.com/warm-metal/csi-driver-image
  5. Because kubelet thinks the volume is still mounted, it does not cleanly tear down the pod. The pod gets stuck in the Terminating state because not all of the volumes attached to the pod have been unmounted yet.
  6. Note that the directory structure of the CSI volume is present; only vol_data.json is absent (see the small check after this list).
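A small hedged Go helper like the one below can be run on a node to list CSI volume directories that still exist but no longer contain vol_data.json. It assumes the default /var/lib/kubelet root; adjust for distributions that relocate it (e.g., microk8s uses /var/snap/microk8s/common/var/lib/kubelet).

// Diagnostic helper: list per-pod CSI volume directories that still exist
// but no longer contain vol_data.json. Illustrative only; adjust the
// kubelet root directory for your distribution.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    kubeletRoot := "/var/lib/kubelet" // assumption: default kubelet root dir
    pattern := filepath.Join(kubeletRoot, "pods", "*", "volumes", "kubernetes.io~csi", "*")

    volDirs, err := filepath.Glob(pattern)
    if err != nil {
        fmt.Fprintln(os.Stderr, "glob failed:", err)
        os.Exit(1)
    }

    for _, dir := range volDirs {
        if _, err := os.Stat(filepath.Join(dir, "vol_data.json")); os.IsNotExist(err) {
            // Directory structure survived, but the metadata file is gone --
            // exactly the state that makes NewUnmounter fail.
            fmt.Println("missing vol_data.json:", dir)
        }
    }
}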

What did you expect to happen?

kubelet to not get stuck trying to unmount a volume

How can we reproduce it (as minimally and precisely as possible)?

We have tried reproducing this issue but it is quite difficult to reproduce manually. So far we have only been able to observe it in our logs after it has happened.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:33:37Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.16-eks-ffeb93d", GitCommit:"52e500d139bdef42fbc4540c357f0565c7867a81", GitTreeState:"clean", BuildDate:"2022-11-29T18:41:42Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Docker

Related plugins (CNI, CSI, …) and versions (if applicable)

CSI plugin: https://github.com/warm-metal/csi-driver-image

About this issue

  • State: open
  • Created a year ago
  • Comments: 34 (26 by maintainers)

Most upvoted comments

Alright, one last update. If anyone is running into problems like these, make sure your CSI driver implements NodeUnpublish correctly and, at the very minimum, idempotently. This issue, IMO, is almost entirely caused by problematic CSI driver implementations.
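For driver authors, here is a minimal hedged sketch of what an idempotent NodeUnpublishVolume can look like, using the CSI spec Go bindings and golang.org/x/sys/unix for the unmount. The key property is that a repeated call for an already-unpublished target path returns success instead of an error. The nodeServer type and the unmount details are assumptions, not taken from any particular driver.

// Minimal sketch of an idempotent NodeUnpublishVolume, assuming the CSI
// spec Go bindings (github.com/container-storage-interface/spec/lib/go/csi).
// Not taken from any particular driver.
package driver

import (
    "context"
    "os"

    "github.com/container-storage-interface/spec/lib/go/csi"
    "golang.org/x/sys/unix"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

type nodeServer struct{}

func (ns *nodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
    target := req.GetTargetPath()
    if req.GetVolumeId() == "" || target == "" {
        return nil, status.Error(codes.InvalidArgument, "volume id and target path are required")
    }

    // Idempotency: if the target path is already gone, a previous call
    // (or kubelet's own cleanup) already unpublished it -- report success.
    if _, err := os.Stat(target); os.IsNotExist(err) {
        return &csi.NodeUnpublishVolumeResponse{}, nil
    }

    // Unmount; treat "not mounted" (EINVAL) and "gone" (ENOENT) as success
    // as well, so retries after a partial failure do not get stuck.
    if err := unix.Unmount(target, 0); err != nil && err != unix.EINVAL && err != unix.ENOENT {
        return nil, status.Errorf(codes.Internal, "failed to unmount %s: %v", target, err)
    }

    if err := os.RemoveAll(target); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to remove %s: %v", target, err)
    }

    return &csi.NodeUnpublishVolumeResponse{}, nil
}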

Got the same issue today, randomly out of nowhere, on a microk8s 1.27 cluster with Longhorn 1.5.1, so it’s not fixed yet. journalctl keeps flooding with this error every millisecond:

Sep 06 20:56:56 microk8s.daemon-kubelite[4155471]: E0906 20:56:56.044356 4155471 reconciler_common.go:169] “operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume "pvc-ce65fb13-5ebf-4bd3-a202-b698de05c261" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-ce65fb13-5ebf-4bd3-a202-b698de05c261") pod "5fcddea2-77a4-475f-9a4c-a62f5243b353" (UID: "5fcddea2-77a4-475f-9a4c-a62f5243b353") : UnmountVolume.NewUnmounter failed for volume "pvc-ce65fb13-5ebf-4bd3-a202-b698de05c261" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-ce65fb13-5ebf-4bd3-a202-b698de05c261") pod "5fcddea2-77a4-475f-9a4c-a62f5243b353" (UID: "5fcddea2-77a4-475f-9a4c-a62f5243b353") : kubernetes.io/csi: unmounter failed to load volume data file [/var/snap/microk8s/common/var/lib/kubelet/pods/5fcddea2-77a4-475f-9a4c-a62f5243b353/volumes/kubernetes.io~csi/pvc-ce65fb13-5ebf-4bd3-a202-b698de05c261/mount]: kubernetes.io/csi: failed to open volume data file [/var/snap/microk8s/common/var/lib/kubelet/pods/5fcddea2-77a4-475f-9a4c-a62f5243b353/volumes/kubernetes.io~csi/pvc-ce65fb13-5ebf-4bd3-a202-b698de05c261/vol_data.json]: open /var/snap/microk8s/common/var/lib/kubelet/pods/5fcddea2-77a4-475f-9a4c-a62f5243b353/volumes/kubernetes.io~csi/pvc-ce65fb13-5ebf-4bd3-a202-b698de05c261/vol_data.json: no such file or directory”

I had a Prometheus pod stuck in "Terminating". I managed to force delete it, but the errors continue. How do I stop them?

I think these two issues are the same: there is a TOCTOU problem, and vol_data.json may be removed multiple times, which triggers this issue. We have fixed it by skipping UnmountDevice when the vol_data.json file does not exist, since in that situation UnmountDevice must already have been executed successfully, so there is no need to run UnmountDevice multiple times.
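In other words, the fix described above treats a missing vol_data.json as proof that the earlier unmount already succeeded and returns success instead of erroring out. A hedged sketch of the idea only (helper names are hypothetical; this is not the actual Kubernetes patch):

// Hedged sketch of the approach described above; helper names are
// hypothetical and this is not the actual Kubernetes change.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// unmountDeviceIfNeeded skips the unmount when vol_data.json is already gone.
func unmountDeviceIfNeeded(deviceMountPath string) error {
    dataFile := filepath.Join(deviceMountPath, "vol_data.json")

    if _, err := os.Stat(dataFile); os.IsNotExist(err) {
        // TOCTOU window: an earlier/concurrent teardown already removed the
        // metadata file, so the device must have been unmounted. Returning
        // nil lets the reconciler forget the volume instead of retrying
        // (and logging) forever.
        fmt.Println("vol_data.json not found, assuming device already unmounted:", deviceMountPath)
        return nil
    }

    // Otherwise fall through to the normal unmount path.
    return realUnmountDevice(deviceMountPath)
}

// realUnmountDevice stands in for the actual device unmount logic.
func realUnmountDevice(deviceMountPath string) error {
    // ... unmount the device and remove deviceMountPath ...
    return nil
}

func main() {
    _ = unmountDeviceIfNeeded("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<vol-name>")
}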

@cvvz, @Chaunceyctx, does it look like a duplicate of #114207 ? Or something very similar, now in SetUp / TearDown instead of MountDevice / UnmountDevice?

These two problems look similar. The TearDown operation may be performed multiple times. I will take a look if I have time LOL.

@HirazawaUi @vadasambar I saw a similar issue on an OpenShift OCP cluster too: https://issues.redhat.com/browse/OCPBUGS-11470. Versions:

Server Version: 4.13.0-rc.2
Kubernetes Version: v1.26.2+dc93b13

and

Server Version: 4.12.0
Kubernetes Version: v1.25.4+77bec7a

AWS with the ebs.csi.aws.com CSI driver:

 % oc get csidriver
NAME              ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
ebs.csi.aws.com   true             false            false             <unset>         false               Persistent   5d3h

How can we reproduce it (as minimally and precisely as possible)? I encountered this in a long-running test. The test has 15 concurrent users who continuously create a namespace, create a pod and a PVC in that namespace, and then delete the namespace. The issue appeared about 4-5 days after I started the long run. It happened very rarely, once or twice in a 5-day run, which involved 10k+ namespace creations/deletions in total.
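For anyone trying to reproduce the race at scale, here is a hedged sketch of that kind of churn loop using client-go. The storage class name, container image, timings, and object names are assumptions, and the API types match client-go around v0.26; it is a sketch of the test shape, not the exact test that was run.

// Hedged sketch of a namespace/pod/PVC churn loop, similar in spirit to the
// long-run test described above.
package main

import (
    "context"
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func churn(ctx context.Context, cs *kubernetes.Clientset, worker, iteration int) error {
    name := fmt.Sprintf("churn-%d-%d", worker, iteration)
    storageClass := "gp3" // assumption: any dynamically provisioned CSI storage class

    // 1. Create the namespace.
    ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
    if _, err := cs.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{}); err != nil {
        return err
    }

    // 2. Create a PVC in it.
    pvc := &corev1.PersistentVolumeClaim{
        ObjectMeta: metav1.ObjectMeta{Name: "data", Namespace: name},
        Spec: corev1.PersistentVolumeClaimSpec{
            AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
            StorageClassName: &storageClass,
            Resources: corev1.ResourceRequirements{
                Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")},
            },
        },
    }
    if _, err := cs.CoreV1().PersistentVolumeClaims(name).Create(ctx, pvc, metav1.CreateOptions{}); err != nil {
        return err
    }

    // 3. Create a short-lived pod that mounts the PVC.
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "short-lived", Namespace: name},
        Spec: corev1.PodSpec{
            RestartPolicy: corev1.RestartPolicyNever,
            Containers: []corev1.Container{{
                Name:         "busy",
                Image:        "busybox:1.36",
                Command:      []string{"sh", "-c", "sleep 5"},
                VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
            }},
            Volumes: []corev1.Volume{{
                Name: "data",
                VolumeSource: corev1.VolumeSource{
                    PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "data"},
                },
            }},
        },
    }
    if _, err := cs.CoreV1().Pods(name).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
        return err
    }

    // 4. Give the pod a moment to start mounting, then delete the whole
    // namespace so volume teardown races with pod termination.
    time.Sleep(10 * time.Second)
    return cs.CoreV1().Namespaces().Delete(ctx, name, metav1.DeleteOptions{})
}

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // 15 concurrent workers churning namespaces, as in the test above.
    for w := 0; w < 15; w++ {
        go func(worker int) {
            for i := 0; ; i++ {
                if err := churn(context.Background(), cs, worker, i); err != nil {
                    fmt.Println("worker", worker, "error:", err)
                }
            }
        }(w)
    }
    select {} // block forever
}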

This bug is similar to a previous bug: https://bugzilla.redhat.com/show_bug.cgi?id=2038780 https://github.com/kubernetes/kubernetes/pull/110670

Note that when I say it’s hard to reproduce manually, I mean it’s hard to get into a state where vol_data.json is deleted automatically. You can easily reproduce the error message above using the following steps:

  1. Create a pod with CSI volume
  2. Log into the node and go to the pod directory e.g., cd /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<vol-name>
  3. Remove vol_data.json using rm vol_data.json
  4. You should start seeing the above error in kubelet logs

I have reproduced this error in a local cluster running version 1.26.3 by following the steps mentioned above. However, I am uncertain whether this error will be triggered when kubelet is functioning normally. I will continue to monitor it. Starting from a particular issue may also help me better understand the kubelet source code 😃

/assign
I’ll have a look