kubernetes: Orphaned pod found, but error not a directory occurred when trying to remove the volumes dir

What happened:

One of our nodes had a hard reboot. Afterwards, the following message appeared in the kubelet logs every 2 seconds:

Oct 07 12:46:32 k8s-master2-staging kubelet[7310]: E1007 12:46:32.359145    7310 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"1d4bfc07-3469-4eaa-992f-6d23c17f3aee\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=1

Indeed, the orphaned pod directory exists and contains 1 stale volume directory with a file in it (probably explaining the “not a directory” error):

sjors@k8s-master2-staging:~$ sudo ls -la /var/lib/kubelet/pods/1d4bfc07-3469-4eaa-992f-6d23c17f3aee/volumes/kubernetes.io~csi/pvc-13c81b28-4038-40d5-b6e8-4194e1d7be0e
total 12
drwxr-x--- 2 root root 4096 Oct  7 12:37 .
drwxr-x--- 3 root root 4096 Oct  2 19:37 ..
-rw-r--r-- 1 root root  270 Oct  7 12:37 vol_data.json
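
To check whether other pod directories on the node are affected in the same way, here is a quick, illustrative look for leftover vol_data.json files under the per-pod CSI volume directories (run as root since the pods directory is not world-readable):

sudo find /var/lib/kubelet/pods -path '*/volumes/kubernetes.io~csi/*/vol_data.json'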

Manually deleting this file resolves the kubelet error on the next cleanup pass, 2 seconds later:

Oct 07 12:46:40 k8s-master2-staging kubelet[7310]: I1007 12:46:40.359957    7310 kubelet_volumes.go:160] "Cleaned up orphaned pod volumes dir" podUID=1d4bfc07-3469-4eaa-992f-6d23c17f3aee path="/var/lib/kubelet/pods/1d4bfc07-3469-4eaa-992f-6d23c17f3aee/volumes"
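
Before removing anything by hand, it is worth confirming that a pod UID on disk is really orphaned, i.e. no longer known to the API server. A rough sketch, assuming kubectl works from the node and the node name matches $(hostname); note that static pods use a different on-disk UID than their mirror pods, so anything this prints should still be checked manually before deleting:

# Print pod UIDs that exist on disk but are unknown to the API server for this node.
comm -23 \
  <(sudo ls /var/lib/kubelet/pods | sort) \
  <(kubectl get pods --all-namespaces --field-selector spec.nodeName=$(hostname) \
      -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' | sort)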

What you expected to happen:

I know that the kubelet tries to clean up orphaned pod directories, and various issues regarding this have been fixed in the past, such as detecting stale mounts and deleting old directories (#60987 for example).

However, it looks like the cleanup fails to remove the stale volume directory when it still contains files.

How to reproduce it (as minimally and precisely as possible):

I think, though I have not tried it, that it can be reproduced with the following steps (a rough command sketch follows the list):

  1. start a Pod with a mounted PVC
  2. manually create a file inside a Pod’s volumes/kubernetes.io~csi/pvc-... directory
  3. hard-reboot the machine, so that the kubelet gets no chance to clean up the Pod directory, i.e. directory becomes orphaned
  4. (If creating the file in step 2 was not possible beforehand, it can also be done after the reboot, but before the kubelet comes up.)
  5. observe error in the Kubelet logs.
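
A rough, untested sketch of the steps above; the pod name, manifest, PVC directory, and reboot method are illustrative assumptions, not something I have run:

kubectl apply -f pod-with-pvc.yaml                               # step 1: Pod with a mounted PVC
POD_UID=$(kubectl get pod pod-with-pvc -o jsonpath='{.metadata.uid}')
# step 2: drop an extra file into the Pod's CSI volume directory on the node
sudo touch /var/lib/kubelet/pods/$POD_UID/volumes/kubernetes.io~csi/pvc-<uid>/extra-file
echo b | sudo tee /proc/sysrq-trigger                            # step 3: immediate hard reboot
# steps 4-5: after the node is back up, watch the kubelet log for the error
journalctl -u kubelet -f | grep "orphaned pod"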

Anything else we need to know?:

The CSI provider used is Hetzner Cloud (hcloud-csi-driver:1.6.0); I'm not sure whether it is responsible for creating the file.
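
If it helps with triage: the leftover file is plain JSON, so its contents can be inspected directly to see which CSI driver and volume it refers to. A quick look, assuming jq is installed on the node:

sudo jq . /var/lib/kubelet/pods/1d4bfc07-3469-4eaa-992f-6d23c17f3aee/volumes/kubernetes.io~csi/pvc-13c81b28-4038-40d5-b6e8-4194e1d7be0e/vol_data.json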

Environment:

  • Kubernetes version (use kubectl version): 1.22.1
  • Cloud provider or hardware configuration: bare-metal, 3 masters
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.3 LTS
  • Kernel (e.g. uname -a): 5.4.0-88-generic
  • Install tools: kubeadm 1.22.1

Most upvoted comments

Experienced this issue on k8s v1.25.5 running on AKS using the disk.csi.azure.com provisioner.

Manual deletion of the vol_data.json file ultimately did the trick but was not fun to troubleshoot, 0/10 would not recommend.

It happens in k8s 1.22 as well; removing the orphaned pod folders (in /var/lib/kubelet/pods/) solves this issue, since the kubelet is not able to do it automatically.

Discussed in the SIG Storage bug triage meeting. This will be addressed when we work on re-design of reconstruction logic.

Here’s a related bug: https://github.com/kubernetes/kubernetes/issues/111933

As part of SELinux feature, some of the reconstruction issues are currently addressed under an alpha feature gate: https://github.com/kubernetes/enhancements/pull/3548

We will consider separating out the reconstruction logic with a different feature gate.

Found this bug, still in:

k3s version v1.24.3+k3s1 (990ba0e8) go version go1.18.1

We are facing the same issue on a 1.22 k8s cluster

I ran into this with many MANY orphaned directories, so I made a one-line shell script to at least save me from copy/paste. It looks at the log, figures out the orphaned pod, and removes its directory. This is being used on k3os, v1.21.5. Check it yourself before running, as with all things from the internet. I take no responsibility for breaking your server!

tail /var/log/k3s-service.log | grep "orphaned pod" | awk '{print $18}' | cut -d\\ -f2 | cut -d\" -f2 | uniq | xargs -I % sh -c 'echo "deleting /var/lib/kubelet/pods/%"; rm -rf /var/lib/kubelet/pods/%;'
deleting /var/lib/kubelet/pods/4966fdcc-16db-4c16-ad75-624edfddf546

If you want to confirm the deletion first, before automating it, here’s another version:

k3os [/var/lib/kubelet/pods]# tail  /var/log/k3s-service.log | grep "orphaned pod" | awk '{print $18}' | cut -d\\ -f2 | cut -d\" -f2 | uniq | xargs -p -I % sh -c 'rm -rf %;'
sh -c 'rm -rf 2198391e-ed80-468d-b6f8-a65d3899853d;' ?...y
sh -c 'rm -rf 221b57a5-2c94-40ef-81b1-e8212612a2c0;' ?...y

And if you want to be really lazy/reckless, here's a short script which will keep running and deleting directories. Stop the script with ctrl+c when no new directories are being deleted, and check the log to confirm the errors have stopped.

k3os [~]# cat delete-orphans.sh 
#!/bin/bash
while true
do
        tail /var/log/k3s-service.log | grep "orphaned pod" | awk '{print $18}' | cut -d\\ -f2 | cut -d\" -f2 | uniq | xargs -I % sh -c 'echo "deleting /var/lib/kubelet/pods/%"; rm -rf /var/lib/kubelet/pods/%;'
        sleep 1
done

Same issue happened on v1.24.11:

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.11", GitCommit:"0f75679e3346160939924550fd3591462a4afec6", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:00Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

I’m seeing the same thing on v1.23.6 while using rook-ceph. I also occasionally use nfs-server-provisioner.