kubernetes: Pods do not get cleaned up
https://github.com/kubernetes/kubernetes/issues/28750 describes the problem for a much older Kubernetes version and is marked as fixed.
Is this a BUG REPORT or FEATURE REQUEST? (choose one): Bug Report
Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1+coreos.0", GitCommit:"9212f77ed8c169a0afa02e58dce87913c6387b3e", GitTreeState:"clean", BuildDate:"2017-04-04T00:32:53Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- 3 Physical Linux servers
- CoreOS 1353.7.0 (latest stable)
- 4.9.24-coreos
- Set up from scratch, custom OSPF networking using CNI
- Hyperkube using kubelet-wrapper
What happened: Some terminated pods are not cleaned up and stay in the Terminating state for a long time (maybe indefinitely?) because their secret volumes cannot be deleted.
Log excerpt:
May 11 20:56:07 yellow kubelet-wrapper[7595]: E0511 20:56:07.540894 7595 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/91eabb5e-336f-11e7-927a-d43d7e00dee7-default-token-86xhs\" (\"91eabb5e-336f-11e7-927a-d43d7e00dee7\")" failed. No retries permitted until 2017-05-11 20:58:07.540871959 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/91eabb5e-336f-11e7-927a-d43d7e00dee7-default-token-86xhs" (volume.spec.Name: "default-token-86xhs") pod "91eabb5e-336f-11e7-927a-d43d7e00dee7" (UID: "91eabb5e-336f-11e7-927a-d43d7e00dee7") with: rename /var/lib/kubelet/pods/91eabb5e-336f-11e7-927a-d43d7e00dee7/volumes/kubernetes.io~secret/default-token-86xhs /var/lib/kubelet/pods/91eabb5e-336f-11e7-927a-d43d7e00dee7/volumes/kubernetes.io~secret/wrapped_default-token-86xhs.deleting~739204662: device or resource busy
May 11 20:56:07 yellow kubelet-wrapper[7595]: E0511 20:56:07.540858 7595 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/6d50be26-3371-11e7-a5bf-74d435166e57-default-token-86xhs\" (\"6d50be26-3371-11e7-a5bf-74d435166e57\")" failed. No retries permitted until 2017-05-11 20:58:07.540839041 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/6d50be26-3371-11e7-a5bf-74d435166e57-default-token-86xhs" (volume.spec.Name: "default-token-86xhs") pod "6d50be26-3371-11e7-a5bf-74d435166e57" (UID: "6d50be26-3371-11e7-a5bf-74d435166e57") with: rename /var/lib/kubelet/pods/6d50be26-3371-11e7-a5bf-74d435166e57/volumes/kubernetes.io~secret/default-token-86xhs /var/lib/kubelet/pods/6d50be26-3371-11e7-a5bf-74d435166e57/volumes/kubernetes.io~secret/wrapped_default-token-86xhs.deleting~556013427: device or resource busy
May 11 20:56:07 yellow kubelet-wrapper[7595]: I0511 20:56:07.540561 7595 reconciler.go:190] UnmountVolume operation started for volume "kubernetes.io/secret/91eabb5e-336f-11e7-927a-d43d7e00dee7-default-token-86xhs" (spec.Name: "default-token-86xhs") from pod "91eabb5e-336f-11e7-927a-d43d7e00dee7" (UID: "91eabb5e-336f-11e7-927a-d43d7e00dee7").
May 11 20:56:07 yellow kubelet-wrapper[7595]: I0511 20:56:07.540462 7595 reconciler.go:190] UnmountVolume operation started for volume "kubernetes.io/secret/6d50be26-3371-11e7-a5bf-74d435166e57-default-token-86xhs" (spec.Name: "default-token-86xhs") from pod "6d50be26-3371-11e7-a5bf-74d435166e57" (UID: "6d50be26-3371-11e7-a5bf-74d435166e57").
Excerpt from `mount`:
tmpfs on /var/lib/kubelet/pods/91eabb5e-336f-11e7-927a-d43d7e00dee7/volumes/kubernetes.io~secret/default-token-86xhs type tmpfs (rw,relatime,seclabel)
Output from `fuser -vm`:
USER PID ACCESS COMMAND
/var/lib/kubelet/pods/91eabb5e-336f-11e7-927a-d43d7e00dee7/volumes/kubernetes.io~secret/default-token-86xhs:
root kernel mount /var/lib/kubelet/pods/91eabb5e-336f-11e7-927a-d43d7e00dee7/volumes/kubernetes.io~secret/default-token-86xhs
The reason they are not cleaned up is that the volume is not unmounted before being moved to the deletion area, and a directory that is a mountpoint cannot be moved with the rename() syscall, which is what Go uses internally for os.Rename.
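To make the failure mode concrete, here is a minimal standalone sketch (not kubelet code; the `/tmp/secret-volume` path is a placeholder, run as root on Linux) showing that os.Rename on a directory that is still a mountpoint fails with EBUSY and only succeeds once the mount is torn down first:

```go
// Minimal sketch, not kubelet code: renaming a directory that is still a
// mountpoint fails with "device or resource busy"; it succeeds once the
// tmpfs is unmounted first. Run as root; the paths are placeholders.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Placeholder directory standing in for a pod's secret volume, e.g.
	// /var/lib/kubelet/pods/<uid>/volumes/kubernetes.io~secret/default-token-86xhs
	src := "/tmp/secret-volume"
	dst := src + ".deleting~123456"

	if err := os.MkdirAll(src, 0700); err != nil {
		panic(err)
	}
	// Mount a tmpfs at src, like the kubelet does for secret volumes.
	if err := syscall.Mount("tmpfs", src, "tmpfs", 0, ""); err != nil {
		panic(err)
	}

	// While the tmpfs is mounted, rename(2) fails with EBUSY -- the same
	// "device or resource busy" error seen in the kubelet log above.
	if err := os.Rename(src, dst); err != nil {
		fmt.Println("rename while mounted:", err)
	}

	// Tearing the mount down first is what makes the rename succeed.
	if err := syscall.Unmount(src, 0); err != nil {
		panic(err)
	}
	if err := os.Rename(src, dst); err != nil {
		panic(err)
	}
	fmt.Println("rename after unmount: ok")
}
```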
What you expected to happen: The Pods should be cleaned up.
How to reproduce it (as minimally and precisely as possible): It happens on all three machines, so a plain CoreOS + kubelet-wrapper setup should reproduce it.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 4
- Comments: 16 (9 by maintainers)
@lorenz I had the same issue, but it's now resolved. Maybe you could try `grep -l container_id /proc/*/mountinfo` to check who's preventing your pod from terminating.
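A hedged Go equivalent of that grep (the search string and paths are assumptions; substitute the container ID or the stuck secret path from your own logs), listing the PIDs whose mount namespace still contains the mount:

```go
// Hypothetical helper: list PIDs whose /proc/<pid>/mountinfo still mentions
// a given string, equivalent to "grep -l <string> /proc/*/mountinfo".
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Assumption: search for the stuck secret volume by name; a container ID
	// works just as well.
	target := "default-token-86xhs"

	files, err := filepath.Glob("/proc/[0-9]*/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, mi := range files {
		data, err := os.ReadFile(mi)
		if err != nil {
			continue // process may have exited, or permission denied
		}
		if strings.Contains(string(data), target) {
			// /proc/<pid>/mountinfo -> <pid>
			pid := filepath.Base(filepath.Dir(mi))
			fmt.Println("mount still visible to PID", pid)
		}
	}
}
```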
@rambo45 Seeing exactly the same. Luckily CoreOS still ships Docker 1.12 if you enable it, so I'm currently running that everywhere. I have a lot of container churn, so staying on Docker 17.09 (what CoreOS ships by default) was not an option: within 24 hours I accumulated a few hundred pods stuck in Terminating. Still waiting for a proper cri-containerd so I can get rid of Docker for good.