kubernetes: Pods stuck in Terminating status (UnmountVolume.TearDown failed)

/kind bug

What happened: Many pods stick indefinitely in Terminating status after deletion, because their volumes cannot be unmounted (UnmountVolume.TearDown failed; mostly default secrets, but occasionally OpenEBS volumes as well). It may be connected with /var/lib/kubelet being symlinked to /opt/kubelet, but that's a guess based on hints in #65110. Unlike in the mentioned issue, my problem persisted after updating to 1.10.5.

What you expected to happen: Pod gets removed and secret unmounted.

How to reproduce it (as minimally and precisely as possible): Remove some pods.

Anything else we need to know?:

Symptoms so far:

  • mount, cat /proc/mounts and /etc/mtab all list the volume as mounted - sometimes even multiple times! A sample entry: tmpfs on /path-to-kubelet/pods/$POD_UID/volumes/kubernetes.io~secret/default-token-xxxxx type tmpfs (rw,relatime)
  • umount on the directory listed in /proc/mounts replies umount: /path-to-kubelet/pods/$POD_UID/volumes/kubernetes.io~secret/default-token-xxxxx: not mounted
  • rm -rf fails with Device or resource busy
  • lsof | grep $POD_UID doesn’t show any process using this path
  • docker ps -qa | xargs docker container inspect -f '{{ .Name }} {{ json .Mounts }}' | grep $POD_UID doesn’t show any container using old pod path
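The checks above can be collected into one script per stuck pod. This is a sketch, not part of the original report: check_stuck_pod is a hypothetical helper name, and the pod UID is taken from the Terminating pod's path under the kubelet root.

```shell
#!/bin/sh
# Consolidated diagnostics for one stuck pod (hypothetical helper).
check_stuck_pod() {
    pod_uid="$1"

    echo "== mount table entries =="
    grep "$pod_uid" /proc/mounts || echo "no mount entries for pod $pod_uid"

    echo "== open file handles =="
    lsof 2>/dev/null | grep "$pod_uid" || echo "no open files under pod path"

    echo "== containers still referencing the path =="
    docker ps -qa 2>/dev/null | xargs -r docker container inspect \
        -f '{{ .Name }} {{ json .Mounts }}' 2>/dev/null \
        | grep "$pod_uid" || echo "no containers reference pod $pod_uid"
}

check_stuck_pod "3a92a5bf-8422-11e8-830b-066c762652cc"
```

If all three checks come back empty while /proc/mounts still lists the volume, the mount exists only in a mount namespace the host tools cannot see, which is what the symlinked-root-dir theory would predict.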

Other stuff:

  • Kubernetes 1.9.2 (updated to 1.10.5; issue persisted)
  • Ubuntu 16.04.4 LTS
  • uname -a: Linux 4.4.0-121-generic
  • Docker 17.03.1-ce (updated to 17.03.2-ce; issue persisted)

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 7
  • Comments: 21 (5 by maintainers)

Most upvoted comments

This issue is affecting our installations as well. Setting the kubelet flag --root-dir, as mentioned in #65110, to the value referenced by our symlink from /var/lib/kubelet seems to resolve the issue.
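A minimal sketch of that workaround, assuming the same layout as in the original report (/var/lib/kubelet symlinked to /opt/kubelet; the paths are illustrative):

```shell
# /var/lib/kubelet is a symlink; resolve it to the real directory first.
readlink -f /var/lib/kubelet    # e.g. prints /opt/kubelet

# Then point kubelet at the resolved path so its volume bookkeeping
# matches the paths that actually appear in /proc/mounts:
kubelet --root-dir=/opt/kubelet ...
```

With the symlink in place but --root-dir unset, kubelet tracks mounts under /var/lib/kubelet/... while the kernel records them under the resolved /opt/kubelet/... path, so teardown looks for the wrong entries.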

We noticed this specifically on pods which use volume subPath mounts. The source was either a configMap or secret in most cases.

A recurring error in the kubelet logs manifests as device or resource busy:

nestedpendingoperations.go:267] Operation for "\"kubernetes.io/configmap/3a92a5bf-8422-11e8-830b-066c762652cc-configmap-XXXXX\" (\"3a92a5bf-8422-11e8-830b-066c762652cc\")" failed. No retries permitted until 2018-07-13 16:03:57.636104303 +0000 UTC m=+67.631270721 (durationBeforeRetry 32s). Error: "error cleaning subPath mounts for volume \"configmap-XXXXXX\" (UniqueName: \"kubernetes.io/configmap/3a92a5bf-8422-11e8-830b-066c762652cc-configmap-XXXXXX\") pod \"3a92a5bf-8422-11e8-830b-066c762652cc\" (UID: \"3a92a5bf-8422-11e8-830b-066c762652cc\") : error deleting /var/lib/kubelet/pods/3a92a5bf-8422-11e8-830b-066c762652cc/volume-subpaths/configmap-XXXXX/xxxxx/1: remove /var/lib/kubelet/pods/3a92a5bf-8422-11e8-830b-066c762652cc/volume-subpaths/configmap-XXXXX/xxxxxx/1: device or resource busy"

After changing the kubelet root-dir, these entries appear once, then the volume is removed and the pod completes termination.

Versions:

  • Kubernetes 1.10.5
  • Ubuntu 16.04.4 LTS
  • uname -a: 4.4.0-1057-aws
  • docker: 1.13.1

We’re running into the same issue on 1.13.5, would be interested in that hotfix command… regards, strowi

We currently have it working on v1.9 to v1.12 using the following parameters on docker 18.06 (latest coreos stable):

ExecStart=/usr/bin/docker run --rm --name %n \
    --net=host --pid=host --privileged \
    -v /:/rootfs:ro \
    -v /etc/kubernetes:/etc/kubernetes:ro \
    -v /etc/kubernetes/ssl/kubelet:/etc/kubernetes/ssl/kubelet \
    -v /etc/cni:/etc/cni \
    -v /opt/cni/bin:/opt/cni/host-bin \
    -v /var/lib/cni:/var/lib/cni \
    --mount type=bind,src="/var/lib/kubelet/",dst="/var/lib/kubelet",bind-propagation=shared \
    -v /var/run:/var/run:rw \
    -v /dev:/dev \
    -v /sys:/sys:ro \
    -v /sys/fs/cgroup:/sys/fs/cgroup:rw \
    -v /var/lib/docker/:/var/lib/docker:rw \
    -v /var/log:/var/log:shared \
    -v /etc/cloud.conf:/etc/cloud.conf:ro \
    -v /etc/ssl/certs/:/etc/ssl/certs/:ro \
    {{ $registry }}{{ $image }} \
...
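The important detail in the unit above is the --mount line with bind-propagation=shared on /var/lib/kubelet: without shared propagation, mounts created inside the kubelet container never become visible on the host for teardown. Whether a mount point actually has shared propagation can be read from /proc/self/mountinfo; a small sketch (propagation_of is a hypothetical helper, not from the report):

```shell
#!/bin/sh
# Report whether a mount point has shared propagation by inspecting
# the optional "shared:N" field in /proc/self/mountinfo.
propagation_of() {
    # $1: mount point to inspect; field 5 of mountinfo is the mount point,
    # optional fields start at field 7 and end at the "-" separator.
    awk -v target="$1" '$5 == target {
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) { print "shared"; exit }
        }
        print "private"; exit
    }' /proc/self/mountinfo
}

propagation_of /var/lib/kubelet
```

On systems with a recent util-linux, `findmnt -o TARGET,PROPAGATION /var/lib/kubelet` gives the same answer without the awk.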

We are observing this problem on k8s 1.11.3; similar to @chrischdi's setup, the kubelet also runs in a container.