kubernetes: Pods stuck on terminating

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: Pods stuck on terminating for a long time

What you expected to happen: Pods get terminated

How to reproduce it (as minimally and precisely as possible):

  1. Run a deployment
  2. Delete it
  3. Pods are still terminating

Anything else we need to know?: Kubernetes pods stuck as Terminating for a few hours after getting deleted.
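A quick way to observe the symptom (a sketch; the deployment name and image are placeholders, and jq is assumed to be available):

kubectl create deployment stuck-demo --image=nginx
kubectl scale deployment stuck-demo --replicas=3
kubectl delete deployment stuck-demo

# Pods should be gone within terminationGracePeriodSeconds (30s by default);
# instead they linger. Anything with a deletionTimestamp is mid-termination:
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp) | "\(.metadata.namespace)/\(.metadata.name)"'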

Logs: kubectl describe pod my-pod-3854038851-r1hc3

Name:				my-pod-3854038851-r1hc3
Namespace:			container-4-production
Node:				ip-172-16-30-204.ec2.internal/172.16.30.204
Start Time:			Fri, 01 Sep 2017 11:58:24 -0300
Labels:				pod-template-hash=3854038851
				release=stable
				run=my-pod-3
Annotations:			kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"container-4-production","name":"my-pod-3-3854038851","uid":"5816c...
				prometheus.io/scrape=true
Status:				Terminating (expires Fri, 01 Sep 2017 14:17:53 -0300)
Termination Grace Period:	30s
IP:
Created By:			ReplicaSet/my-pod-3-3854038851
Controlled By:			ReplicaSet/my-pod-3-3854038851
Init Containers:
  ensure-network:
    Container ID:	docker://guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/ensure-network@sha256:guid-0
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		True
    Restart Count:	0
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Containers:
  container-1:
    Container ID:	docker://container-id-guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-1@sha256:guid-2
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	1G
    Requests:
      cpu:	100m
      memory:	1G
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-2:
    Container ID:	docker://container-id-guid-2
    Image:		alpine:3.4
    Image ID:		docker-pullable://alpine@sha256:alpine-container-id-1
    Port:		<none>
    Command:
      X
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	20m
      memory:	40M
    Requests:
      cpu:		10m
      memory:		20M
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-3:
    Container ID:	docker://container-id-guid-3
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-3@sha256:guid-3
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	200M
    Requests:
      cpu:	100m
      memory:	100M
    Readiness:	exec [nc -zv localhost 80] delay=1s timeout=1s period=5s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-4:
    Container ID:	docker://container-id-guid-4
    Image:		XXXX
    Image ID:		docker-pullable://repo/container-4@sha256:guid-4
    Port:		9102/TCP
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	600m
      memory:	1500M
    Requests:
      cpu:	600m
      memory:	1500M
    Readiness:	http-get http://:8080/healthy delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /app/config/external from volume-2 (ro)
      /data/volume-1 from volume-1 (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Conditions:
  Type		Status
  Initialized 	True
  Ready 	False
  PodScheduled 	True
Volumes:
  volume-1:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	volume-1
    Optional:	false
  volume-2:
    Type:	ConfigMap (a volume populated by a ConfigMap)
    Name:	external
    Optional:	false
  default-token-xxxxx:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-xxxxx
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>

sudo journalctl -u kubelet | grep "my-pod"

[...]
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing address using workloadID" Workload=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing all IPs with handle 'my-pod-3854038851-r1hc3'"
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-3854038851-r1hc3 workloadId=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Teardown processing complete." Workload=my-pod-3854038851-r1hc3 endpoint=<nil>
Sep 01 17:19:06 ip-172-16-30-204 kubelet[9619]: I0901 17:19:06.591946    9619 kubelet.go:1824] SyncLoop (DELETE, "api"):my-pod-3854038851(b8cf2ecd-8f25-11e7-ba86-0a27a44c875)"

sudo journalctl -u docker | grep "docker-id-for-my-pod"

Sep 01 17:17:55 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:55.695834447Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"
Sep 01 17:17:56 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:56.698913805Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"

Environment:

  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T15:13:53Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: AWS

  • OS (e.g. from /etc/os-release): NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Kernel (e.g. uname -a): Linux ip-172-16-30-204 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: Kops

  • Others: Docker version 1.12.6, build 78d1802

@kubernetes/sig-aws @kubernetes/sig-scheduling

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 151
  • Comments: 191 (36 by maintainers)

Most upvoted comments

I have the same issue on Kubernetes 1.8.2 on IBM Cloud. After new pods are started the old pods are stuck in terminating.

kubectl version
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.2-1+d150e4525193f1", GitCommit:"d150e4525193f1c79569c04efc14599d7deb5f3e", GitTreeState:"clean", BuildDate:"2017-10-27T08:15:17Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

I have used kubectl delete pod xxx --now as well as kubectl delete pod foo --grace-period=0 --force to no avail.

FYI I resolved this with a force delete using:

kubectl delete pods <pod> --grace-period=0 --force

And I believe this successfully managed to terminate the pod. Since then I have not experienced the issue again. I have possibly updated since then, so could be a version issue, but not 100% since it’s been so long since I’ve seen the issue.

I have the same issue with Kubernetes 1.8.1 on Azure: after the deployment is changed and new pods have been started, the old pods are stuck at terminating.

This has been happening to me a lot on the latest 1.9.4 release on GKE. Been doing this for now:

kubectl delete pod NAME --grace-period=0 --force

kubectl delete pods <podname> --force --grace-period=0 worked for me!

kubectl delete --all pods --namespace=xxxxx --force --grace-period=0

works for me.

Do not forget about "--grace-period=0". It matters.

I’ve found that if you use --force --grace-period=0 all it does is remove the reference… if you ssh into the node, you’ll still see the docker containers running.

I had a bunch of Pods like that, so I had to come up with a command that would clean up all the terminating pods:

kubectl get pods -o json | jq -c '.items[] | select(.metadata.deletionTimestamp) | .metadata.name' | xargs -I '{}' kubectl delete pod --force --grace-period 0 '{}'

I know this is only a work-around, but I’m not waking up sometimes at 3am to fix this anymore. Not saying you should use this, but it might help some people.
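A namespace-aware variant of the same idea (an untested sketch; assumes jq is available and that force-deleting is acceptable in your cluster):

kubectl get pods --all-namespaces -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp) | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read ns name; do
      kubectl delete pod "$name" -n "$ns" --force --grace-period 0
    done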

The sleep matches what my pods' terminationGracePeriodSeconds is set to (30 seconds). If a pod is alive longer than that, this cronjob will --force --grace-period=0 it and kill it completely:

apiVersion: batch/v1beta1   # assumed; use whichever CronJob apiVersion your cluster serves
kind: CronJob
metadata:
  name: stuckpod-restart
spec:
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 5
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: stuckpod-restart
            image: devth/helm:v2.9.1
            args:
            - /bin/sh
            - -c
            - echo "$(date) Job stuckpod-restart Starting"; kubectl get pods --all-namespaces=true | awk '$4=="Terminating" {print "sleep 30; echo Killing pod " $2 "; kubectl delete pod " $2 " -n " $1 " --grace-period=0 --force"}' | sh; echo "$(date) Job stuckpod-restart Complete";
          restartPolicy: OnFailure

kubectl delete pods <pod> --grace-period=0 --force is a temporary fix; I don’t want to run a manual fix every time there is a failover for one of the affected pods. My zookeeper pods won’t terminate in minikube or on Azure AKS.

Update, March 9th 2020: I used a preStop lifecycle hook to manually terminate my pods. My zookeeper pods were stuck in a Terminating status and wouldn’t respond to a TERM signal from within the container. I had basically the same manifest running elsewhere and everything terminates correctly there; I have no clue what the root cause is.
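For reference, a rough sketch of wiring in such a preStop hook with kubectl patch (the StatefulSet name, container name, and stop command are placeholders, not taken from the original manifest):

kubectl patch statefulset zk --patch '
{"spec":{"template":{"spec":{"containers":[
  {"name":"zookeeper",
   "lifecycle":{"preStop":{"exec":{"command":["sh","-c","zkServer.sh stop || true"]}}}}
]}}}}'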

In my experience, sudo systemctl restart docker on the node helps (but there is obviously downtime).

And this is still happening periodically on random nodes that are either (a) close to memory limits or (b) CPU starved (either because of some kswapd0 issue, which might still be memory related, or actual load).

same issue, super annoying

This is still very much an active issue: k8s 1.15.4 and RHEL Docker 1.13.1. Pods constantly stay in Terminating even though the container is already gone, and k8s cannot figure this out by itself; it requires human interaction. That makes test scripting a real PITA.

/reopen /remove-lifecycle rotten

Have the same bug on my local cluster set up using kubeadm.

docker ps | grep {pod name} on the node shows nothing, and the pod is stuck in the Terminating state. I currently have two pods in this state.

What can I do to forcefully delete the pod? Or maybe change name of the pod? I cannot spin up another pod under the same name. Thanks!

@gm42 I was able to manually work around this issue on GKE by the following steps (a rough one-liner version follows the list):

  1. SSH into the node the stuck pod was scheduled on
  2. Running docker ps | grep {pod name} to get the Docker Container ID
  3. Running docker rm -f {container id}
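Roughly the same steps as a one-liner (a sketch; assumes direct SSH access to the nodes, and <pod name> / <container id> are placeholders):

NODE=$(kubectl get pod <pod name> -o jsonpath='{.spec.nodeName}')
ssh "$NODE" "docker ps | grep <pod name>"    # note the container ID
ssh "$NODE" "docker rm -f <container id>"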

so, it’s the end of 2018, kube 1.12 is out and … you all still have problems with stuck pods ?

Seeing a similar symptom: pods stuck in Terminating (interestingly, they all have an exec-type probe for readiness/liveness). Looking at the logs I can see: kubelet[1445]: I1022 10:26:32.203865 1445 prober.go:124] Readiness probe for "test-service-74c4664d8d-58c96_default(822c3c3d-082a-4dc9-943c-19f04544713e):test-service" failed (failure): OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown. This message repeats forever, and changing the exec probe to tcpSocket seems to allow the pod to terminate (based on one test; I will follow up on it). The pod seems to have one of the containers "Running" but not "Ready", and the logs for the "Running" container show as if the service stopped.

Usually volume and network cleanup consume more time in termination.

Correct. They are always suspect.

@igorleao You can try kubectl delete pod xxx --now as well.

Forcing the delete with kubectl delete po $pod --grace-period=0 --force worked; the --now flag wasn’t working. I am not sure about #65936, but I would like to not kill the node when Unknown states happen.

I had this yesterday on 1.9.7, with a pod stuck in the Terminating state; the logs just said "needs to kill pod", and I had to use --force --grace-period=0 to get rid of it.

I’m still having this problem on 1.9.6 on Azure AKS managed cluster.

Using this workaround at the moment to select all stuck pods and delete them (as I end up having swathes of Terminating pods in my dev/scratch cluster):

kubectl get pods | awk '$3=="Terminating" {print "kubectl delete pod " $1 " --grace-period=0 --force"}' | xargs -0 bash -c

Same problem here on GKE 1.9.4-gke.1 seems to be related to volume mounts. It happens every time with filebeats set up as described here: https://github.com/elastic/beats/tree/master/deploy/kubernetes/filebeat

Kubelet log shows this:

Mar 23 19:44:16 gke-testing-c2m4-1-97b57429-40jp kubelet[1361]: I0323 19:44:16.380949    1361 reconciler.go:191] operationExecutor.UnmountVolume started for volume "config" (UniqueName: "kubernetes.io/configmap/9a5f1519-2d39-11e8-bec8-42010a8400f3-config") pod "9a5f1519-2d39-11e8-bec8-42010a8400f3" (UID: "9a5f1519-2d39-11e8-bec8-42010a8400f3")
Mar 23 19:44:16 gke-testing-c2m4-1-97b57429-40jp kubelet[1361]: E0323 19:44:16.382032    1361 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/configmap/9a5f1519-2d39-11e8-bec8-42010a8400f3-config\" (\"9a5f1519-2d39-11e8-bec8-42010a8400f3\")" failed. No retries permitted until 2018-03-23 19:44:32.381982706 +0000 UTC m=+176292.263058344 (durationBeforeRetry 16s). Error: "error cleaning subPath mounts for volume \"config\" (UniqueName: \"kubernetes.io/configmap/9a5f1519-2d39-11e8-bec8-42010a8400f3-config\") pod \"9a5f1519-2d39-11e8-bec8-42010a8400f3\" (UID: \"9a5f1519-2d39-11e8-bec8-42010a8400f3\") : error checking /var/lib/kubelet/pods/9a5f1519-2d39-11e8-bec8-42010a8400f3/volume-subpaths/config/filebeat/0 for mount: lstat /var/lib/kubelet/pods/9a5f1519-2d39-11e8-bec8-42010a8400f3/volume-ubpaths/config/filebeat/0/..: not a directory"

kubectl delete pod NAME --grace-period=0 --force seems to work. also restarting kubelet works.

Today I encountered an issue that may be the same as the one described, where we had pods on one of our customer systems getting stuck in the Terminating state for several days. We were also seeing the errors about "Error: UnmountVolume.TearDown failed for volume" with "device or resource busy" repeated for each of the stuck pods.

In our case, it appears to be an issue with docker on RHEL/Centos 7.4 based systems covered in this moby issue: https://github.com/moby/moby/issues/22260 and this moby PR: https://github.com/moby/moby/pull/34886/files

For us, once we set the sysctl option fs.may_detach_mounts=1 within a couple minutes all our Terminating pods cleaned up.
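For reference, setting that sysctl looks roughly like this (run on each affected node; the drop-in file name is arbitrary):

# apply immediately
sudo sysctl -w fs.may_detach_mounts=1
# persist across reboots
echo 'fs.may_detach_mounts = 1' | sudo tee /etc/sysctl.d/99-may-detach-mounts.conf
sudo sysctl --system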

I was only able to get rid of the "stuck in Terminating" pods by deleting the finalizers: kubectl patch -n mynamespace pod mypod -p '{"metadata":{"finalizers":null}}'. The kubectl delete pod mypod --force --grace-period=0 didn’t work for me.

Echoing @jingxu97, there are a lot of different issues being discussed in this thread. There are many possible reasons why a Pod could get stuck in terminating. We know this is a common issue with many possible root causes! 😃

If you run into this issue, please ensure you file a new bug with a detailed report, including a full dump of the relevant pod YAMLs and kubelet logs. This information is necessary to debug these issues.
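A minimal set of commands for gathering that information (a sketch; <pod> and <namespace> are placeholders):

kubectl get pod <pod> -n <namespace> -o yaml > stuck-pod.yaml
kubectl describe pod <pod> -n <namespace> > stuck-pod-describe.txt
# on the node the pod was scheduled on:
sudo journalctl -u kubelet --since "1 hour ago" > kubelet.log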

I am going to close this particular issue because it dates back to 1.7 and its scope is not actionable. /close

As an FYI we had a fix just hit the master branch recently for fixing a race condition where pods created and deleted rapidly would get stuck in Terminating: https://github.com/kubernetes/kubernetes/pull/98424

The node team is letting this bake for a bit and ensuring tests are stable; I’m not sure it’ll get backported as it’s a large change.

I currently have pods stuck for 2+ days in the Terminating state.

@oscarlofwenhamn As far as I’m aware, this effectively runs SIGKILL on all processes in that pod, ensuring deletion of zombie processes (source: point 6 under ‘Termination of Pods’ - https://kubernetes.io/docs/concepts/workloads/pods/pod/#:~:text=When the grace period expires,period 0 (immediate deletion).), and successfully removes the pod (it may not happen immediately, but it will happen).

The guide mentions that it removes the reference but does not delete the pod itself (source: ‘Force Deletion’ - https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/); however, --grace-period=0 should effectively SIGKILL your pod, albeit not immediately.

I’m just relaying the docs and the recommended ways to handle the scenario I encountered. The issue I specifically encountered was not a recurring one; it happened once. I do believe the REAL fix for this is fixing your deployment, but until you get there, this method should help.

I feel that his comment is relevant because the underlying container (docker or whatever) may still be running and not fully deleted. The illusion of it being "removed" is a little misleading at times.

On Thu, Jun 4, 2020 at 9:16 AM, Conner Stephen McCabe < notifications@github.com> wrote:

FYI I resolved this with a force delete using:

kubectl delete pods <pod> --grace-period=0 --force

And I believe this successfully managed to terminate the pod. Since then I have not experienced the issue again. I have possibly updated since then, so could be a version issue, but not 100% since it’s been so long since I’ve seen the issue.

Also, the --force flag doesn’t necessarily mean the pod is removed, it just doesn’t wait for confirmation (and drops the reference, to my understanding). As stated by the warning, "The resource may continue to run on the cluster indefinitely."

Correct, but the point is that --grace-period=0 forces the delete to happen 😃 not sure why your comment is relevant 😕

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kubernetes/kubernetes/issues/51835#issuecomment-638840136, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAH34CDZF7EJRLAQD7OSH2DRU6NCRANCNFSM4DZKZ5VQ .

This happens to me when a pod is running out of memory. It doesn’t terminate until the memory usage goes down again.

@AndrewSav

I don’t see any other solutions here to be frank.

Sure, the cluster will be left in an “inconsistent state”. I’d like to understand what you mean exactly by this. Force closing is bad. I also don’t like it, but in my case, I am comfortable destroying and redeploying any resources as required.

In my case, it seems to only get stuck terminating on pods which have an NFS mount, and only when the NFS server goes down before the client tries to go down.

@shinebayar-g , the problem with --force is that it could mean that your container will keep running. It just tells Kubernetes to forget about this pod’s containers. A better solution is to SSH into the VM running the pod and investigate what’s going on with Docker. Try to manually kill the containers with docker kill and if successful, attempt to delete the pod normally again.
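Sketched out, that suggestion looks something like this (placeholders throughout; run the docker commands on the node itself):

docker ps | grep <pod name>      # find the pod's containers
docker kill <container id>       # stop them manually
# back on your workstation, retry a normal delete:
kubectl delete pod <pod name>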

I just had the problem that the pods were not terminating because a secret was missing. After I created that secret in that namespace everything was back to normal.

For folks seeing this issue on 1.9.4-gke.1, it is most likely due to https://github.com/kubernetes/kubernetes/issues/61178, which is fixed in 1.9.5 and is being rolled out in GKE this week. The issue is related to the cleanup of subpath mounts of a file (not a directory). @zackify @nodefactory-bk @Tapppi @Stono

IIUC, the original problem in this bug is related to configuration of containerized kubelet, which is different.

Same problem here on GKE 1.9.4-gke.1. It only happens with a specific filebeat daemonset, but recreating all the nodes doesn’t help either; it just keeps happening.

I have found the reason in my 1.7.2 cluster: another monitoring program mounts the root path /, and the root path contains /var/lib/kubelet/pods/ddc66e10-0711-11e8-b905-6c92bf70b164/volumes/kubernetes.io~secret/default-token-bnttf, so when kubelet deletes the pod it can’t release the volume. The message is:
device or resource busy

Steps (a small script version follows below):

  1. sudo journalctl -u kubelet: this helped me find the error message.

  2. sudo docker inspect <container-id>: find io.kubernetes.pod.uid: "ddc66e10-0711-11e8-b905-6c92bf70b164" and HostConfig -> Binds -> "/var/lib/kubelet/pods/ddc66e10-0711-11e8-b905-6c92bf70b164/volumes/kubernetes.io~secret/default-token-bnttf:/var/run/secrets/kubernetes.io/serviceaccount:ro"

  3. grep -l ddc66e10-0711-11e8-b905-6c92bf70b164 /proc/*/mountinfo

/proc/90225/mountinfo

  4. ps aux | grep 90225

root 90225 1.3 0.0 2837164 42580 ? Ssl Feb01 72:40 ./monitor_program
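The same lookup as a small script (an untested sketch; the pod UID is the one from the steps above):

POD_UID=ddc66e10-0711-11e8-b905-6c92bf70b164
for f in $(grep -l "$POD_UID" /proc/*/mountinfo); do
  pid=${f#/proc/}; pid=${pid%/mountinfo}
  # show which process is still holding the pod's mounts
  ps -o pid,comm -p "$pid"
done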

Have the same bug on my 1.7.2

operationExecutor.UnmountVolume started for volume “default-token-bnttf” (UniqueName: “kubernetes.io/secret/ddc66e10-0711-11e8-b905-6c92bf70b164-default-token-bnttf”) pod “ddc66e10-0711-11e8-b905-6c92bf70b164” kubelet[94382]: E0205 11:35:50.509169 94382 nestedpendingoperations.go:262] Operation for “"kubernetes.io/secret/ddc66e10-0711-11e8-b905-6c92bf70b164-default-token-bnttf" ("ddc66e10-0711-11e8-b905-6c92bf70b164")” failed. No retries permitted until 2018-02-05 11:37:52.509148953 +0800 CST (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume “default-token-bnttf” (UniqueName: “kubernetes.io/secret/ddc66e10-0711-11e8-b905-6c92bf70b164-default-token-bnttf”) pod “ddc66e10-0711-11e8-b905-6c92bf70b164” (UID: “ddc66e10-0711-11e8-b905-6c92bf70b164”) : remove /var/lib/kubelet/pods/ddc66e10-0711-11e8-b905-6c92bf70b164/volumes/kubernetes.io~secret/default-token-bnttf: device or resource busy

I’ve seen it too. I can’t check logs because kubectl complains it can’t connect to the docker container, and I can’t create a new pod due to the existing terminating pod. Rather annoying.

As a workaround I wrote a script which grabs the last few lines from /var/log/syslog and searches for errors like "Operation for … remove /var/lib/kubelet/pods … directory not empty", "nfs … device is busy … unmount.nfs", or "stale NFS file handle". Then it extracts either the pod_id or the pod’s full directory, checks what mounts it has (like mount | grep $pod_id), unmounts them all, and removes the corresponding directories. Eventually kubelet does the rest and gracefully shuts down and deletes the pods. No more pods in the Terminating state.

I put that script in cron to run every minute. As a result, no issue for now, even 3-4 months later. Note: I know this approach is unreliable and it requires a check on every cluster upgrade, but it works!
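Roughly what such a script might look like (an untested sketch; the log path, error patterns, and cleanup steps are assumptions and would need adapting to your distro):

#!/bin/bash
tail -n 500 /var/log/syslog \
  | grep -E 'directory not empty|device is busy|stale NFS file handle' \
  | grep -oE '/var/lib/kubelet/pods/[0-9a-f-]{36}' | sort -u \
  | while read pod_dir; do
      # unmount everything still mounted under the stuck pod's directory
      mount | awk -v d="$pod_dir" 'index($3, d) == 1 {print $3}' \
        | while read m; do umount -f -l "$m"; done
      # remove what's left so kubelet can finish the teardown
      rm -rf "$pod_dir"
    done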

We are facing the same issue since we started mounting secrets (shared with many pods). The pod goes into the Terminating state and stays there forever. Our version is v1.10.0. The attached docker container is gone, but the reference in the API server remains unless I forcefully delete the pod with the --grace-period=0 --force option.

Looking for a permanent solution.

Faced this today. What was done:

  1. SSH to the node and remove the container manually.
  2. After that, kubectl get pods shows my stuck container as 0/1 Terminating (it was 1/1 Terminating).
  3. Remove the finalizers section from the pod; mine was foregroundDeletion ($ kubectl edit pod/name) --> container removed from the pods list.
  4. Delete the deployment --> all deployment-related stuff removed.
kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

I removed my stuck pods like this:

user@laptop:~$ kubectl -n storage get pod
NAME                     READY     STATUS        RESTARTS   AGE
minio-65b869c776-47hql   0/1       Terminating   5          1d
minio-65b869c776-bppl6   0/1       Terminating   33         1d
minio-778f4665cd-btnf5   1/1       Running       0          1h
sftp-775b578d9b-pqk5x    1/1       Running       0          28m
user@laptop:~$ kubectl -n storage delete pod minio-65b869c776-47hql --grace-period 0 --force
pod "minio-65b869c776-47hql" deleted
user@laptop:~$ kubectl -n storage delete pod minio-65b869c776-bppl6 --grace-period 0 --force
pod "minio-65b869c776-bppl6" deleted
user@laptop:~$ kubectl -n storage get pod
NAME                     READY     STATUS    RESTARTS   AGE
minio-778f4665cd-btnf5   1/1       Running   0          2h
sftp-775b578d9b-pqk5x    1/1       Running   0          30m
user@laptop:~$

I tried adding timeo=30 and intr, but same issue. This locks it up; you must log in to the node and do a umount -f -l on the underlying mount, and then you can do a kubectl delete --force --grace-period 0 on the pod.

It seems that, since this was mounted on behalf of the pod, it could possibly be unmounted (or force-unmounted after some timeout) on delete automatically.
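On the node, those manual steps look roughly like this (a sketch; paths and names are placeholders):

mount | grep nfs | grep /var/lib/kubelet/pods      # find the stale mount
sudo umount -f -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>
kubectl delete pod <pod> --force --grace-period 0  # now the delete goes through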

The attached manifest reproduces this for me on GKE.

kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.2-gke.1", GitCommit:"75d2af854b1df023c7ce10a8795b85d3dd1f8d37", GitTreeState:"clean", BuildDate:"2018-05-10T17:23:18Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}

k8s-nfs-test.yaml.txt

Run it, then delete it. You will get an ‘nfs-client’ pod stuck in deleting. The reason is the hard NFS mount on the node, combined with the ‘server’ being deleted first.
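One mitigation (a sketch, not part of the attached manifest) is to declare the NFS PersistentVolume with soft mount options, so a vanished server eventually returns an error instead of hanging the node; note that soft NFS mounts trade hangs for possible I/O errors:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-example                              # hypothetical name
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteMany"]
  mountOptions: ["soft", "timeo=30", "retrans=2"]   # avoid the default hard mount
  nfs:
    server: <nfs-server-ip>                         # hypothetical address
    path: /
EOF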

Hitting this with k8s 1.9.6: when kubelet is unable to unmount a CephFS mount, all pods on the node stay Terminating forever. I had to restart the node to recover; restarting kubelet or docker did not help.

I spoke too soon.

  Type    Reason   Age   From                                                      Message
  ----    ------   ----  ----                                                      -------
  Normal  Killing  4m    kubelet, gke-delivery-platform-custom-pool-560b2b96-gcmb  Killing container with id docker://filebeat:Need to kill Pod

Had to destroy it in brutal fashion.

❯ kks delete pod filebeat-x56v8 --force --grace-period 0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "filebeat-x56v8" deleted

On GKE, upgrading nodes helped instantly.

Affected by the same bug on GKE. Are there any known workarounds for this issue? Using --now does not work.

It looks like there are two different bugs related to this issue. We have both on our 1.8.3 cluster.

  1. https://github.com/moby/moby/issues/31768 . It’s a Docker bug. Reproducible on docker-ce=17.09.0~ce-0~ubuntu.
  2. The second is more interesting and maybe related to some race condition inside kubelet. We have a lot of pods that use an NFS persistent volume with a subpath specified in the container mounts, and somehow some of them get stuck in a terminating state after deleting deployments. And there are a lot of messages in the syslog:
 Error: UnmountVolume.TearDown failed for volume "nfs-test" (UniqueName: "kubernetes.io/nfs/39dada78-d9cc-11e7-870d-3c970e298d91-nfs-test") pod "39dada78-d9cc-11e7-870d-3c970e298d91" (UID: "39dada78-d9cc-11e7-870d-3c970e298d91") : remove /var/lib/kubelet/pods/39dada78-d9cc-11e7-870d-3c970e298d91/volumes/kubernetes.io~nfs/nfs-test: directory not empty

And it’s true, the directory is not empty: it’s unmounted and contains our "subpath" directory! One explanation of this behavior:

  1. P1: Start creating or syncing the pod.
  2. P1: Send a signal to the volume manager to make mounts/remounts.
  3. P1: Wait for the mount to be completed.
  4. P1: Receive the mount-success signal (actually it just checks that all volumes are mounted).
  5. Somehow the volume becomes unmounted. Maybe another deletion process unmounts it, or some OS bug, or some garbage collector action.
  6. P1: Continue creating the container and create a subdirectory in the (already unmounted) mount point.
  7. After all the previous steps, the pod can’t be deleted, because the mount directory isn’t empty.

Usually volume and network cleanup consume more time in termination. Can you find in which phase your pod is stuck? Volume cleanup for example?

Just to add to the possible causes, for the benefit of those whose google search takes them here: I have a cluster on AWS EKS where namespaces and pods are created/terminated several times a day. Today, for the first time, I saw a problem where a bunch of pods got stuck in a terminating state for several hours.

I believe the cause of this was a bad underlying node:

➜ ~ kubectl get nodes -A
NAME                       STATUS     ROLES    AGE     VERSION
ip-10-x-x-x.ec2.internal   NotReady   <none>   4h54m   v1.16.13-eks-ec92d4

(a standard AWS instance hardware/unresponsive thing)

So what I think happened is that the pods on that node got into a state whereby they could not be terminated. Terminations of pods on good nodes were unaffected. To fix it, I drained the pods off the bad node, and as this is in an ASG I was able to just terminate the instance and spin up a new one.
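Sketched out, that recovery looks something like this (the node name and instance ID are placeholders; assumes the AWS CLI is configured):

kubectl get nodes                                    # spot the NotReady node
kubectl drain ip-10-x-x-x.ec2.internal --ignore-daemonsets --delete-local-data --force
# with the node drained, terminate the instance and let the ASG replace it:
aws ec2 terminate-instances --instance-ids <instance-id>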

@elrok123 Brilliant - I was indeed ill-informed. I’ve updated my response above, referencing this explanation. Thanks for the detailed response, and a further motivated method for dealing with troublesome pods. Cheers!

FYI I resolved this with a force delete using:

kubectl delete pods <pod> --grace-period=0 --force

And I believe this successfully managed to terminate the pod. Since then I have not experienced the issue again. I have possibly updated since then, so could be a version issue, but not 100% since it’s been so long since I’ve seen the issue.

Also, the --force flag doesn’t necessarily mean the pod is removed, it just doesn’t wait for confirmation (and drops the reference, to my understanding). As stated by the warning, "The resource may continue to run on the cluster indefinitely."

Edit: I was ill-informed. See elrok123’s comment below for further motivation.

@mikesplain: Reopened this issue.

In response to this:

/reopen /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I am using version 1.10 and I experienced this issue today. I think my problem is related to the mounting of a secret volume, which might have left some task pending and left the pod in a Terminating status forever.

I had to use the --grace-period=0 --force option to terminate the pods.

root@ip-10-31-16-222:/var/log# journalctl -u kubelet | grep dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: I0320 15:50:31.179901 528 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "config-volume" (UniqueName: "kubernetes.io/configmap/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-config-volume") pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds" (UID: "e3d7c57a-4b27-11e9-9aaa-0203c98ff31e") Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: I0320 15:50:31.179935 528 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-xjlgc" (UniqueName: "kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-default-token-xjlgc") pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds" (UID: "e3d7c57a-4b27-11e9-9aaa-0203c98ff31e") Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: I0320 15:50:31.179953 528 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "secret-volume" (UniqueName: "kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume") pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds" (UID: "e3d7c57a-4b27-11e9-9aaa-0203c98ff31e") Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:31.310200 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:31.810156118 +0000 UTC m=+966792.065305175 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:31.885807 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:32.885784622 +0000 UTC m=+966793.140933656 (durationBeforeRetry 1s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxxx-com\" not found" Mar 20 15:50:32 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:32.987385 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:34.987362044 +0000 UTC m=+966795.242511077 (durationBeforeRetry 2s). 
Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:35 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:35.090836 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:39.090813114 +0000 UTC m=+966799.345962147 (durationBeforeRetry 4s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:39 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:39.096621 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:47.096593013 +0000 UTC m=+966807.351742557 (durationBeforeRetry 8s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:47 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:47.108644 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:51:03.10862005 +0000 UTC m=+966823.363769094 (durationBeforeRetry 16s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:51:03 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:51:03.133029 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:51:35.133006645 +0000 UTC m=+966855.388155677 (durationBeforeRetry 32s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found" Mar 20 15:51:35 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:51:35.184310 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:52:39.184281161 +0000 UTC m=+966919.439430217 (durationBeforeRetry 1m4s). 
Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:52:34 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:52:34.005027 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod Mar 20 15:52:34 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:52:34.005085 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc] Mar 20 15:52:39 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:52:39.196332 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:54:41.196308703 +0000 UTC m=+967041.451457738 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found" Mar 20 15:54:41 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:54:41.296252 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:56:43.296229192 +0000 UTC m=+967163.551378231 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found" Mar 20 15:54:48 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:54:48.118620 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. 
list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod Mar 20 15:54:48 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:54:48.118681 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc] Mar 20 15:56:43 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:56:43.398396 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:58:45.398368668 +0000 UTC m=+967285.653517703 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found" Mar 20 15:57:05 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:57:05.118566 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod Mar 20 15:57:05 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:57:05.118937 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc] Mar 20 15:59:22 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:59:22.118593 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume config-volume default-token-xjlgc]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod Mar 20 15:59:22 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:59:22.118624 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume config-volume default-token-xjlgc]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]

I am still getting stuck with this issue on k8s v1.11.0. Here is a checklist of what I do to clean up my pods:

  • Make sure that all resources attached to the pod have been reclaimed. Not all of them are visible in kubectl get; some of them are only known to the kubelet the pod is running on, so you will have to follow its log stream locally.
  • When all else fails, kubectl edit the failed pod and remove the finalizers (mine was - foregroundDeletion).

Two more tips:

  • In steady state a non-confused kubelet should log no periodic messages whatsoever; any kind of repeated failure to release something is the symptom of a stuck pod.
  • You can keep a kubectl delete command blocked in another window to monitor your progress (even on a pod you already "deleted" many times); kubectl delete will terminate as soon as the last stuck resource gets released. A rough way to watch this is sketched below.
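A rough way to do that monitoring (a sketch; <pod> is a placeholder):

# window 1: a blocking delete that only returns when the pod is really gone
kubectl delete pod <pod>
# window 2: see what is still holding it up
kubectl get pod <pod> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
kubectl get events --field-selector involvedObject.name=<pod>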

@agolomoodysaada Ah, that makes sense. Thanks for the explanation. So I wouldn’t really know whether the actual container is really deleted or not, right?

When a pod is terminated, we do unmount the volume (assuming that the server is still there). If you are seeing dangling mounts even when the server exists, then that is a bug.

If you use dynamic provisioning with PVCs and PVs, then we don’t allow the PVC (and underlying storage) to be deleted until all Pods referencing it are done using it. If you want to orchestrate the provisioning yourself, then you need to ensure you don’t delete the server until all pods are done using it.

I’m not sure if this is the same issue, but we have started noticing this behaviour since upgrading from 1.9.3 to 1.10.1; it never happened before that. We’re using glusterfs volumes, with subPath. Kubelet continuously logs things like

Apr 23 08:21:11 int-kube-01 kubelet[13018]: I0423 08:21:11.106779   13018 reconciler.go:181] operationExecutor.UnmountVolume started for volume "dev-static" (UniqueName: "kubernetes.io/glusterfs/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f-dev-static") pod "ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f" (UID: "ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f")
Apr 23 08:21:11 int-kube-01 kubelet[13018]: E0423 08:21:11.122027   13018 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/glusterfs/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f-dev-static\" (\"ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f\")" failed. No retries permitted until 2018-04-23 08:23:13.121821027 +1000 AEST m=+408681.605939042 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"dev-static\" (UniqueName: \"kubernetes.io/glusterfs/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f-dev-static\") pod \"ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f\" (UID: \"ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f\") : Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static\nOutput: umount: /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static: target is busy.\n        (In some cases useful info about processes that use\n         the device is found by lsof(8) or fuser(1))\n\n"

and lsof shows indeed that the directory under the glusterfs volumes is still in use:

glusterfs  71570                     root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterti  71570  71571              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glustersi  71570  71572              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterme  71570  71573              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glustersp  71570  71574              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glustersp  71570  71575              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71579              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterio  71570  71580              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71581              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71582              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71583              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71584              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71585              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71586              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterep  71570  71587              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterfu  71570  71592              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere
glusterfu  71570  71593              root   10u      DIR              0,264      4096  9380607748984626555 /var/lib/kubelet/pods/ad8fabbe-4449-11e8-b21a-a2bfb3c62d0f/volumes/kubernetes.io~glusterfs/dev-static/subpathhere

This was all fine on 1.9.3, so it’s as if the fix for this issue has broken our use case 😦

Same issue here on Azure, Kube 1.8.7

This might help some: we are running kubelet in a docker container with the --containerized flag and were able to solve this issue by mounting /rootfs, /var/lib/docker and /var/lib/kubelet as shared mounts. The final mounts look like this:

      -v /:/rootfs:ro,shared \
      -v /sys:/sys:ro \
      -v /dev:/dev:rw \
      -v /var/log:/var/log:rw \
      -v /run/calico/:/run/calico/:rw \
      -v /run/docker/:/run/docker/:rw \
      -v /run/docker.sock:/run/docker.sock:rw \
      -v /usr/lib/os-release:/etc/os-release \
      -v /usr/share/ca-certificates/:/etc/ssl/certs \
      -v /var/lib/docker/:/var/lib/docker:rw,shared \
      -v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
      -v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
      -v /etc/kubernetes/config/:/etc/kubernetes/config/ \
      -v /etc/cni/net.d/:/etc/cni/net.d/ \
      -v /opt/cni/bin/:/opt/cni/bin/ \

Some more details: this does not properly solve the problem, because for every bind mount you’ll get 3 mounts inside the kubelet container (2 parasite ones). But at least shared mounts allow them to be easily unmounted in one shot.

CoreOS does not have this problem, because it uses rkt and not docker for the kubelet container. In our case kubelet runs in Docker, and every mount inside the kubelet container gets propagated into /var/lib/docker/overlay/... and /rootfs; that’s why we have two parasite mounts for every bind-mounted volume:

  • one from /rootfs, in /rootfs/var/lib/kubelet/<mount>
  • one from /var/lib/docker, in /var/lib/docker/overlay/.../rootfs/var/lib/kubelet/<mount>