kubernetes: Pods stuck on terminating
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: Pods stuck on terminating for a long time
What you expected to happen: Pods get terminated
How to reproduce it (as minimally and precisely as possible):
- Run a deployment
- Delete it
- Pods are still terminating
Anything else we need to know?:
Kubernetes pods get stuck as Terminating for a few hours after being deleted.
Logs: `kubectl describe pod my-pod-3854038851-r1hc3`
Name: my-pod-3854038851-r1hc3
Namespace: container-4-production
Node: ip-172-16-30-204.ec2.internal/172.16.30.204
Start Time: Fri, 01 Sep 2017 11:58:24 -0300
Labels: pod-template-hash=3854038851
release=stable
run=my-pod-3
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"container-4-production","name":"my-pod-3-3854038851","uid":"5816c...
prometheus.io/scrape=true
Status: Terminating (expires Fri, 01 Sep 2017 14:17:53 -0300)
Termination Grace Period: 30s
IP:
Created By: ReplicaSet/my-pod-3-3854038851
Controlled By: ReplicaSet/my-pod-3-3854038851
Init Containers:
ensure-network:
Container ID: docker://guid-1
Image: XXXXX
Image ID: docker-pullable://repo/ensure-network@sha256:guid-0
Port: <none>
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Containers:
container-1:
Container ID: docker://container-id-guid-1
Image: XXXXX
Image ID: docker-pullable://repo/container-1@sha256:guid-2
Port: <none>
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 100m
memory: 1G
Requests:
cpu: 100m
memory: 1G
Environment:
XXXX
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
container-2:
Container ID: docker://container-id-guid-2
Image: alpine:3.4
Image ID: docker-pullable://alpine@sha256:alpine-container-id-1
Port: <none>
Command:
X
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 20m
memory: 40M
Requests:
cpu: 10m
memory: 20M
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
container-3:
Container ID: docker://container-id-guid-3
Image: XXXXX
Image ID: docker-pullable://repo/container-3@sha256:guid-3
Port: <none>
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 100m
memory: 200M
Requests:
cpu: 100m
memory: 100M
Readiness: exec [nc -zv localhost 80] delay=1s timeout=1s period=5s #success=1 #failure=3
Environment:
XXXX
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
container-4:
Container ID: docker://container-id-guid-4
Image: XXXX
Image ID: docker-pullable://repo/container-4@sha256:guid-4
Port: 9102/TCP
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 600m
memory: 1500M
Requests:
cpu: 600m
memory: 1500M
Readiness: http-get http://:8080/healthy delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
XXXX
Mounts:
/app/config/external from volume-2 (ro)
/data/volume-1 from volume-1 (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
volume-1:
Type: Secret (a volume populated by a Secret)
SecretName: volume-1
Optional: false
volume-2:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: external
Optional: false
default-token-xxxxx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xxxxx
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
sudo journalctl -u kubelet | grep "my-pod"
[...]
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing address using workloadID" Workload=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing all IPs with handle 'my-pod-3854038851-r1hc3'"
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-3854038851-r1hc3 workloadId=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Teardown processing complete." Workload=my-pod-3854038851-r1hc3 endpoint=<nil>
Sep 01 17:19:06 ip-172-16-30-204 kubelet[9619]: I0901 17:19:06.591946 9619 kubelet.go:1824] SyncLoop (DELETE, "api"): "my-pod-3854038851(b8cf2ecd-8f25-11e7-ba86-0a27a44c875)"
sudo journalctl -u docker | grep "docker-id-for-my-pod"
Sep 01 17:17:55 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:55.695834447Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"
Sep 01 17:17:56 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:56.698913805Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"
Environment:
- Kubernetes version (use `kubectl version`): Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T15:13:53Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"
- Kernel (e.g. `uname -a`): Linux ip-172-16-30-204 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: Kops
- Others: Docker version 1.12.6, build 78d1802
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 151
- Comments: 191 (36 by maintainers)
I have the same issue on Kubernetes 1.8.2 on IBM Cloud. After new pods are started the old pods are stuck in terminating.
`kubectl version`:
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.2-1+d150e4525193f1", GitCommit:"d150e4525193f1c79569c04efc14599d7deb5f3e", GitTreeState:"clean", BuildDate:"2017-10-27T08:15:17Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
I have used `kubectl delete pod xxx --now` as well as `kubectl delete pod foo --grace-period=0 --force`, to no avail.

FYI I resolved this with a force delete, and I believe this successfully managed to terminate the pod. Since then I have not experienced the issue again. I have possibly updated since then, so it could be a version issue, but I'm not 100% sure, since it's been so long since I've seen the issue.
I have the same issue with Kubernetes 1.8.1 on Azure: after the deployment is changed and new pods have been started, the old pods are stuck at terminating.
This has been happening to me a lot on the latest 1.9.4 release on GKE. Been doing this for now: `kubectl delete pods <podname> --force --grace-period=0`

Worked for me! `kubectl delete --all pods --namespace=xxxxx --force --grace-period=0` works for me.

Do not forget about `--grace-period=0`. It matters.
I've found that if you use `--force --grace-period=0`, all it does is remove the reference… if you ssh into the node, you'll still see the Docker containers running.

I had a bunch of pods like that, so I had to come up with a command that would clean up all the terminating pods:

I know this is only a work-around, but I'm not waking up at 3am to fix this anymore. Not saying you should use this, but it might help some people.

The sleep matches what my pods' terminationGracePeriodSeconds is set to (30 seconds). If a pod is alive longer than that, this cronjob will `--force --grace-period=0` it and kill it completely.
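The exact command didn't survive in this thread, but a minimal sketch along those lines (the `select_terminating` helper name is mine) would grep the pod list for the Terminating status and force-delete each match:

```shell
# Print "namespace pod" for every row whose STATUS column is Terminating.
# Expects `kubectl get pods --all-namespaces` output on stdin.
select_terminating() {
  awk '$4 == "Terminating" { print $1, $2 }'
}

# Then force-delete each stuck pod (destructive, so left commented out):
# kubectl get pods --all-namespaces | select_terminating |
#   while read -r ns pod; do
#     kubectl delete pod "$pod" -n "$ns" --force --grace-period=0
#   done
```

Column 4 is STATUS in the default `kubectl get pods --all-namespaces` output; keep in mind that `--force --grace-period=0` only removes the API object, not necessarily the container.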
`kubectl delete pods <pod> --grace-period=0 --force` is a temporary fix; I don't want to run a manual fix every time there is a failover for one of the affected pods. My zookeeper pods aren't terminating in minikube and on Azure AKS.

Update March 9th 2020: I used a preStop lifecycle hook to manually terminate my pods. My zookeeper pods were stuck in a terminating status and wouldn't respond to a TERM signal from within the container. I had basically the same manifest running elsewhere and everything terminates correctly; no clue what the root cause is.
In my experience, `sudo systemctl restart docker` on the node helps (but there is obviously downtime).

And this is still happening periodically on random nodes that are either A) close to memory limits or B) CPU starved (either because of some kswapd0 issue, which might still be memory related, or actual load).
same issue, super annoying
This is very much an active issue still, k8s 1.15.4 and RHEL Docker 1.13.1. All the time pods stay in Terminating but the container is already gone, and k8s cannot figure it out itself; it requires human interaction. Makes test scripting a real PITA.

/reopen /remove-lifecycle rotten
Have the same bug on my local cluster set up using `kubeadm`. `docker ps | grep {pod name}` on the node shows nothing, and the pod is stuck in the terminating state. I currently have two pods in this state.

What can I do to forcefully delete the pod? Or maybe change the name of the pod? I cannot spin up another pod under the same name. Thanks!
@gm42 I was able to manually work around this issue on GKE by:
1. `docker ps | grep {pod name}` to get the Docker container ID
2. `docker rm -f {container id}`
So, it's the end of 2018, kube 1.12 is out, and… you all still have problems with stuck pods?
Seeing a similar symptom: pods stuck in terminating (interestingly, they all have an exec-type probe for readiness/liveness). Looking at the logs I can see: kubelet[1445]: I1022 10:26:32.203865 1445 prober.go:124] Readiness probe for "test-service-74c4664d8d-58c96_default(822c3c3d-082a-4dc9-943c-19f04544713e):test-service" failed (failure): OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown. This message repeats itself forever, and changing the exec probe to tcpSocket seems to allow the pod to terminate (based on a test; will follow up on it). The pod seems to have one of the containers "Running" but not "Ready"; the logs for the "Running" container show as if the service stopped.
Correct. They are always suspect.
@igorleao You can try `kubectl delete pod xxx --now` as well.

Forcing the delete worked for me: `kubectl delete po $pod --grace-period=0 --force`. The `--now` flag wasn't working. I am not sure about #65936, but I would like to not kill the node when Unknown states happen.

I had this yesterday in 1.9.7, with a pod stuck in terminating state; in the logs it just had "needs to kill pod", and I had to `--force --grace-period=0` to get rid of it.

I'm still having this problem on 1.9.6 on an Azure AKS managed cluster.
Using this workaround at the moment to select all stuck pods and delete them (as I end up having swathes of Terminating pods in my dev/scratch cluster):
Same problem here on GKE 1.9.4-gke.1; it seems to be related to volume mounts. It happens every time with filebeats set up as described here: https://github.com/elastic/beats/tree/master/deploy/kubernetes/filebeat

Kubelet log shows this:

`kubectl delete pod NAME --grace-period=0 --force` seems to work. Also restarting kubelet works.

Today I encountered an issue that may be the same as the one described, where we had pods on one of our customer systems getting stuck in the terminating state for several days. We were also seeing the errors about "Error: UnmountVolume.TearDown failed for volume" with "device or resource busy" repeated for each of the stuck pods.
In our case, it appears to be an issue with docker on RHEL/Centos 7.4 based systems covered in this moby issue: https://github.com/moby/moby/issues/22260 and this moby PR: https://github.com/moby/moby/pull/34886/files
For us, once we set the sysctl option fs.may_detach_mounts=1, within a couple of minutes all our Terminating pods cleaned up.
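For reference, a persistent form of that setting (the drop-in file name below is my own choice, not from the comment) is a sysctl.d fragment:

```
# /etc/sysctl.d/99-may-detach-mounts.conf
fs.may_detach_mounts = 1
```

Apply it immediately with `sysctl -w fs.may_detach_mounts=1` (as root), or reload with `sysctl --system`; the option only exists on kernels that carry the RHEL/CentOS patch mentioned above.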
I was only able to get rid of the "stuck in terminating" pods by deleting the finalizers:

`kubectl patch -n mynamespace pod mypod -p '{"metadata":{"finalizers":null}}'`

`kubectl delete pod mypod --force --grace-period=0` didn't work for me.
didn’t work for meEchoing @jingxu97, there are a lot of different issues being discussed in this thread. There are many possible reasons why a Pod could get stuck in terminating. We know this is a common issue with many possible root causes! 😃
If you run into this issue, please ensure you file a new bug with a detailed report, including a full dump of the relevant pod YAMLs and kubelet logs. This information is necessary to debug these issues.
I am going to close this particular issue because it dates back to 1.7 and its scope is not actionable. /close
As an FYI we had a fix just hit the master branch recently for fixing a race condition where pods created and deleted rapidly would get stuck in Terminating: https://github.com/kubernetes/kubernetes/pull/98424
The node team is letting this bake for a bit and ensuring tests are stable; I’m not sure it’ll get backported as it’s a large change.
Currently have pods stuck for 2+ days in the terminating state.
@oscarlofwenhamn As far as I'm aware, this is effectively running SIGKILL on all processes in that pod, ensuring deletion of zombie processes (source: point 6 under 'Termination of Pods' - https://kubernetes.io/docs/concepts/workloads/pods/pod/#:~:text=When the grace period expires,period 0 (immediate deletion).), and successfully removing the pod (it may not happen immediately, but it will happen).

The guide mentions that it removes the reference but does not delete the pod itself (source: 'Force Deletion' - https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/); however, grace-period=0 should effectively SIGKILL your pod, albeit not immediately.
I'm just reading the docs and the recommended ways to handle the scenario I encountered. The issue I specifically encountered was not a recurring issue, just something that happened once; I do believe the REAL fix for this is fixing your deployment, but until you get there, this method should help.
I feel that this comment is relevant because the underlying container (Docker or whatever) may still be running and not fully deleted; the illusion of it being "removed" is a little misleading at times.
This happens to me when a pod is running out of memory. It doesn't terminate until the memory usage goes down again.
@AndrewSav
I don’t see any other solutions here to be frank.
Sure, the cluster will be left in an “inconsistent state”. I’d like to understand what you mean exactly by this. Force closing is bad. I also don’t like it, but in my case, I am comfortable destroying and redeploying any resources as required.
In my case, it seems to only get stuck terminating on the pods which have an NFS mount, and only when the NFS server goes down before the client tries to go down.
@shinebayar-g, the problem with `--force` is that it could mean that your container will keep running. It just tells Kubernetes to forget about this pod's containers. A better solution is to SSH into the VM running the pod and investigate what's going on with Docker. Try to manually kill the containers with `docker kill` and, if successful, attempt to delete the pod normally again.

I just had the problem that the pods were not terminating because a secret was missing. After I created that secret in that namespace, everything was back to normal.
For folks seeing this issue on 1.9.4-gke.1, it is most likely due to https://github.com/kubernetes/kubernetes/issues/61178, which is fixed in 1.9.5 and is being rolled out in GKE this week. The issue is related to the cleanup of subpath mounts of a file (not a directory). @zackify @nodefactory-bk @Tapppi @Stono
IIUC, the original problem in this bug is related to configuration of containerized kubelet, which is different.
Same problem here on GKE 1.9.4-gke.1 Only happens with a specific filebeat daemonset, but recreating all nodes doesn’t help either, it just keeps happening.
I have found the reason in my 1.7.2 cluster: another monitoring program mounts the root path /, and the root path contains /var/lib/kubelet/pods/ddc66e10-0711-11e8-b905-6c92bf70b164/volumes/kubernetes.io~secret/default-token-bnttf, so when kubelet deletes the pod it can't release the volume; the message is: device or resource busy.

Steps:
1. `sudo journalctl -u kubelet` - this helped me find the error message.
2. `sudo docker inspect <container-id>` - find `io.kubernetes.pod.uid": "ddc66e10-0711-11e8-b905-6c92bf70b164"` and HostConfig --> Binds --> "/var/lib/kubelet/pods/ddc66e10-0711-11e8-b905-6c92bf70b164/volumes/kubernetes.io~secret/default-token-bnttf:/var/run/secrets/kubernetes.io/serviceaccount:ro"
3. `grep -l ddc66e10-0711-11e8-b905-6c92bf70b164 /proc/*/mountinfo`
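Those steps can be folded into a small helper (a sketch; the function name and the optional proc-root argument are mine, the latter only so the parsing can be exercised outside a real node):

```shell
# Given a pod UID, print the PIDs of processes whose mountinfo still
# references that UID (i.e. processes holding the pod's volume mounts).
find_mount_holders() {
  pod_uid="$1"
  proc_root="${2:-/proc}"
  grep -l "$pod_uid" "$proc_root"/[0-9]*/mountinfo 2>/dev/null |
    sed 's|.*/\([0-9][0-9]*\)/mountinfo$|\1|'
}

# On the node, with the UID found via `docker inspect`:
# find_mount_holders ddc66e10-0711-11e8-b905-6c92bf70b164
```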
Have the same bug on my 1.7.2.

I've seen it too. Can't check logs because kubectl complains it can't connect to the docker container, and I can't create a new pod due to the current existence of the terminating pod. Rather annoying.
As a workaround I wrote a script which grabs the last lines from /var/log/syslog and searches for errors like "Operation for…remove /var/lib/kubelet/pods … directory not empty", "nfs…device is busy…unmount.nfs", or "stale NFS file handle". Then it extracts either the pod_id or the pod's full directory and sees what mounts it has (like `mount | grep $pod_id`), then unmounts them all and removes the corresponding directories. Eventually kubelet does the rest, gracefully shuts down, and deletes the pods. No more pods in Terminating state.

I put that script in cron to run every minute. As a result - no issue for now, even 3-4 months later. Note: I know this approach is unreliable and it requires a check on every cluster upgrade, but it works!
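The log-parsing core of such a script might look like this (a sketch under my own naming; the real script, per the comment, also unmounts the leftover mounts and removes the directories):

```shell
# Extract unique pod directories from kubelet error lines on stdin, e.g.
# "Operation for ... remove /var/lib/kubelet/pods/<uid> ... directory not empty".
extract_pod_dirs() {
  grep -o '/var/lib/kubelet/pods/[0-9a-f-]*' | sort -u
}

# On the node, roughly (destructive, so left commented out):
# tail -n 200 /var/log/syslog | extract_pod_dirs | while read -r dir; do
#   mount | grep "$dir" | awk '{ print $3 }' | xargs -r -n1 umount -l
# done
```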
We are facing the same issue since we started mounting secrets (shared with many pods). The pod goes into the terminating state and stays there forever. Our version is v1.10.0. The attached Docker container is gone, but the reference in the API server remains unless I forcefully delete the pod with the `--grace-period=0 --force` option.

Looking for a permanent solution.
Faced with this today. What was done:
1. `kubectl get pods` shows my stuck container as `0/1 Terminating` (was `1/1 Terminating`)
2. Removed the `finalizers` section from the pod (mine was `foregroundDeletion`) via `$ kubectl edit pod/name` --> container removed from the pods list

I removed my stuck pods like this:
I tried adding timeo=30 and intr, but same issue. This locks it up; I must log in to the node and do a `umount -f -l` on the underlying mount, and then I can do a `kubectl delete --force --grace-period 0` on the pod.

Since this was mounted on behalf of the pod, it seems like it could possibly be unmounted (or force-unmounted after some timeout) on delete automatically.
The attached file reproduces this for me on GKE: k8s-nfs-test.yaml.txt

Run it, then delete it. You will get an 'nfs-client' pod stuck in deleting. The reason is the hard mount on the node, and that the 'server' is deleted first.
Hitting this with k8s 1.9.6: when kubelet is unable to unmount a CephFS mount, all pods on the node stay Terminating forever. Had to restart the node to recover; a kubelet or docker restart did not help.
I spoke too soon.
Had to destroy it in the brutal fashion.
On GKE, upgrading nodes helped instantly.
Affected by the same bug on GKE. Are there any known workarounds for this issue? Using `--now` does not work.

Looks like there are two different bugs related to this issue. We have both on our 1.8.3 cluster.

And it's true the directory is not empty: it's unmounted and contains our "subpath" directory! One explanation of such behavior:
Usually volume and network cleanup consume more time in termination. Can you find out in which phase your pod is stuck? Volume cleanup, for example?
Just to add to the possible causes and for the benefit of those whose google search takes them here. I have a cluster on AWS EKS where namespaces and pods are created/terminated several times a day. Today for the first time I saw a problem where a bunch of pods got stuck in a terminating state for several hours.
I believe the cause of this was a bad underlying node:
➜ ~ kubectl get nodes -A
NAME STATUS ROLES AGE VERSION
ip-10-x-x-x.ec2.internal NotReady <none> 4h54m v1.16.13-eks-ec92d4
(a standard AWS instance hardware/unresponsive thing)
So what I think happened is that the pods on that node got into a state whereby they could not be terminated. Terminations of pods on good nodes were unaffected. To fix I drained the pods off the bad node and as this in an ASG I was able to just terminate the instance and spin up a new one.
@elrok123 Brilliant - I was indeed ill-informed. I’ve updated my response above, referencing this explanation. Thanks for the detailed response, and a further motivated method for dealing with troublesome pods. Cheers!
<del>Also, the `--force` flag doesn't necessarily mean the pod is removed; it just doesn't wait for confirmation (and drops the reference, to my understanding). As stated by the warning: "The resource may continue to run on the cluster indefinitely."</del>

Edit: I was ill-informed. See elrok123's comment below for further motivation.
@mikesplain: Reopened this issue.
I am using version 1.10 and I experienced this issue today. I think my problem is related to the issue of mounting a secret volume, which might have left some task pending and the pod in terminating status forever.

I had to use the `--grace-period=0 --force` option to terminate the pods.
root@ip-10-31-16-222:/var/log# journalctl -u kubelet | grep dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: I0320 15:50:31.179901 528 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "config-volume" (UniqueName: "kubernetes.io/configmap/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-config-volume") pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds" (UID: "e3d7c57a-4b27-11e9-9aaa-0203c98ff31e") Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: I0320 15:50:31.179935 528 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-xjlgc" (UniqueName: "kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-default-token-xjlgc") pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds" (UID: "e3d7c57a-4b27-11e9-9aaa-0203c98ff31e") Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: I0320 15:50:31.179953 528 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "secret-volume" (UniqueName: "kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume") pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds" (UID: "e3d7c57a-4b27-11e9-9aaa-0203c98ff31e") Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:31.310200 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:31.810156118 +0000 UTC m=+966792.065305175 (durationBeforeRetry 500ms). 
Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:31 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:31.885807 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:32.885784622 +0000 UTC m=+966793.140933656 (durationBeforeRetry 1s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxxx-com\" not found" Mar 20 15:50:32 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:32.987385 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:34.987362044 +0000 UTC m=+966795.242511077 (durationBeforeRetry 2s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:35 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:35.090836 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. 
No retries permitted until 2019-03-20 15:50:39.090813114 +0000 UTC m=+966799.345962147 (durationBeforeRetry 4s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:39 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:39.096621 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:50:47.096593013 +0000 UTC m=+966807.351742557 (durationBeforeRetry 8s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:50:47 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:50:47.108644 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:51:03.10862005 +0000 UTC m=+966823.363769094 (durationBeforeRetry 16s). 
Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found" Mar 20 15:51:03 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:51:03.133029 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:51:35.133006645 +0000 UTC m=+966855.388155677 (durationBeforeRetry 32s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found" Mar 20 15:51:35 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:51:35.184310 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:52:39.184281161 +0000 UTC m=+966919.439430217 (durationBeforeRetry 1m4s). 
Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxx-com\" not found"
Mar 20 15:52:34 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:52:34.005027 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod
Mar 20 15:52:34 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:52:34.005085 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]
Mar 20 15:52:39 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:52:39.196332 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:54:41.196308703 +0000 UTC m=+967041.451457738 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found"
Mar 20 15:54:41 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:54:41.296252 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:56:43.296229192 +0000 UTC m=+967163.551378231 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found"
Mar 20 15:54:48 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:54:48.118620 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod
Mar 20 15:54:48 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:54:48.118681 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]
Mar 20 15:56:43 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:56:43.398396 528 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\" (\"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\")" failed. No retries permitted until 2019-03-20 15:58:45.398368668 +0000 UTC m=+967285.653517703 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"secret-volume\" (UniqueName: \"kubernetes.io/secret/e3d7c57a-4b27-11e9-9aaa-0203c98ff31e-secret-volume\") pod \"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds\" (UID: \"e3d7c57a-4b27-11e9-9aaa-0203c98ff31e\") : secrets \"data-platform.xxxx-com\" not found"
Mar 20 15:57:05 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:57:05.118566 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod
Mar 20 15:57:05 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:57:05.118937 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]
Mar 20 15:59:22 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:59:22.118593 528 kubelet.go:1640] Unable to mount volumes for pod "dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)": timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume config-volume default-token-xjlgc]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]; skipping pod
Mar 20 15:59:22 ip-10-31-16-222.eu-west-2.compute.internal kubelet[528]: E0320 15:59:22.118624 528 pod_workers.go:186] Error syncing pod e3d7c57a-4b27-11e9-9aaa-0203c98ff31e ("dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds_default(e3d7c57a-4b27-11e9-9aaa-0203c98ff31e)"), skipping: timeout expired waiting for volumes to attach or mount for pod "default"/"dp-tag-change-ingestion-com-depl-5bd59f74c4-589ds". list of unmounted volumes=[secret-volume config-volume default-token-xjlgc]. list of unattached volumes=[secret-volume config-volume default-token-xjlgc]
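The repeating error in these logs is that the Secret referenced by `secret-volume` does not exist in the pod's namespace, so the mount can never succeed. A quick way to confirm this on a cluster (the secret and namespace names are taken from the log above; the `create` command is a hypothetical sketch, the actual contents depend on your application):

```shell
# Confirm whether the Secret the kubelet is looking for actually exists
kubectl get secret data-platform.xxxx-com -n default

# If it is missing, the pod stays stuck until the Secret is created, e.g.
# (placeholder key/value -- substitute your application's real data):
kubectl create secret generic data-platform.xxxx-com -n default \
  --from-literal=placeholder=value
```

Once the Secret exists, the kubelet's periodic retry should pick it up and the volume mount should succeed without recreating the pod.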
I am still getting stuck with this issue with k8s v1.11.0. Here is a check-list of what I do to clean up my pods:
- check the pod's events and related resources with `kubectl get`; some of them are only known to the Kubelet the pod is running on, so you will have to follow its log stream locally
- `umount` directories that Kubelet complains about with `device or resource busy` messages
- `kubectl edit` the failed pod and remove the `finalizers:` entry containing `- foregroundDeletion`
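The finalizer-removal step can also be done non-interactively instead of through `kubectl edit`; here is a hedged sketch using `kubectl patch` (the pod name is hypothetical):

```shell
# Inspect which finalizers are still set on the stuck pod
kubectl get pod my-stuck-pod -o jsonpath='{.metadata.finalizers}'

# Clear them so the API server can finish the delete. This is a last resort:
# make sure nothing on the node still needs the pod's mounts or network.
kubectl patch pod my-stuck-pod --type=merge -p '{"metadata":{"finalizers":null}}'
```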
Two more tips:
- keep a `kubectl delete` command blocked in another window to monitor your progress (even on a pod you already "deleted" many times)
- `kubectl delete` will terminate as soon as the last stuck resource gets released

@agolomoodysaada Ah, that makes sense. Thanks for the explanation. So I wouldn't really know whether the actual container is really deleted or not, right?
When a pod is terminated, we do unmount the volume (assuming that the server is still there). If you are seeing dangling mounts even when the server exists, then that is a bug.
If you use dynamic provisioning with PVCs and PVs, then we don’t allow the PVC (and underlying storage) to be deleted until all Pods referencing it are done using it. If you want to orchestrate the provisioning yourself, then you need to ensure you don’t delete the server until all pods are done using it.
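The protection described above shows up as a finalizer on the claim in current Kubernetes versions; a sketch of how to observe it (the claim name is hypothetical):

```shell
# A PVC that is still referenced by a pod carries the
# kubernetes.io/pvc-protection finalizer; after a delete it reports
# "Terminating" until the last pod using it is gone.
kubectl get pvc my-claim -o jsonpath='{.metadata.finalizers}'
kubectl describe pvc my-claim
```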
I’m not sure if this is the same issue, but we have started noticing this behaviour since upgrading from 1.9.3 to 1.10.1; it never happened before that. We’re using glusterfs volumes with SubPath. Kubelet continuously logs things like
and `lsof` indeed shows that the directory under the glusterfs volumes is still in use:
This was all fine on 1.9.3, so it’s as if the fix for this issue has broken our use case 😦
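For busy subpath mounts like the glusterfs case above, one workaround is to find the leftover mounts on the node and lazily unmount them. A hedged sketch (the paths are placeholders for the real pod UID and volume directories on your node):

```shell
# On the affected node: list leftover subpath mounts the kubelet complains about
grep volume-subpaths /proc/mounts

# See which processes still hold the directory open
lsof +D /var/lib/kubelet/pods/<pod-uid>/volume-subpaths/ 2>/dev/null

# Lazy unmount detaches the mount even while it is busy (use with care;
# processes with open handles keep their view until they close them)
umount -l /var/lib/kubelet/pods/<pod-uid>/volume-subpaths/<volume>/<container>/<n>
```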
Same issue here on Azure, Kube 1.8.7
For some it might help. We are running kubelet in a Docker container with the `--containerized` flag and were able to work around this issue by mounting `/rootfs`, `/var/lib/docker`, and `/var/lib/kubelet` as shared mounts. Final mounts look like this:

Some more details: this does not properly solve the problem, as for every bind mount you get 3 mounts inside the kubelet container (2 of them parasites). But at least shared mounts make it easy to unmount them all in one shot.
CoreOS does not have this problem, because it uses rkt and not Docker for the kubelet container. In our case kubelet runs in Docker, and every mount inside the kubelet container gets propagated into `/var/lib/docker/overlay/...` and `/rootfs`; that’s why we have two parasite mounts for every bind-mount volume:
- `/rootfs` in `/rootfs/var/lib/kubelet/<mount>`
- `/var/lib/docker` in `/var/lib/docker/overlay/.../rootfs/var/lib/kubelet/<mount>`
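The shared-mount workaround above can be expressed with Docker's mount-propagation flags. A sketch, not the full kubelet invocation (the image name and trailing kubelet arguments are illustrative placeholders):

```shell
# Make the host side of the kubelet state directory a shared mount first,
# so propagation works in both directions
mount --make-shared /var/lib/kubelet

# Start the containerized kubelet with :shared propagation on the key paths,
# so mounts created inside the container propagate back to the host
docker run -d --privileged --net=host --pid=host \
  -v /:/rootfs:ro,shared \
  -v /var/lib/docker:/var/lib/docker:shared \
  -v /var/lib/kubelet:/var/lib/kubelet:shared \
  some-hyperkube-image /hyperkube kubelet --containerized ...
```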