origin: Pods stuck in Terminating in 3.2.0.5
Created a lot of pods (1000 across 2 nodes, approx 500 per node). Then deleted the namespace:
# oc delete ns clusterproject0
Deletion made some progress – only 319 of the original 1000 pods remain – but it now refuses to go any further. Environment state and logs are below.
I haven’t seen this before – last similar run was on 3.2.0.1, though I only went to 250 pods per node at that code level (things worked ok).
# openshift version
openshift v3.2.0.5
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5
# oc get no
NAME STATUS AGE
dell-r620-01.perf.lab.eng.rdu.redhat.com Ready,SchedulingDisabled 19d
dell-r730-01.perf.lab.eng.rdu.redhat.com Ready 19d
dell-r730-02.perf.lab.eng.rdu.redhat.com Ready 19d
# oc get ns
NAME STATUS AGE
clusterproject0 Terminating 1d
default Active 19d
management-infra Active 19d
openshift Active 19d
openshift-infra Active 19d
# oc delete ns clusterproject0
Error from server: namespaces "clusterproject0" cannot be updated: The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.
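One way to see why the namespace is held in Terminating (a generic check on my part, not something from the original report) is to look at the namespace object itself:
# oc get ns clusterproject0 -o yaml
status.phase will show Terminating, and spec.finalizers will keep listing kubernetes until the namespace controller has finished removing everything in the project.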
All of the pods stuck in Terminating were scheduled and running on one of the nodes. The other node was able to successfully terminate all of the pods that were running on it.
From the master:
Mar 21 12:18:36 dell-r620-01.perf.lab.eng.rdu.redhat.com atomic-openshift-master[6683]: E0321 12:18:36.451332 6683 namespace_controller.go:139] unexpected items still remain in namespace: clusterproject0 for gvr: { v1 pods}
Mar 21 12:18:37 dell-r620-01.perf.lab.eng.rdu.redhat.com atomic-openshift-master[6683]: W0321 12:18:37.223252 6683 reflector.go:289] /usr/lib/golang/src/runtime/asm_amd64.s:2232: watch of *api.ServiceAccount ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [24072046/23943693]) [24073045]
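Since the controller says pods still remain, a quick way to confirm which ones and where they were scheduled (a minimal check, assuming a kubeconfig with access to the project) is:
# oc get pods -n clusterproject0 -o wide | grep Terminating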
From the node that can’t terminate its pods:
Mar 21 12:22:49 dell-r730-01.perf.lab.eng.rdu.redhat.com atomic-openshift-node[7438]: W0321 12:22:49.119784 7438 kubelet.go:1850] Unable to retrieve pull secret clusterproject0/default-dockercfg-woua1 for clusterproject0/hellopods505 due to secrets "default-dockercfg-woua1" not found. The image pull may not succeed.
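On that node it may also be worth checking whether docker still has containers for one of the affected pods; hellopods505 below is just the pod name taken from the warning above:
# docker ps -a | grep hellopods505
# journalctl -u atomic-openshift-node | grep hellopods505 | tail -n 20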
About this issue
- State: closed
- Created 8 years ago
- Comments: 59 (52 by maintainers)
@metal3d
oc delete pod/<name of pod> --grace-period=0 will force deletion. @anandbaskaran if the command provided by @ncdc still hangs, you can try forcing it with --force. That gives: oc delete pod/<name of pod> --grace-period=0 --force
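A sketch for applying the same force deletion to every pod still stuck in the project (my own one-liner, not from the thread; clusterproject0 is the namespace from this report):
# oc get pods -n clusterproject0 -o name | xargs -r oc delete -n clusterproject0 --grace-period=0 --force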
Upstream PR: https://github.com/kubernetes/kubernetes/pull/23746
I was finally able to reproduce this on a 3-node cluster after many hours of creating 500 pods, waiting for 200 to run, and then tearing down the project. For many hours it was fine, but I did notice the nodes were getting less and less successful at getting the full set of 200 pods into a running state as the test went on; they were still always able to properly tear down all the pods.
After much time, I was left in a state where 1 of the nodes finally failed to tear down 3 pods. The node continued to report a valid heartbeat back to the API server, but it would no longer launch new pods that were scheduled to it. I was able to ssh into the machine and do a little more sleuthing.
The kubelet actually did see the notification from the watch source about the pod. Looking at the container in question:
You can see it was stuck in the Waiting state with a ContainerCreating status.
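For anyone following along, the same thing should be visible from the API side (a generic check, with <pod> standing in for the stuck pod's name):
# oc describe pod <pod> -n clusterproject0
The Containers section of the output should show State: Waiting with Reason: ContainerCreating.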
Looking at the docker logs:
So my current theory is the following:
I have stashed the logs away to analyze more tomorrow.