kubernetes: Container pods stuck in state Unknown

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.3", GitCommit:"029c3a408176b55c30846f0faedf56aae5992e9b", GitTreeState:"clean", BuildDate:"2017-02-15T06:34:56Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GKE

What happened: Pods are stuck in “Unknown” state and cannot be deleted. The cluster got into this state while I was submitting many pods at once.

What you expected to happen: To be able to delete those pods.

Anything else we need to know: Broken pods have the following two lines of status (from “kubectl get pods”) that look like this:

  43m   43m     4       {kubelet gke-caffe-node-pool-2-c3b69e45-qxtj}                                                                           Warning FailedSync              Error syncing pod, skipping: network is not ready: [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
  36m   36m     1       {controllermanager }                                                                                                    Normal  NodeControllerEviction  Marking for deletion Pod downloader-21f7208a-f326-4a94-b2a5-80a84ef94aa3-wxzfn from Node gke-caffe-node-pool-2-c3b69e45-qxtj

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 10
  • Comments: 21 (2 by maintainers)

Most upvoted comments

Use kubectl delete pods <unknown pod name> --grace-period=0 --force

In my experience this means that you have a wedged docker daemon. kubectl get pods -o wide | grep Unknown will help you determine which node might be at fault. My general remediation strategy is not to fiddle with docker and just terminate the node.
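A minimal sketch of that remediation, using the node name from the events above as a placeholder; the exact VM cleanup step depends on your provider (on GKE the node pool's instance group normally recreates a deleted instance):

  # Find which node the Unknown pods are scheduled on
  kubectl get pods --all-namespaces -o wide | grep Unknown

  # Keep new pods off the suspect node
  kubectl cordon gke-caffe-node-pool-2-c3b69e45-qxtj

  # Remove the node object, then delete or recycle the underlying VM
  # (e.g. on GKE: gcloud compute instances delete gke-caffe-node-pool-2-c3b69e45-qxtj)
  kubectl delete node gke-caffe-node-pool-2-c3b69e45-qxtj

  # Force-delete the pods that were stuck on it
  kubectl delete pods <unknown pod name> --grace-period=0 --force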

Use kubectl delete pods <unknown pod name> --grace-period=0 --force

kubectl delete pods --all --grace-period=0 --force

This stops the pod from showing in get pods output, but it is not clear whether deletion actually occurs, especially given the message received after running kubectl delete pods <unknown pod name> --grace-period=0 --force:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
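One way to check whether anything is actually still running after such a force delete (a sketch, assuming you have SSH access to the node and that it uses the Docker runtime) is to look for the pod's containers directly on the node; the kubelet embeds the pod name in the container names:

  # On the affected node, e.g. via: gcloud compute ssh gke-caffe-node-pool-2-c3b69e45-qxtj
  sudo docker ps -a | grep downloader-21f7208a-f326-4a94-b2a5-80a84ef94aa3-wxzfn

  # If a container is still running there, it can be removed by hand
  sudo docker rm -f <container id>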

For people who end up here like me, I wanted to share my solution (for Kubernetes v1.10). I ran the help for delete pod ($ kubectl delete pod --help), and it has the following usage example, which worked for me:

# Force delete a pod on a dead node
  kubectl delete pod foo --grace-period=0 --force

In our case, the pod was running on a node backed by an AWS EC2 instance; restarting the node did the trick.

The only thing that worked for me was to replace all nodes at once. Working with the GCP support team, we figured out that the issue was triggered by pods without memory limits, which caused the OOM killer to run on the nodes, sometimes killing processes it shouldn't. Even worse, the scheduler rescheduled these troublesome pods onto other nodes, effectively poisoning the entire cluster. This is definitely something that should be prevented, but it can at least be mitigated by setting default memory limits and making sure the limits on your pods are not too high.
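A minimal sketch of that mitigation (the file name, namespace, and the 256Mi/512Mi values are illustrative assumptions): a LimitRange gives every container in a namespace a default memory request and limit unless the pod spec sets its own.

  # limitrange.yaml: namespace-wide default memory request/limit (values are illustrative)
  apiVersion: v1
  kind: LimitRange
  metadata:
    name: default-memory-limits
  spec:
    limits:
    - type: Container
      default:            # limit applied when the container sets none
        memory: 512Mi
      defaultRequest:     # request applied when the container sets none
        memory: 256Mi

  # Apply it to the namespace the workloads run in
  kubectl apply -f limitrange.yaml -n default

With defaults in place, a runaway container is killed by its own cgroup limit instead of pushing the whole node into the system OOM killer.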