kubernetes: Pods get stuck in ContainerCreating state when pulling an image takes a long time
What happened: If pulling a Docker image takes a bit longer than usual, the pods are never created.
What you expected to happen: Kubernetes to wait or retry pulling.
How to reproduce it (as minimally and precisely as possible): Throttle the download speed to the Docker registry you are pulling from, or use a registry that only gives you low bandwidth, e.g. 512 kbit/s.
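One way to simulate the low-bandwidth registry, as a rough sketch: police inbound traffic on the worker node with tc. The interface name eth0 and the numbers are assumptions, and this throttles all inbound traffic on that interface, not just traffic from the registry.
# add an ingress policer that drops anything above roughly 512 kbit/s
$ sudo tc qdisc add dev eth0 handle ffff: ingress
$ sudo tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match ip src 0.0.0.0/0 police rate 512kbit burst 64k drop flowid :1
# reproduce the pod creation / image pull, then remove the throttle again
$ sudo tc qdisc del dev eth0 ingress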
Anything else we need to know?: Below is an example where the pull has already been hanging for over 15 minutes. Of course, if I go to the worker node and run a manual docker pull, it completes within a few minutes. We have seen pods stuck even longer after creation, and it has been happening on older Kubernetes versions as well; I expect it will be the same on 1.14 and newer.
$ kubectl -n spinnaker describe pod spin-gate-55999bc58-47zz7
Name: spin-gate-55999bc58-47zz7
Namespace: spinnaker
Priority: 0
Node: depkbw102/10.16.53.35
Start Time: Thu, 03 Oct 2019 19:43:27 +0000
Labels: app=gate
load-balancer-spin-gate=true
pod-template-hash=55999bc58
Annotations: cni.projectcalico.org/podIP: 10.23.130.18/32
prometheus.io/path: /prometheus_metrics
prometheus.io/port: 8008
prometheus.io/scrape: true
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/spin-gate-55999bc58
Containers:
gate:
Container ID:
Image: dockerregistry.example.com/devops/spinnaker-gate:v1.15.3-49
Image ID:
Port: 8084/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Readiness: http-get https://:8084/health delay=20s timeout=1s period=10s #success=1 #failure=3
Environment:
JAVA_OPTS: -Xms1g -Xmx4g
DUMMY: dummy10
Mounts:
/opt/spinnaker/certs from spinnaker-ssl (rw)
/opt/spinnaker/config from spinnaker-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gt426 (ro)
monitoring:
Container ID:
Image: gcr.io/spinnaker-marketplace/monitoring-daemon:0.14.0-20190702202823
Image ID:
Port: 8008/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/opt/spinnaker-monitoring/config from monitoring-config (rw)
/opt/spinnaker-monitoring/registry from monitoring-registry (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gt426 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
spinnaker-config:
Type: Secret (a volume populated by a Secret)
SecretName: spinnaker-config
Optional: false
spinnaker-ssl:
Type: Secret (a volume populated by a Secret)
SecretName: spinnaker-ssl
Optional: false
monitoring-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: monitoring-config
Optional: false
monitoring-registry:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: monitoring-registry
Optional: false
default-token-gt426:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gt426
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned spinnaker/spin-gate-55999bc58-47zz7 to depkbw102
Normal Pulling 15m kubelet, depkbw102 pulling image "dockerregistry.example.com/devops/spinnaker-gate:v1.15.3-49"
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.11", GitCommit:"25074a190ef2a07d8b0ed38734f2cb373edfb868", GitTreeState:"clean", BuildDate:"2019-09-18T14:34:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: VMware ESXi 6.5
- OS (e.g. cat /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2191.5.0
VERSION_ID=2191.5.0
BUILD_ID=2019-09-04-0357
PRETTY_NAME="Container Linux by CoreOS 2191.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
- Kernel (e.g. uname -a):
Linux depkbw102 4.19.68-coreos #1 SMP Wed Sep 4 02:59:18 -00 2019 x86_64 Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz GenuineIntel GNU/Linux
- Install tools: CoreOS OVA with Ignition config passed
- Network plugin and version (if this is a network-related bug): calico 3.5.4
I am having this issue right now. Pods are stuck either at ContainerCreating or at Init because it says they are still pulling the image.
I went to the node that is trying to pull the image and pulled the Docker image manually; the pod is still stuck.
Facing exactly the same issue. On my all-in-one Kubernetes setup I also pulled the Docker image manually, and the pod is still stuck.
Please contact AWS Support!
Have you tried setting --serialize-image-pulls=false when starting kubelet? By default that flag is true, which means kubelet pulls images one at a time. I think the pulls are not actually stuck; they are just waiting for the current pull to finish. Below is the serial image puller; as you can see, one slow pull can block all other pulls on the node: https://github.com/kubernetes/kubernetes/blob/ca0e694d637f9e6feffd7330f66b081769c1c91b/pkg/kubelet/images/puller.go#L60-L95
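A minimal sketch of one way to set that flag, assuming a kubeadm-style systemd unit that appends $KUBELET_EXTRA_ARGS to the kubelet command line; the drop-in file name is made up, and installs driven by a kubelet config file would set serializeImagePulls: false there instead:
# write a systemd drop-in adding the flag (check that KUBELET_EXTRA_ARGS is not already set elsewhere)
$ sudo mkdir -p /etc/systemd/system/kubelet.service.d
$ printf '[Service]\nEnvironment="KUBELET_EXTRA_ARGS=--serialize-image-pulls=false"\n' | sudo tee /etc/systemd/system/kubelet.service.d/20-parallel-image-pulls.conf
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
# verify the running kubelet picked the flag up
$ pgrep -a kubelet | tr ' ' '\n' | grep serialize-image-pulls
With parallel pulls, one throttled registry no longer blocks pulls of other images on the node, although the slow pull itself is of course still slow.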
Just had to support an outage caused by both pods of a deployment getting stuck in image pull. The image is nowhere near gigabytes in size and should not take more than ~2-3 minutes to pull.
Same problem on a Kubernetes cluster installed on Ubuntu 20.04 running on VMware Cloud VMs.
The problem was solved by restarting kubelet.
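For reference, a minimal sketch of that workaround, assuming kubelet runs as a systemd service on the affected worker node (true for most installs, including the CoreOS nodes from the original report):
# restart kubelet on the node whose pods are stuck; pulls usually resume afterwards
$ sudo systemctl restart kubelet
$ systemctl status kubelet --no-pager
# then watch the stuck pods from wherever your kubeconfig lives (namespace from the original report; use your own)
$ kubectl -n spinnaker get pods -w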
We also see this issue on a regular basis whenever we switch GKE to preemptible nodes rather than dedicated ones. Unsure whether it is because they are preemptible, or just that preemptible nodes restart regularly and these kinds of low-probability errors therefore occur more frequently.
Just had this problem affecting 6 nodes of a 7-node cluster. Restarting the kubelet service appears to clear the problem.
Kubernetes v1.18.3
Edit: This correlates with an error message repeating over and over in the kubelet logs.
I found issue 56850 related to that message, but it is quite old and specific to Red Hat (we are on Ubuntu Server 18.04), so I don't think it is exactly the same.
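For anyone trying to confirm the same pattern, a rough sketch of how one might spot a repeating kubelet log message, assuming kubelet runs under systemd (the time window is arbitrary, and the actual error text from this comment was not captured above):
# dump recent kubelet logs without timestamps so identical messages can be counted
$ sudo journalctl -u kubelet --since "1 hour ago" --no-pager -o cat > /tmp/kubelet.log
# messages that embed pod names will not collapse perfectly, but heavy repeats still float to the top
$ sort /tmp/kubelet.log | uniq -c | sort -rn | head -n 20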
I see this as well on EKS (the image size is 2 GB).
For anyone else wondering, I’ve moved off of GCP and the problem hasn’t happened since then.
Have the same issue. Kubernetes v1.18.3, Docker 19.03.9.
I think this is a dockerd bug. Restarting dockerd triggers the pod's image pull again, and then it works.
I also checked dockerd's goroutine stacks and found an image-pull goroutine stuck in pullSchema2Layers. As long as that goroutine exists, new pods on the node can no longer pull images.
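For anyone who wants to check the same thing, a hedged sketch: sending SIGUSR1 makes dockerd dump all of its goroutine stack traces, and the daemon log reports where the dump file was written (the exact path varies by Docker version; the grep target is just the function named above):
# ask dockerd to dump its goroutine stacks
$ sudo kill -SIGUSR1 "$(pidof dockerd)"
# the daemon log says where the dump landed, typically somewhere under /var/run/docker/
$ sudo journalctl -u docker --since "2 minutes ago" | grep -i "goroutine stacks"
# then look for the stuck pull in the dump file reported above
$ sudo grep -n "pullSchema2Layers" /var/run/docker/goroutine-stacks-*.log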
I have observed this issue with at least 20 pods stuck in the same state described in the issue. I just restarted the kubelet service and all the pods came up. It would be helpful if someone could explain how to prevent this from happening. 😕
/remove-lifecycle rotten
This issue needs to be taken care of. It is a major problem.