kubernetes: Pods get stuck in ContainerCreating state when pulling image takes long

What happened: If pulling a Docker image takes longer than usual, the containers are never created and the pod stays stuck in ContainerCreating.

What you expected to happen: Kubernetes to wait or retry pulling.

How to reproduce it (as minimally and precisely as possible): Throttle the download speed to the Docker registry you are pulling from, or use a registry that only gives you low bandwidth, e.g. 512 kbit/s.
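
For reproduction, a minimal sketch of one way to throttle image downloads on the worker node with tc ingress policing; eth0 is an assumed interface name, and 512kbit matches the rate mentioned above:

# Assumption: eth0 is the interface the node uses to reach the registry.
$ sudo tc qdisc add dev eth0 handle ffff: ingress
$ sudo tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 512kbit burst 10k drop flowid :1
# ...deploy a pod that pulls a large image and watch it sit in ContainerCreating...
$ sudo tc qdisc del dev eth0 ingress   # remove the throttle afterwards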

Anything else we need to know?: Below is an example where the pod has already been hanging for over 15 minutes. Of course, if I go to the worker node and do a manual docker pull, it completes within a few minutes. We've seen pods with an even longer age since creation. It has been happening on older Kubernetes versions as well, and I expect it will be the same on 1.14 and newer.

$ kubectl -n spinnaker describe pod spin-gate-55999bc58-47zz7
Name:           spin-gate-55999bc58-47zz7
Namespace:      spinnaker
Priority:       0
Node:           depkbw102/10.16.53.35
Start Time:     Thu, 03 Oct 2019 19:43:27 +0000
Labels:         app=gate
                load-balancer-spin-gate=true
                pod-template-hash=55999bc58
Annotations:    cni.projectcalico.org/podIP: 10.23.130.18/32
                prometheus.io/path: /prometheus_metrics
                prometheus.io/port: 8008
                prometheus.io/scrape: true
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/spin-gate-55999bc58
Containers:
  gate:
    Container ID:   
    Image:          dockerregistry.example.com/devops/spinnaker-gate:v1.15.3-49
    Image ID:       
    Port:           8084/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Readiness:      http-get https://:8084/health delay=20s timeout=1s period=10s #success=1 #failure=3
    Environment:
      JAVA_OPTS:  -Xms1g -Xmx4g
      DUMMY:      dummy10
    Mounts:
      /opt/spinnaker/certs from spinnaker-ssl (rw)
      /opt/spinnaker/config from spinnaker-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gt426 (ro)
  monitoring:
    Container ID:   
    Image:          gcr.io/spinnaker-marketplace/monitoring-daemon:0.14.0-20190702202823
    Image ID:       
    Port:           8008/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /opt/spinnaker-monitoring/config from monitoring-config (rw)
      /opt/spinnaker-monitoring/registry from monitoring-registry (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gt426 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  spinnaker-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spinnaker-config
    Optional:    false
  spinnaker-ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spinnaker-ssl
    Optional:    false
  monitoring-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      monitoring-config
    Optional:  false
  monitoring-registry:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      monitoring-registry
    Optional:  false
  default-token-gt426:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gt426
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From                Message
  ----    ------     ----  ----                -------
  Normal  Scheduled  15m   default-scheduler   Successfully assigned spinnaker/spin-gate-55999bc58-47zz7 to depkbw102
  Normal  Pulling    15m   kubelet, depkbw102  pulling image "dockerregistry.example.com/devops/spinnaker-gate:v1.15.3-49"

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.11", GitCommit:"25074a190ef2a07d8b0ed38734f2cb373edfb868", GitTreeState:"clean", BuildDate:"2019-09-18T14:34:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: VMware ESXi 6.5

  • OS (e.g: cat /etc/os-release):

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2191.5.0
VERSION_ID=2191.5.0
BUILD_ID=2019-09-04-0357
PRETTY_NAME="Container Linux by CoreOS 2191.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a):
Linux depkbw102 4.19.68-coreos #1 SMP Wed Sep 4 02:59:18 -00 2019 x86_64 Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz GenuineIntel GNU/Linux
  • Install tools:

coreos ova with ignition config passed

  • Network plugin and version (if this is a network-related bug): calico 3.5.4

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Reactions: 7
  • Comments: 61 (9 by maintainers)

Most upvoted comments

I am having this issue right now. Pods are either stuck at ContainerCreating or Init because it says it’s pulling the image.

I go to the node that is trying to pull the image and pull the Docker image manually. The pod is still stuck.

Same problem here; pods get stuck at the ContainerCreating step. When running kubectl describe pod, it indicates that the last event is a Pulling event, i.e.:

Normal Pulling 13m kubelet, franck-lenovo-z70-80 Pulling image "jboss/keycloak"

Facing exactly the same issue. On my all-in-one Kube setup, I also pulled the Docker image manually; the pod is still stuck.

Events:
  Type    Reason     Age   From                     Message
  ----    ------     ----  ----                     -------
  Normal  Scheduled  29m   default-scheduler        Successfully assigned <namespace>/<pod-that-is-stuck> to docker-desktop
  Normal  Pulling    29m   kubelet, docker-desktop  Pulling image <image-for-the-pod-that-is-stuck>

FYI: this is happening in EKS version 1.27 (containerd) on AWS for a Docker image of size ~1 GB.

Please contact AWS Support!
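
On containerd-based nodes such as EKS 1.27, one way to check whether the pull itself is making progress is to run it by hand from the node with crictl; the image reference below is a placeholder:

$ sudo crictl pull <image-for-the-pod-that-is-stuck>
$ sudo crictl images | grep <image-name>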

Have you tried setting --serialize-image-pulls=false when starting the kubelet?

By default, that flag is true, which means the kubelet pulls images one at a time. I think the pulls are not stuck; they are just waiting for the current one to finish.

Below is the serial image puller; as you can see, one slow pull can block all other pulls on the node: https://github.com/kubernetes/kubernetes/blob/ca0e694d637f9e6feffd7330f66b081769c1c91b/pkg/kubelet/images/puller.go#L60-L95
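
For reference, a hedged sketch of how that setting might be applied; the KubeletConfiguration path /var/lib/kubelet/config.yaml is a kubeadm default and an assumption here, and the standalone flag is deprecated in favor of the config file:

# Option 1: add this to the KubeletConfiguration file (e.g. /var/lib/kubelet/config.yaml):
#   serializeImagePulls: false
# Option 2: pass the (deprecated) flag when starting the kubelet:
#   --serialize-image-pulls=false
# Then restart the kubelet so it picks up the change:
$ sudo systemctl restart kubelet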

Just had to support an outage caused by both pods of a deployment getting stuck in an image pull. The image is nowhere near gigabytes in size and shouldn't take more than ~2-3 minutes to pull.

Same problem on a Kubernetes cluster running on Ubuntu 20.04, installed on VMware Cloud VMs.

  • Kubernetes v1.21.3

Problem solved by restarting kubelet.
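
For anyone reaching for the same workaround, a minimal sketch, assuming the kubelet runs as a systemd unit and using this issue's spinnaker namespace as a placeholder:

# On the affected node:
$ sudo systemctl restart kubelet
# From a machine with kubectl access, watch the stuck pods recover:
$ kubectl get pods -n spinnaker -w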

We also see this issue on a regular basis whenever we switch GKE to preemptible nodes rather than dedicated ones. Unsure whether it is because they are preemptible as such, or just that preemptible nodes restart more often and therefore these kinds of low-probability errors show up more frequently.

Just had this problem affecting 6 nodes of a 7 node cluster. Restarting the kubelet service appears to clear the problem.

Kubernetes v1.18.3

Edit: This correlates with the following error message repeating over and over in the kubelet logs:

Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"

Found issue 56850 related to this message, but it's quite old and specific to Red Hat (we're on Ubuntu Server 18.04), so I don't think it's exactly the same.
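
A quick way to check whether a node is logging the same error, assuming the kubelet runs as a systemd unit:

$ journalctl -u kubelet --since "1 hour ago" | grep "failed to get cgroup stats"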

I see this as well on EKS (image size is 2 GB).

For anyone else wondering, I’ve moved off of GCP and the problem hasn’t happened since then.

Have the same issue. Kubernetes v1.18.3, docker 19.03.9.

I think this is a dockerd bug. Restarting dockerd triggers the pod to pull the image again, and then it works.

And I checked dockerd's thread stack and found an image-pull thread stuck in pullSchema2Layers.

As long as that thread exists, new pods on the node can no longer pull the image.

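For anyone who wants to confirm the same stuck pull, dockerd should dump its goroutine stacks when sent SIGUSR1 (it logs where the dump file is written), and restarting the daemon clears the stuck pull as described above; treat this as a sketch rather than a guaranteed fix:

# Dump dockerd goroutine stacks and look for a pull stuck in pullSchema2Layers:
$ sudo kill -SIGUSR1 $(pidof dockerd)
# Workaround: restart the daemon so the kubelet retries the pull:
$ sudo systemctl restart docker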

I have observed this issue where at least 20 pods were stuck in the same state mentioned in the issue description. I just restarted the kubelet service and all the pods came up. It would be helpful if someone could advise on how to prevent this from happening. 😕

/remove-lifecycle rotten

This issue needs to be taken care of. It is a major problem.