kubernetes: Occasional ImagePullBackOff Errors when pulling large docker images

/kind bug

What happened:

Occasionally we’ll run into ImagePullBackOff errors on our alpha GKE cluster on kube v1.8.5-gke.0 when pulling large Docker images for containers within a pod. The pull sometimes won’t work for 30+ minutes and then starts working again later. The larger Docker images we occasionally have issues with are around 1–2 GB.

Example events from describing the pod:

Events:
  Type     Reason                 Age               From                                                          Message
  ----     ------                 ----              ----                                                          -------
  Normal   Scheduled              13m               default-scheduler                                             Successfully assigned PODNAME to gke-cluster-gr--cpu-pool-8fae08dd-jmfq
  Normal   SuccessfulMountVolume  13m               kubelet, gke-cluster-gr-XXXX  MountVolume.SetUp succeeded for volume "shared-logs"
  Normal   SuccessfulMountVolume  13m               kubelet, gke-cluster-gr-XXXX  MountVolume.SetUp succeeded for volume "default-token-v7z2n"
  Normal   Pulling                9m                kubelet, gke-cluster-gr-XXXX  pulling image "ANOTHERSMALLIMAGE"
  Normal   Pulling                6m                kubelet, gke-cluster-gr-XXXX  pulling image "SMALLIMAGE"
  Normal   Started                6m                kubelet, gke-cluster-gr-XXXX  Started container
  Normal   Created                6m                kubelet, gke-cluster-gr-XXXX  Created container
  Normal   Pulled                 6m                kubelet, gke-cluster-gr-XXXX  Successfully pulled image "SMALLIMAGE"
  Normal   Pulled                 6m                kubelet, gke-cluster-gr-XXXX  Successfully pulled image "ANOTHERSMALLIMAGE"
  Normal   Created                6m                kubelet, gke-cluster-gr-XXXX  Created container
  Normal   Started                6m                kubelet, gke-cluster-gr-XXXX  Started container
  Normal   BackOff                4m (x4 over 9m)   kubelet, gke-cluster-gr-XXXX  Back-off pulling image "BIGIMAGE"
  Normal   BackOff                4m (x2 over 4m)   kubelet, gke-cluster-gr-XXXX  Back-off pulling image "BIGIMAGE"
  Warning  FailedSync             4m (x4 over 6m)   kubelet, gke-cluster-gr-XXXX  Error syncing pod
  Normal   Pulling                4m (x3 over 13m)  kubelet, gke-cluster-gr-XXXX  pulling image "BIGIMAGE"
  Warning  Failed                 1m (x3 over 9m)   kubelet, gke-cluster-gr-XXXX  Failed to pull image "BIGIMAGE": rpc error: code = Canceled desc = context canceled

I have correctly set up the Docker Hub secrets, and pulling has worked without problems before.

What you expected to happen:

The container image to be pulled from Docker Hub and the pod to continue execution.

How to reproduce it (as minimally and precisely as possible):

Create a GKE cluster and create a pod that uses a large Docker image. nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04 is probably a good candidate, as one of our larger images is based on one of the nvidia/cuda images.
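For example (just a minimal sketch; the pod name and the sleep command are arbitrary placeholders):

  kubectl run big-image-test --image=nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04 --restart=Never -- sleep 3600
  kubectl describe pod big-image-test    # watch the Events section for "Back-off pulling image"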

You may occasionally experience an ImagePullBackOff.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.8.5-gke.0
  • Cloud provider or hardware configuration: Google Kubernetes Engine, alpha cluster
  • Others: dockerhub, alpha gke cluster

Most upvoted comments

LOL. +1 to there is something wrong with the registry, docker or the network. 😃

Did you set the --image-pull-progress-deadline on the kubelet?
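A quick way to check what flags a node’s kubelet is currently running with (assuming you can SSH to the node):

  ps -ef | grep [k]ubelet | grep -o 'image-pull-progress-deadline=[^ ]*'   # prints nothing if the flag is unset (default 1m0s)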

I am seeing this with images hosted in a private registry.

Observed on both registry versions 2.5.2 and 2.6.2, hosted internally on the Kubernetes cluster running on AWS with kops.

Kubernetes 1.8.7.

Same context canceled error: Failed to pull image "registry.foo.com/bar:baz": rpc error: code = Canceled desc = context canceled

This is for an image just over 2GB in size.

Had the same issue on GKE with a public image from the docker hub.

Node and master versions are v1.8.8-gke.0

@discordianfish According to the blog post here, they had to increase the --image-pull-progress-deadline on the kubelet, as they got rpc error: code = 2 desc = net/http: request canceled errors when pulling large images.

/reopen

How would one configure this in a cloud provider (e.g. GKE) such that the setting persists across master upgrades and applies to the nodes automatically as the node pools are scaled up?
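For reference, on self-managed nodes (e.g. the kops clusters mentioned above) raising the deadline looks roughly like this. This is only a sketch: the drop-in path and the KUBELET_EXTRA_ARGS variable follow the kubeadm convention and vary by installer, and managed node pools such as GKE’s don’t expose the kubelet flags directly, which is exactly the question above.

  # /etc/systemd/system/kubelet.service.d/20-image-pull.conf
  [Service]
  Environment="KUBELET_EXTRA_ARGS=--image-pull-progress-deadline=30m"

  # reload and restart the kubelet so the new flag takes effect
  systemctl daemon-reload
  systemctl restart kubelet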

I am seeing this same issue when using gcr.io (Google Container Registry). What is interesting, however, is that doing a ‘docker pull’ works without issue every time.

Just want to follow-up to mention this issue is sometimes being observed for images hosted on GCR as well.

Saw this issue on AKS, pulling from GCR.

EDIT: This was fixed by increasing image-pull-progress-deadline following the workaround here https://github.com/Azure/AKS/issues/245#issuecomment-379779370

I’m trying to get to the root of this.

This will abort the pull if there hasn’t been progress for the deadline: https://github.com/kubernetes/kubernetes/blob/915798d229b7be076d8e53d6aa1573adabd470d2/pkg/kubelet/dockershim/libdocker/kube_docker_client.go#L374

On the Docker side, the progress gets posted by the ProgressReader: https://github.com/moby/moby/blob/53683bd8326b988977650337ee43b281d2830076/distribution/pull_v2.go#L234

Which is supposed to send a progress message at least every 512 KB: https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/progress/progressreader.go#L34

So unless there is a bug I missed, the pulls here fail to download 512 KB within the default 60-second deadline, so there is something wrong with the registry, Docker, or the network.
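One way to narrow that down on an affected node (this assumes a systemd/journald host and the Docker runtime; the image is just the example from the original report):

  # look for the kubelet cancelling pulls due to lack of progress
  journalctl -u kubelet --no-pager | grep "no progress"

  # a manual pull is not subject to the kubelet deadline, so if this succeeds
  # while the pod keeps backing off, the 1m default deadline is the likely culprit
  time docker pull nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04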

Experiencing the same issue with the GitLab Omnibus registry on 1.8.8-gke.0, but the image size is only 74.04 MiB. It helps to delete the whole namespace and re-deploy with GitLab again. Occasionally, after resizing my cluster nodes, I get a back-off when pods are rebalanced to different nodes, for example.

I’ve got a machine in my cluster that reliably spews a message like: kubelet[2589]: E1002 12:13:46.064879 2589 kube_docker_client.go:341] Cancel pulling image "us.gcr.io/…:4099cd1e356386df36c122fbfff51243674d6433" because of no progress for 1m0s, latest progress: "8bc388a983a5: Download complete"

I’m seeing the same ‘Back-off pulling image “FOO”: rpc error: code = Canceled desc = context canceled’ on a Kube 1.10 cluster when pulling a large image. Using --image-pull-progress-deadline=60m on the kubelet bypassed the issue, per @woopstar.

I am seeing this with images hosted on ECS

@bvandewalle thanks for mentioning you are seeing the same issue. Out of curiosity, are you also hosting your images on Docker Hub? I’m wondering if moving to Google Container Registry will help with these issues.