kubernetes: Occasional ImagePullBackOff Errors when pulling large docker images

/kind bug

What happened:

Occasionally we’ll run into ImagePullBackOff errors on our alpha GKE cluster on kube v1.8.5-gke.0 when pulling large Docker images for containers within a pod. The pull sometimes won’t work for 30+ minutes and then starts working again later. The larger Docker images we occasionally have issues with are around 1–2 GB.

Example events from describing the pod:

Events:
  Type     Reason                 Age               From                                                          Message
  ----     ------                 ----              ----                                                          -------
  Normal   Scheduled              13m               default-scheduler                                             Successfully assigned PODNAME to gke-cluster-gr--cpu-pool-8fae08dd-jmfq
  Normal   SuccessfulMountVolume  13m               kubelet, gke-cluster-gr-XXXX  MountVolume.SetUp succeeded for volume "shared-logs"
  Normal   SuccessfulMountVolume  13m               kubelet, gke-cluster-gr-XXXX  MountVolume.SetUp succeeded for volume "default-token-v7z2n"
  Normal   Pulling                9m                kubelet, gke-cluster-gr-XXXX  pulling image "ANOTHERSMALLIMAGE"
  Normal   Pulling                6m                kubelet, gke-cluster-gr-XXXX  pulling image "SMALLIMAGE"
  Normal   Started                6m                kubelet, gke-cluster-gr-XXXX  Started container
  Normal   Created                6m                kubelet, gke-cluster-gr-XXXX  Created container
  Normal   Pulled                 6m                kubelet, gke-cluster-gr-XXXX  Successfully pulled image "SMALLIMAGE"
  Normal   Pulled                 6m                kubelet, gke-cluster-gr-XXXX  Successfully pulled image "ANOTHERSMALLIMAGE"
  Normal   Created                6m                kubelet, gke-cluster-gr-XXXX  Created container
  Normal   Started                6m                kubelet, gke-cluster-gr-XXXX  Started container
  Normal   BackOff                4m (x4 over 9m)   kubelet, gke-cluster-gr-XXXX  Back-off pulling image "BIGIMAGE"
  Normal   BackOff                4m (x2 over 4m)   kubelet, gke-cluster-gr-XXXX  Back-off pulling image "BIGIMAGE"
  Warning  FailedSync             4m (x4 over 6m)   kubelet, gke-cluster-gr-XXXX  Error syncing pod
  Normal   Pulling                4m (x3 over 13m)  kubelet, gke-cluster-gr-XXXX  pulling image "BIGIMAGE"
  Warning  Failed                 1m (x3 over 9m)   kubelet, gke-cluster-gr-XXXX  Failed to pull image "BIGIMAGE": rpc error: code = Canceled desc = context canceled

I have correctly set up the Docker Hub secrets, and pulling has worked without problems before.

What you expected to happen:

The container image to be pulled from Docker Hub and the pod to continue execution.

How to reproduce it (as minimally and precisely as possible):

Create a GKE cluster and create a pod that uses a large Docker image. nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04 is probably a good candidate, as one of our larger images is based on one of the nvidia/cuda images.
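For example (just a minimal sketch; the pod name and the sleep command are arbitrary placeholders):

  kubectl run big-image-test --image=nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04 --restart=Never -- sleep 3600
  kubectl describe pod big-image-test    # watch the Events section for "Back-off pulling image"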

You may occasionally experience an ImagePullBackOff.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.8.5-gke.0
  • Cloud provider or hardware configuration: Google Kubernetes Engine, alpha cluster
  • Others: dockerhub, alpha gke cluster

Most upvoted comments

LOL. +1 to there is something wrong with the registry, docker or the network. 😃

Did you set the --image-pull-progress-deadline on the kubelet?
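A quick way to check what flags a node’s kubelet is currently running with (assuming you can SSH to the node):

  ps -ef | grep [k]ubelet | grep -o 'image-pull-progress-deadline=[^ ]*'   # prints nothing if the flag is unset (default 1m0s)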

I am seeing this with images hosted in a private registry.

Observed on both registry versions 2.5.2 and 2.6.2, hosted internally on the Kubernetes cluster running on AWS with kops.

Kubernetes 1.8.7.

Same context canceled error: Failed to pull image "registry.foo.com/bar:baz": rpc error: code = Canceled desc = context canceled

This is for an image just over 2GB in size.

Had the same issue on GKE with a public image from the docker hub.

Node and master versions are v1.8.8-gke.0

@discordianfish According to the blog post here, they had to increase the --image-pull-progress-deadline on the kubelet, as they got rpc error: code = 2 desc = net/http: request canceled errors when pulling large images.

/reopen

How would one configure this in a cloud provider (e.g. GKE) such that the setting persists across master upgrades and applies to the nodes automatically as the node pools are scaled up?
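For reference, on self-managed nodes (e.g. the kops clusters mentioned above) raising the deadline looks roughly like this. This is only a sketch: the drop-in path and the KUBELET_EXTRA_ARGS variable follow the kubeadm convention and vary by installer, and managed node pools such as GKE’s don’t expose the kubelet flags directly, which is exactly the question above.

  # /etc/systemd/system/kubelet.service.d/20-image-pull.conf
  [Service]
  Environment="KUBELET_EXTRA_ARGS=--image-pull-progress-deadline=30m"

  # reload and restart the kubelet so the new flag takes effect
  systemctl daemon-reload
  systemctl restart kubelet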

I am seeing this same issue when using gcr.io (Google Container Registry). What is interesting, however, is that doing a ‘docker pull’ works without issue every time.

Just want to follow-up to mention this issue is sometimes being observed for images hosted on GCR as well.

Saw this issue on AKS, pulling from GCR.

EDIT: This was fixed by increasing image-pull-progress-deadline following the workaround here https://github.com/Azure/AKS/issues/245#issuecomment-379779370

I’m trying to get to the root of this.

This will abort the pull if there hasn’t been progress for the deadline: https://github.com/kubernetes/kubernetes/blob/915798d229b7be076d8e53d6aa1573adabd470d2/pkg/kubelet/dockershim/libdocker/kube_docker_client.go#L374

On the Docker side, the progress gets posted by the ProgressReader: https://github.com/moby/moby/blob/53683bd8326b988977650337ee43b281d2830076/distribution/pull_v2.go#L234

Which is supposed to send a progress message at least every 512 KB: https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/progress/progressreader.go#L34

So unless there is a bug I missed, the pulls here fail to download 512 KB within the default 60-second deadline, so there is something wrong with the registry, Docker, or the network.
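One way to narrow that down on an affected node (this assumes a systemd/journald host and the Docker runtime; the image is just the example from the original report):

  # look for the kubelet cancelling pulls due to lack of progress
  journalctl -u kubelet --no-pager | grep "no progress"

  # a manual pull is not subject to the kubelet deadline, so if this succeeds
  # while the pod keeps backing off, the 1m default deadline is the likely culprit
  time docker pull nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04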

Experiencing the same issue with the GitLab Omnibus registry on 1.8.8-gke.0, but the image size is only 74.04 MiB. It helps to delete the whole namespace and re-deploy with GitLab again. Occasionally, after resizing my cluster nodes, I get a back-off when pods are rebalanced to different nodes, for example.

I’ve got a machine in my cluster that reliably spews a message like: kubelet[2589]: E1002 12:13:46.064879 2589 kube_docker_client.go:341] Cancel pulling image "us.gcr.io/…:4099cd1e356386df36c122fbfff51243674d6433" because of no progress for 1m0s, latest progress: "8bc388a983a5: Download complete"

I’m seeing the same ‘Back-off pulling image “FOO”: rpc error: code = Canceled desc = context canceled’ on a Kube 1.10 cluster when pulling a large image. Using --image-pull-progress-deadline=60m on the kubelet bypassed the issue, per @woopstar.

I am seeing this with images hosted on ECS

@bvandewalle thanks for mentioning you are seeing the same issue. Out of curiosity, are you also hosting your images on Docker Hub? I’m wondering if moving to Google Container Registry will help with these issues.