kubernetes: Occasional ImagePullBackOff Errors when pulling large docker images
/kind bug
What happened:
Occasionally we run into ImagePullBackOff errors on our alpha GKE cluster (Kubernetes v1.8.5-gke.0) when pulling large Docker images for containers within a pod. A pull sometimes keeps failing for 30+ minutes and then starts working again later. The larger Docker images we occasionally have issues with are around 1-2 GB.
Example events from describing the pod:
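The listing below is the Events section from the standard describe command, with PODNAME as a redacted placeholder:

```sh
kubectl describe pod PODNAME
```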
```
Events:
  Type     Reason                 Age               From                          Message
  ----     ------                 ----              ----                          -------
  Normal   Scheduled              13m               default-scheduler             Successfully assigned PODNAME to gke-cluster-gr--cpu-pool-8fae08dd-jmfq
  Normal   SuccessfulMountVolume  13m               kubelet, gke-cluster-gr-XXXX  MountVolume.SetUp succeeded for volume "shared-logs"
  Normal   SuccessfulMountVolume  13m               kubelet, gke-cluster-gr-XXXX  MountVolume.SetUp succeeded for volume "default-token-v7z2n"
  Normal   Pulling                9m                kubelet, gke-cluster-gr-XXXX  pulling image "ANOTHERSMALLIMAGE"
  Normal   Pulling                6m                kubelet, gke-cluster-gr-XXXX  pulling image "SMALLIMAGE"
  Normal   Started                6m                kubelet, gke-cluster-gr-XXXX  Started container
  Normal   Created                6m                kubelet, gke-cluster-gr-XXXX  Created container
  Normal   Pulled                 6m                kubelet, gke-cluster-gr-XXXX  Successfully pulled image "SMALLIMAGE"
  Normal   Pulled                 6m                kubelet, gke-cluster-gr-XXXX  Successfully pulled image "ANOTHERSMALLIMAGE"
  Normal   Created                6m                kubelet, gke-cluster-gr-XXXX  Created container
  Normal   Started                6m                kubelet, gke-cluster-gr-XXXX  Started container
  Normal   BackOff                4m (x4 over 9m)   kubelet, gke-cluster-gr-XXXX  Back-off pulling image "BIGIMAGE"
  Normal   BackOff                4m (x2 over 4m)   kubelet, gke-cluster-gr-XXXX  Back-off pulling image "BIGIMAGE"
  Warning  FailedSync             4m (x4 over 6m)   kubelet, gke-cluster-gr-XXXX  Error syncing pod
  Normal   Pulling                4m (x3 over 13m)  kubelet, gke-cluster-gr-XXXX  pulling image "BIGIMAGE"
  Warning  Failed                 1m (x3 over 9m)   kubelet, gke-cluster-gr-XXXX  Failed to pull image "BIGIMAGE": rpc error: code = Canceled desc = context canceled
```
I have correctly set up the Docker Hub secrets, and this has worked without problems before.

What you expected to happen:
The container to pull the image from Docker Hub and continue execution.
How to reproduce it (as minimally and precisely as possible):
Create a GKE cluster and create a pod that uses a large Docker image; nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04 is probably a good candidate, as one of our larger images is based on one of the nvidia/cuda images.
You may occasionally experience an ImagePullBackOff.
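A minimal manifest for such a pod, as a sketch (the pod/container names and the sleep command are arbitrary choices, not from the report):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: big-image-test                                # arbitrary name for this repro
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04   # ~GB-scale image from the report
    command: ["sleep", "3600"]                        # keep the container alive after the pull
```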
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): v1.8.5-gke.0
- Cloud provider or hardware configuration: Google Kubernetes Engine, alpha cluster
- Others: Docker Hub, alpha GKE cluster
Commits related to this issue
- Set parallel downloads in Docker to 15: "Kubernetes pulls up to 10 images in parallel. If Docker's parallel downloads setting is less than this, some downloads might stall and cause ImagePullBackOff errors. h…" (committed to itskoko/kubecfn by discordianfish, 6 years ago)
- sys: set image-pull-progress-deadline to 60m: https://github.com/kubernetes/kubernetes/issues/59376 (committed to utilitywarehouse/tf_kube_ignition by george-angel, 6 years ago)
- Try to avoid ImagePull problems: "We had some image pulling problems that failed deployments and according to https://github.com/kubernetes/kubernetes/issues/59376 changing the `imagePullProgressDeadli…" (committed to Gusto/ubuntu-eks-ami by deleted user, 5 years ago)
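For context on the two knobs those commits touch: Docker's parallel layer downloads are set via `max-concurrent-downloads` in `/etc/docker/daemon.json` (the Docker default is 3), and the kubelet's stall timeout is the `--image-pull-progress-deadline` flag (default 1m0s). A sketch of both, using the values the commits chose rather than recommendations:

```json
{
  "max-concurrent-downloads": 15
}
```

```sh
# kubelet flag; appended to the existing kubelet command line or systemd unit
--image-pull-progress-deadline=60m
```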
LOL. +1 to "there is something wrong with the registry, docker or the network." 😃

Did you set the --image-pull-progress-deadline flag on the kubelet?

I am seeing this with images hosted in a private registry. Observed on both registry versions 2.5.2 and 2.6.2, hosted internally on the Kubernetes cluster running on AWS with kops. Kubernetes 1.8.7. Same context canceled error:

Failed to pull image "registry.foo.com/bar:baz": rpc error: code = Canceled desc = context canceled

This is for an image just over 2 GB in size.
Had the same issue on GKE with a public image from Docker Hub. Node and master versions are v1.8.8-gke.0.
@discordianfish According to the blog post here, they had to increase --image-pull-progress-deadline on the kubelet, as they got "rpc error: code = 2 desc = net/http: request canceled" errors when pulling large images.

/reopen

How would one configure this in a cloud provider (e.g. GKE) such that the setting is preserved across master upgrades and applied to the nodes automatically as the node pools are scaled up?
I am seeing this same issue when using gcr.io (Google Container Registry). What is interesting, however, is that doing a `docker pull` works without issue every time.

Just want to follow up and mention that this issue is sometimes observed for images hosted on GCR as well.
Saw this issue on AKS, pulling from GCR.

EDIT: This was fixed by increasing image-pull-progress-deadline, following the workaround here: https://github.com/Azure/AKS/issues/245#issuecomment-379779370

I'm trying to get to the root of this.
This will abort the pull if there hasn't been progress within the deadline: https://github.com/kubernetes/kubernetes/blob/915798d229b7be076d8e53d6aa1573adabd470d2/pkg/kubelet/dockershim/libdocker/kube_docker_client.go#L374
On the Docker side, the progress gets posted by the ProgressReader: https://github.com/moby/moby/blob/53683bd8326b988977650337ee43b281d2830076/distribution/pull_v2.go#L234
which is supposed to send a progress message at least every 512 KB: https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/progress/progressreader.go#L34
So unless there is a bug I missed, the pulls here fail to download 512 KB within the default 60s deadline, which means something is wrong with the registry, Docker, or the network.
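To make that mechanism concrete, here is a minimal, self-contained Go sketch (not the actual kubelet code; the names are mine) of the deadline logic: a watchdog records each progress message and cancels the pull context when none has arrived within the deadline, which is what then surfaces as `context canceled`:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// watchdog tracks when the last pull-progress message was seen.
type watchdog struct {
	mu       sync.Mutex
	lastSeen time.Time
}

func (w *watchdog) record() {
	w.mu.Lock()
	w.lastSeen = time.Now()
	w.mu.Unlock()
}

func (w *watchdog) sinceLast() time.Duration {
	w.mu.Lock()
	defer w.mu.Unlock()
	return time.Since(w.lastSeen)
}

func main() {
	// Stands in for --image-pull-progress-deadline (kubelet default: 1m0s);
	// shortened here so the demo finishes quickly.
	deadline := 2 * time.Second

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	w := &watchdog{lastSeen: time.Now()}

	// Simulated pull: Docker posts a progress message at least every 512 KB
	// downloaded; here the messages stop after ~1s, as if the download stalled.
	go func() {
		for i := 0; i < 5; i++ {
			time.Sleep(200 * time.Millisecond)
			w.record()
		}
	}()

	// Watchdog loop: cancel the pull context when no progress has been
	// reported within the deadline; the caller then sees "context canceled".
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if w.sinceLast() > deadline {
				fmt.Printf("Cancel pulling image because of no progress for %v\n", deadline)
				cancel()
				return
			}
		}
	}
}
```

In the linked kubelet code the same pattern runs against Docker's pull response stream, with the deadline taken from the flag.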
Experiencing the same issue with a GitLab Omnibus registry on 1.8.8-gke.0, but the image size is only 74.04 MiB. It helps to delete the whole namespace and re-deploy with GitLab again. Occasionally after resizing my cluster nodes I get a backoff when workloads are rebalanced to different nodes, for example.

@bryanlarsen did you see the tips I referred to earlier? (in https://blog.openai.com/scaling-kubernetes-to-2500-nodes/#dockerimagepulls)
I've got a machine in my cluster that reliably spews a message like: kubelet[2589]: E1002 12:13:46.064879 2589 kube_docker_client.go:341] Cancel pulling image "us.gcr.io/…:4099cd1e356386df36c122fbfff51243674d6433" because of no progress for 1m0s, latest progress: "8bc388a983a5: Download complete"
I'm seeing the same 'Back-off pulling image "FOO": rpc error: code = Canceled desc = context canceled'. Using a Kube 1.10 cluster and pulling a large image. Using --image-pull-progress-deadline=60m on the kubelet bypassed the issue, per @woopstar.

I am seeing this with images hosted on ECS.
@bvandewalle thanks for mentioning you are seeing the same issue; out of curiosity, are you also hosting your images on Docker Hub? I'm wondering if moving to Google Container Registry will help with these issues.