origin: Timeout when pulling Docker images taking more than 1 minute to extract

Version

$ oc version
oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994

OpenShift/Kubernetes fails to pull images whose layers take more than one minute to extract.

$ oc get events -w
Pod                                                   Normal    Scheduled           {default-scheduler }             Successfully assigned gitlab-ee-1-3jso0 to oonodedev-001
Pod                     spec.containers{gitlab-ee}    Normal    Pulling             {kubelet oonodedev-001}   pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff"
Pod                   Warning   FailedSync   {kubelet oonodedev-001}   Error syncing pod, skipping: failed to "StartContainer" for "gitlab-ee" with ErrImagePull: "net/http: request canceled"
Pod       spec.containers{gitlab-ee}   Warning   Failed    {kubelet oonodedev-001}   Failed to pull image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff": net/http: request canceled

and in the Origin logs:

Feb 24 15:21:45 oonodedev-001 origin-node[20126] kube_docker_client.go:313] Cancel pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff" because of no progress for 1m0s, latest progress: "ac990a380700: Extracting [==================================================>] 288.7 MB/288.7 MB"
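When scanning node logs for this failure, it can help to extract the image, the stall duration, and the stuck layer from the message. Below is a minimal Python sketch that assumes the message format of the log line above. The regular expression and the function name parse_cancelled_pull are illustrative and are not part of Origin or Kubernetes.

```python
import re

# Pattern modelled on the single log line quoted in this issue (an assumption:
# other Origin/Kubernetes versions may word the message differently).
CANCEL_RE = re.compile(
    r'Cancel pulling image "(?P<image>[^"]+)" '
    r'because of no progress for (?P<stalled>[^,\s]+), '
    r'latest progress: "(?P<layer>[0-9a-f]+): (?P<phase>\w+)'
)


def parse_cancelled_pull(line):
    """Return image, stall duration, layer id and phase, or None if no match."""
    match = CANCEL_RE.search(line)
    return match.groupdict() if match else None


line = (
    'Feb 24 15:21:45 oonodedev-001 origin-node[20126] kube_docker_client.go:313] '
    'Cancel pulling image "gitlab/gitlab-ee@sha256:'
    'fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff" '
    'because of no progress for 1m0s, latest progress: '
    '"ac990a380700: Extracting '
    '[==================================================>] 288.7 MB/288.7 MB"'
)

info = parse_cancelled_pull(line)
print(info["layer"], info["phase"], info["stalled"])
```

A phase of Extracting together with a stall equal to the deadline suggests the layer was fully downloaded and the pull was cancelled while the layer was being unpacked.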

The last layer of this particular image (i.e. gitlab/gitlab-ee:8.16.4-ee.0) takes several minutes to extract, so with the default timeout of 1 minute the pull never completes. A normal docker pull works.

The one-minute value appears to come from defaultImagePullingStuckTimeout (ref. https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/kube_docker_client.go#L81), which is hardcoded and can’t be changed. I also see that this was changed in Kubernetes 1.6, where the value appears to be configurable.
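To illustrate why a long extraction trips this kind of timeout, here is a simplified Python model of a "no progress" deadline. It is a sketch, not the kubelet implementation: the function simulate_pull, the assumption that any progress message resets the timer, and all timings in the example are hypothetical.

```python
def simulate_pull(event_times, finish_time, deadline):
    """Model a pull that is cancelled when no progress arrives within `deadline`.

    event_times: sorted times (seconds since pull start) of progress messages.
    finish_time: time at which the pull would complete if left alone.
    Returns ("cancelled", t) or ("completed", t).
    """
    last = 0  # the timer starts when the pull starts
    for t in event_times:
        if t > finish_time:
            break
        if t - last >= deadline:
            # The gap before this message was already too long.
            return ("cancelled", last + deadline)
        last = t  # assumption: any progress message resets the timer
    if finish_time - last >= deadline:
        # Silence between the last message and completion exceeded the deadline.
        return ("cancelled", last + deadline)
    return ("completed", finish_time)


# Hypothetical timings: download progress every 5 s until t=40, a final
# "Extracting" message at t=50, then silence until extraction ends at t=230.
events = list(range(5, 45, 5)) + [50]

one_minute = simulate_pull(events, finish_time=230, deadline=60)
ten_minutes = simulate_pull(events, finish_time=230, deadline=600)
print(one_minute, ten_minutes)
```

Under these made-up timings, the one-minute deadline cancels the pull 60 seconds after the last progress message, while a ten-minute deadline lets the same pull finish.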

Could you suggest a possible workaround for the time being? If not, could we increase the default timeout (to something like 10 minutes) and backport it to Origin 1.4 and Origin 1.5?

About this issue

State: closed
Created 7 years ago
Reactions: 9
Comments: 23 (7 by maintainers)

Commits related to this issue

Merge pull request #4046 from artsy/master Automatic merge from submit-queue. add imagePullProgressDeadline to kubelet config Support the kubelet runtime flag `--image-pull-progress-deadline` by ma... — committed to kubernetes/kops by deleted user 7 years ago

Most upvoted comments

@alikhajeh1 @bbrfkr @rickbliss @yanhongwang

For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the kubeletArguments section of the node-config.yaml on all your nodes.

This is working for us.

AlbertoPeon on Sep 18, 2017

Actually, I am happy to close the issue now that this is configurable in Origin 3.6.

AlbertoPeon on Sep 18, 2017

@xqianwang Yes. We can set the image-pull-progress-deadline parameter in /etc/origin/node/node-config.yaml as follows:

kubeletArguments:
  image-pull-progress-deadline:
  - "10m"

This configuration works fine in my OpenShift Origin environment.

bbrfkr on Dec 1, 2017

Changing the EC2 nodes from t2.medium to m3.large fixed the problem.

alifa20 on Jul 18, 2017

+1, this happens to me too, on 1.5.7 using kops. I am getting ErrImagePull: "net/http: request canceled" when it tries to get the image from AWS ECR. Any ideas, guys?

alifa20 on Jul 18, 2017