origin: Timeout when pulling Docker images taking more than 1 minute to extract

Version

$ oc version
oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994

OpenShift/Kubernetes fails to pull images whose layers take more than one minute to extract.

$ oc get events -w
Pod                                                   Normal    Scheduled           {default-scheduler }             Successfully assigned gitlab-ee-1-3jso0 to oonodedev-001
Pod                     spec.containers{gitlab-ee}    Normal    Pulling             {kubelet oonodedev-001}   pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff"
Pod                   Warning   FailedSync   {kubelet oonodedev-001}   Error syncing pod, skipping: failed to "StartContainer" for "gitlab-ee" with ErrImagePull: "net/http: request canceled"
Pod       spec.containers{gitlab-ee}   Warning   Failed    {kubelet oonodedev-001}   Failed to pull image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff": net/http: request canceled

and in the Origin logs:

Feb 24 15:21:45 oonodedev-001 origin-node[20126] kube_docker_client.go:313] Cancel pulling image "gitlab/gitlab-ee@sha256:fa58a6765b5431f716ba82f5002a81041224e7430ef2c29b7fdea993a4a96aff" because of no progress for 1m0s, latest progress: "ac990a380700: Extracting [==================================================>] 288.7 MB/288.7 MB"

The last layer of this particular image (i.e. gitlab/gitlab-ee:8.16.4-ee.0) takes several minutes to extract, so with the default one-minute timeout the pull never completes. A plain docker pull of the same image works.

The one-minute value seems to come from defaultImagePullingStuckTimeout (ref. https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/kube_docker_client.go#L81), which is hardcoded and can’t be changed. It also looks like this was changed in Kubernetes 1.6, where the value appears to be configurable.
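To make the failure mode concrete, here is a minimal sketch of a "no progress" watchdog of the kind the log message describes. It is a simplified illustration in Python, not the kubelet's actual Go code: the names PullWatchdog and simulate_pull, the check interval, and the timestamps are all assumptions made for this example. It models a pull whose last progress message arrives when extraction starts, followed by several silent minutes.

```python
from dataclasses import dataclass


@dataclass
class PullWatchdog:
    """Tracks the latest progress message and when it was received (hypothetical helper)."""
    deadline_s: float
    last_message: str = ""
    last_update_s: float = 0.0

    def report(self, message: str, now_s: float) -> None:
        # Every new progress message resets the "no progress" timer.
        self.last_message = message
        self.last_update_s = now_s

    def is_stuck(self, now_s: float) -> bool:
        # The pull counts as stuck once the silence exceeds the deadline.
        return now_s - self.last_update_s > self.deadline_s


def simulate_pull(events, finish_s, deadline_s, tick_s=10.0):
    """Replay (timestamp_s, message) progress events against a simulated clock.

    finish_s is the time at which the pull would complete if it were not canceled.
    """
    watchdog = PullWatchdog(deadline_s)
    pending = sorted(events)
    now = 0.0
    while True:
        now += tick_s
        # Deliver every progress message that arrived up to this tick.
        while pending and pending[0][0] <= now:
            ts, msg = pending.pop(0)
            watchdog.report(msg, ts)
        if finish_s <= now:
            return "pulled"
        if watchdog.is_stuck(now):
            return (f"canceled: no progress for {deadline_s:.0f}s, "
                    f"latest progress: {watchdog.last_message!r}")


# Illustrative timeline: the last message arrives at 30s, extraction ends at 240s.
events = [(5.0, "Downloading"), (30.0, "Extracting 288.7 MB/288.7 MB")]
print(simulate_pull(events, finish_s=240.0, deadline_s=60.0))
print(simulate_pull(events, finish_s=240.0, deadline_s=600.0))
```

Under these assumed timings, the 60-second deadline cancels the pull during the silent extraction phase, while a 10-minute deadline lets it finish.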

Could you suggest a possible workaround for the time being? If not, could we increase the default timeout (to something like 10 minutes) and backport it to Origin 1.4 and Origin 1.5?

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 9
  • Comments: 23 (7 by maintainers)

Most upvoted comments

@alikhajeh1 @bbrfkr @rickbliss @yanhongwang

For Origin 3.6 you can set image-pull-progress-deadline to a meaningful value (e.g. 10m) in the kubeletArguments section of node-config.yaml on all your nodes.

This is working for us.

Actually, I am happy to close the issue now that this is configurable in Origin 3.6.

@xqianwang Yes. We can set the image-pull-progress-deadline parameter in /etc/origin/node/node-config.yaml as follows:

kubeletArguments:
  image-pull-progress-deadline:
  - "10m"

This configuration works fine in my OpenShift Origin environment.
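Since the setting has to be present on every node, a quick check of each node's config file can help. The following is a rough sketch only: find_pull_deadline is a hypothetical helper, and it uses a simple line scan that assumes the exact layout shown above rather than a real YAML parser.

```python
import re


def find_pull_deadline(config_text):
    """Return the configured image-pull-progress-deadline value, or None if absent.

    Assumes the key is followed by a one-item list, as in the snippet above.
    """
    lines = config_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "image-pull-progress-deadline:":
            for item in lines[i + 1:]:
                match = re.match(r'\s*-\s*"?([^"\s]+)"?\s*$', item)
                if match:
                    return match.group(1)
                if item.strip():
                    # A non-blank line that is not a list item ends the search.
                    break
            return None
    return None


sample = 'kubeletArguments:\n  image-pull-progress-deadline:\n  - "10m"\n'
print(find_pull_deadline(sample))
```

In practice the text would come from reading /etc/origin/node/node-config.yaml on each node; a None result means the node is still using the default deadline.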

Changing the EC2 nodes from t2.medium to m3.large fixed the problem.

+1, this happens to me too, on 1.5.7 using kops. I am getting ErrImagePull: "net/http: request canceled" when it tries to pull the image from AWS ECR. Any ideas, guys?