kubernetes: static pods timeout on image pulling during cluster bringup
A few snapshots from the kubelet logs on a problematic node, from http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gke-large-cluster/31:
I0601 11:14:10.933470 3467 kube_docker_client.go:252] Pulling image "gcr.io/google_containers/fluentd-gcp:1.18": "31b0cd3cfaa9: Downloading [> ] 524.6 kB/74.95 MB"
I0601 11:14:12.622702 3467 server.go:959] GET /healthz: (28.584µs) 200 [[curl/7.26.0] 127.0.0.1:54346]
...
I0601 11:21:10.933445 3467 kube_docker_client.go:252] Pulling image "gcr.io/google_containers/fluentd-gcp:1.18": "92ec6d044cb3: Downloading [====> ] 6.301 MB/65.68 MB"
I0601 11:21:13.201412 3467 server.go:959] GET /healthz: (52.53µs) 200 [[curl/7.26.0] 127.0.0.1:54487]
(Roughly 1 MB/min)
I0601 11:42:30.933474 3467 kube_docker_client.go:252] Pulling image "gcr.io/google_containers/fluentd-gcp:1.18": "31b0cd3cfaa9: Downloading [===============> ] 22.57 MB/74.95 MB"
I0601 11:42:34.914121 3467 server.go:959] GET /healthz: (42.01µs) 200 [[curl/7.26.0] 127.0.0.1:54840]
I0601 11:42:39.519003 3467 server.go:959] GET /containerLogs/kube-system/fluentd-cloud-logging-gke-gke-large-cluster-default-pool-1-2ac9437d-caep/fluentd-cloud-logging: (289.353µs) 400
goroutine 9990 [running]:
k8s.io/kubernetes/pkg/httplog.(*respLogger).recordStatus(0xc82107b340, 0x190)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:214 +0xa3
k8s.io/kubernetes/pkg/httplog.(*respLogger).WriteHeader(0xc82107b340, 0x190)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:193 +0x2b
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader(0xc82115b4a0, 0x190)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/response.go:200 +0x41
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc82115b4a0, 0x190, 0xc8203681e0, 0x9a, 0x0, 0x0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/response.go:180 +0x11a
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteError(0xc82115b4a0, 0x190, 0x7f8e93825028, 0xc820da6e30, 0x0, 0x0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/response.go:165 +0x8d
k8s.io/kubernetes/pkg/kubelet/server.(*Server).getContainerLogs(0xc8205c9040, 0xc82081a4b0, 0xc82115b4a0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:491 +0x14e9
k8s.io/kubernetes/pkg/kubelet/server.(*Server).(k8s.io/kubernetes/pkg/kubelet/server.getContainerLogs)-fm(0xc82081a4b0, 0xc82115b4a0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:344 +0x34
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc82074ae10, 0x7f8e93862f38, 0xc82107b340, 0xc820e3d340)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/emicklei/go-restful/container.go:272 +0xf30
k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Container).(k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.dispatch)-fm(0x7f8e93862f38, 0xc82107b3 [[Go-http-client/1.1] 10.240.1.110:36279]
...
I0601 11:53:20.933428 3467 kube_docker_client.go:252] Pulling image "gcr.io/google_containers/fluentd-gcp:1.18": "31b0cd3cfaa9: Downloading [=====================> ] 32.55 MB/74.95 MB"
I0601 11:53:22.509327 3467 server.go:959] GET /healthz: (26.907µs) 200 [[Go-http-client/1.1] 127.0.0.1:55018]
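To sanity-check the "roughly 1 MB/min" figure, here is the arithmetic using the two samples for layer 31b0cd3cfaa9 above (sizes and timestamps copied from the log); it comes out to about 0.8 MB/min:

```go
package main

import (
	"fmt"
	"time"
)

// Back-of-envelope check of the pull rate, using the two kubelet log
// samples for layer 31b0cd3cfaa9 above (sizes and timestamps copied
// verbatim from the log excerpt).
func main() {
	t1, _ := time.Parse("15:04:05", "11:14:10") // 524.6 kB downloaded
	t2, _ := time.Parse("15:04:05", "11:42:30") // 22.57 MB downloaded
	downloadedMB := 22.57 - 0.5246
	minutes := t2.Sub(t1).Minutes()
	fmt.Printf("%.2f MB over %.1f min = %.2f MB/min\n",
		downloadedMB, minutes, downloadedMB/minutes)
	// Prints: 22.05 MB over 28.3 min = 0.78 MB/min
}
```

At that rate the remaining ~52 MB of that one layer would take over an hour to download, far longer than any static pod can reasonably be expected to wait at bringup.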
About this issue
- State: closed
- Created 8 years ago
- Comments: 62 (52 by maintainers)
I’m personally not signing up for managing a download server that’s going to get hammered by 1000-2000 nodes simultaneously pulling a 400 MB image, on an n1-standard-32 that’s already pretty close to maxing out.
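To put rough numbers on that concern (node count and image size taken from the comment above; the 10 Gbit/s egress rate below is a hypothetical round number, not a measured figure for an n1-standard-32):

```go
package main

import "fmt"

// Rough sizing of the load a single download server would see if every
// node pulled the same image at cluster bringup. Node count and image
// size come from the comment above; the egress rate is a hypothetical
// round number, not a measured figure.
func main() {
	const nodes = 2000
	const imageMB = 400.0
	totalGB := nodes * imageMB / 1000
	fmt.Printf("total egress: %.0f GB\n", totalGB) // 800 GB from one machine

	const egressGbps = 10.0
	seconds := totalGB * 8 / egressGbps
	fmt.Printf("at %.0f Gbit/s sustained: %.0f s (~%.0f min)\n",
		egressGbps, seconds, seconds/60)
}
```

And that is with the network fully saturated by image serving alone, on a master that is already running the control plane.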
I probably don’t understand what you’re saying. From my point of view, tests are users - ones that actually care about what they get. In this particular case, tests want to have all system pods running, which translates to having a fully working cluster (where by fully I mean “with all the monitoring and stuff that we ship with it running”). I don’t think this is a very obscure/strange requirement - I think this is how it should be.
Running custom (user) containers in the cluster is something slightly different. It should work, and we should do whatever we can to mitigate problems with image pulls, but it’s not as critical as being able to start a cluster at all (with the definition from the previous paragraph). I think we should find some way to ensure that, as long as someone is running an out-of-the-box version of k8s, the cluster will start up successfully.
@yujuhong - I believe we should think about how to solve this problem, not how to solve it in tests. We need a convincing user story for that. Maybe a master should double as a registry for system pods?