kind: Rare "failed to init node with kubeadm" on Kind v0.7.0

What happened: Occasionally we see errors like https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/20244/integ-telemetry-k8s-tests_istio/7089 where the kind cluster fails to start up. I have seen similar errors before, but it's very rare; anecdotally I see about one a week across thousands of tests. (Meta comment: is there a good way to grep across all test logs in GCS? I cannot find one. I used to download them locally, but I ran out of disk space.)
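
For the meta question, one option might be to stream the logs with gsutil instead of downloading them. This is only a sketch, assuming the artifacts live under gs://istio-prow/ in the usual Prow layout with a build-log.txt per run; the wildcard path is illustrative:

# list candidate build logs and grep each one as a stream,
# so nothing is kept on local disk (illustrative path)
gsutil ls 'gs://istio-prow/pr-logs/pull/istio_istio/*/integ-telemetry-k8s-tests_istio/*/build-log.txt' \
  | while read -r log; do
      gsutil cat "$log" | grep -q 'failed to init node with kubeadm' && echo "$log"
    done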

What you expected to happen: ideally, no errors.

How to reproduce it (as minimally and precisely as possible): I cannot reproduce it.

Anything else we need to know?: I don’t expect too much here. As I see more errors in the future, I’ll add context to help root-cause this. Right now I am not too concerned, since it's extremely uncommon; I am mostly opening this to track it, or in case it helps surface some underlying issue.

Environment:

  • kind version: (use kind version): v0.7.0
  • Kubernetes version: (use kubectl version): v1.17.0
  • Docker version: (use docker info): 19.03.2
  • OS (e.g. from /etc/os-release): linux

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Hi,

We encountered this issue too and found that the sync command in the entrypoint script can sometimes run for a very long time (under high system load?).

https://github.com/kubernetes-sigs/kind/blob/c0a7803bc09961b7a6b84f48fa98fed172812320/images/base/files/usr/local/bin/entrypoint#L28-L31

Maybe we can skip this command when the storage driver is not aufs?
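
A minimal sketch of that kind of guard, assuming the node's root filesystem type shows up in /proc/mounts inside the container (the real entrypoint may need a different detection mechanism):

# only pay for the global flush when the node is actually backed by aufs
if grep -qw aufs /proc/mounts; then
    sync
fi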

Example failure logs:

...
[2020-02-05T06:23:26.578Z] I0205 06:23:24.394769      45 round_trippers.go:438] GET https://172.17.0.3:6443/healthz?timeout=32s  in 0 milliseconds
[2020-02-05T06:23:26.578Z] [kubelet-check] It seems like the kubelet isn't running or healthy.
[2020-02-05T06:23:26.578Z] [kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.
[2020-02-05T06:23:26.578Z] 
[2020-02-05T06:23:26.578Z] Unfortunately, an error has occurred:
[2020-02-05T06:23:26.578Z] 	couldn't initialize a Kubernetes cluster
[2020-02-05T06:23:26.579Z] timed out waiting for the condition
[2020-02-05T06:23:26.579Z] 
[2020-02-05T06:23:26.579Z] This error is likely caused by:
[2020-02-05T06:23:26.579Z] 	- The kubelet is not running
[2020-02-05T06:23:26.579Z] 	- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
[2020-02-05T06:23:26.579Z] 
[2020-02-05T06:23:26.579Z] If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
[2020-02-05T06:23:26.579Z] 	- 'systemctl status kubelet'
[2020-02-05T06:23:26.579Z] 	- 'journalctl -xeu kubelet'
[2020-02-05T06:23:26.579Z] 
[2020-02-05T06:23:26.579Z] Additionally, a control plane component may have crashed or exited when started by the container runtime.
[2020-02-05T06:23:26.579Z] To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
[2020-02-05T06:23:26.579Z] Here is one example how you may list all Kubernetes containers running in docker:
[2020-02-05T06:23:26.579Z] 	- 'docker ps -a | grep kube | grep -v pause'
[2020-02-05T06:23:26.579Z] 	Once you have found the failing container, you can inspect its logs with:
[2020-02-05T06:23:26.579Z] 	- 'docker logs CONTAINERID'
[2020-02-05T06:23:26.579Z]  ✗ Starting control-plane 🕹️
[2020-02-05T06:23:30.751Z] ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged tidb-operator-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

On this machine, sync takes more than 5 minutes to finish:

# time sync

real	5m38.067s
user	0m0.000s
sys	0m57.343s