cluster-api: issue with provisioning worker cluster (dial tcp 172.17.0.2:6443: i/o timeout)
What steps did you take and what happened: I installed Cluster API using kind, following the steps at this link (it is almost the same as the quickstart tutorial): https://notes.elmiko.dev/2020/04/17/exploring-cluster-api.html
After all the steps, the "controlplane-0" and "worker-0" machines never reach the Running phase.
```
[root@sdw1 v0.3.0]# kubectl --kubeconfig work-cluster.kubeconfig get nodes
NAME                          STATUS   ROLES    AGE     VERSION
work-cluster-controlplane-0   Ready    master   6m29s   v1.17.2

[root@sdw1 v0.3.0]# kubectl --kubeconfig work-cluster.kubeconfig get nodes
NAME                          STATUS   ROLES    AGE   VERSION
work-cluster-controlplane-0   Ready    master   13m   v1.17.2

[root@sdw1 v0.3.0]# kubectl get machines
NAME             PROVIDERID                               PHASE
controlplane-0   docker:////work-cluster-controlplane-0   Provisioning
worker-0                                                  Pending

[root@sdw1 v0.3.0]# kubectl get pods -n capi-system
NAME                                       READY   STATUS    RESTARTS   AGE
capi-controller-manager-6d7b58b889-rw6s4   2/2     Running   0          18m
```
The log from the infrastructure provider shows the messages below.
```
E0506 02:52:56.234289 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to create client for Cluster default/work-cluster: Get https://172.17.0.2:6443/api?timeout=32s: dial tcp 172.17.0.2:6443: i/o timeout" "controller"="machine" "request"={"Namespace":"default","Name":"controlplane-0"}
I0506 02:53:02.615820 1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="work-cluster" "machine"="worker-0" "namespace"="default"
E0506 02:53:02.615883 1 machine_controller.go:226] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Bootstrap provider for Machine \"worker-0\" in namespace \"default\" is not ready, requeuing: requeue in 30s" "cluster"="work-cluster" "machine"="worker-0" "namespace"="default"
I0506 02:53:12.009865 1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="work-cluster" "machine"="worker-0" "namespace"="default"
E0506 02:53:12.009929 1 machine_controller.go:226] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Bootstrap provider for Machine \"worker-0\" in namespace \"default\" is not ready, requeuing: requeue in 30s" "cluster"="work-cluster" "machine"="worker-0" "namespace"="default"
E0506 02:53:32.617682 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to create client for Cluster default/work-cluster: Get https://172.17.0.2:6443/api?timeout=32s: dial tcp 172.17.0.2:6443: i/o timeout" "controller"="machine" "request"={"Namespace":"default","Name":"controlplane-0"}
I0506 02:53:42.014170 1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="work-cluster" "machine"="worker-0" "namespace"="default"
E0506 02:53:42.014228 1 machine_controller.go:226] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Bootstrap provider for Machine \"worker-0\" in namespace \"default\" is not ready, requeuing: requeue in 30s" "cluster"="work-cluster" "machine"="worker-0" "namespace"="default"
I0506 02:54:12.020773 1 machine_controller_noderef.go:53] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="work-cluster" "machine"="worker-0" "namespace"="default"
```
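For context, the failing check behind the "dial tcp ... i/o timeout" error can be reproduced from the host with a plain TCP dial against the workload cluster's API-server endpoint. This is a minimal sketch; `172.17.0.2:6443` is the address from the log above, so substitute your own:

```shell
# Probe the API-server endpoint the CAPD controller is trying to reach.
# 172.17.0.2:6443 comes from the error log; substitute your own endpoint.
status="unreachable"
if timeout 3 bash -c 'exec 3<>/dev/tcp/172.17.0.2/6443' 2>/dev/null; then
  status="reachable"
fi
echo "172.17.0.2:6443 is $status"
```

If this handshake fails from the host as well, the problem is at the Docker network level rather than inside Kubernetes.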
What did you expect to happen: The worker cluster to be provisioned successfully.
Environment:
- Cluster-api version: v0.3.0
- Minikube/KIND version: kind v0.8.0
- Kubernetes version (use `kubectl version`):

```
[root@sdw1 v0.3.0]# kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:18:23Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-30T20:19:45Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
```

- OS (e.g. from /etc/os-release):

```
[root@sdw1 v0.3.0]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)
[root@sdw1 v0.3.0]# docker --version
Docker version 19.03.2, build 6a30dfc
```
/kind bug
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 26 (15 by maintainers)
Commits related to this issue
- Added variable for kind 0.8.0+ due the issue https://github.com/kubernetes-sigs/cluster-api/issues/3013 Signed-off-by: Andrea Spagnolo <aspagnolo@vmware.com> — committed to hazbo/cluster-api by spagno 4 years ago
- Added variable for kind 0.8.0+ due the issue https://github.com/kubernetes-sigs/cluster-api/issues/3013 Signed-off-by: Andrea Spagnolo <aspagnolo@vmware.com> — committed to spectrocloud/cluster-api by spagno 4 years ago
Closing this now that the kind v0.8.x instructions have been merged.
/close
Agreed. I did some experiments with this PR: https://github.com/kubernetes-sigs/kind/pull/1538
```
[root@sdw1 clusterApiTest]# export KIND_EXPERIMENTAL_DOCKER_NETWORK=bridge
[root@sdw1 clusterApiTest]# kind create cluster --name testcluster
Creating cluster "testcluster" ...
WARNING: Overriding docker network due to KIND_EXPERIMENTAL_DOCKER_NETWORK
WARNING: Here be dragons! This is not supported currently.
 ✓ Ensuring node image (kindest/node:v1.18.2) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-testcluster"
You can now use your cluster with:

kubectl cluster-info --context kind-testcluster

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂

[root@sdw1 clusterApiTest]# docker ps
CONTAINER ID   IMAGE                          COMMAND                  CREATED          STATUS          PORTS                                  NAMES
5c2fd5864ce4   kindest/node:v1.18.2           "/usr/local/bin/entr…"   47 seconds ago   Up 45 seconds   127.0.0.1:46534->6443/tcp              testcluster-control-plane
6c13e01f16a7   kindest/node:v1.17.2           "/usr/local/bin/entr…"   46 hours ago     Up 46 hours                                           work-cluster-worker-0
dbee24165041   kindest/node:v1.17.2           "/usr/local/bin/entr…"   46 hours ago     Up 46 hours     45625/tcp, 127.0.0.1:45625->6443/tcp   work-cluster-controlplane-0
14f9d231d4c5   kindest/haproxy:2.1.1-alpine   "/docker-entrypoint.…"   46 hours ago     Up 46 hours     46169/tcp, 0.0.0.0:46169->6443/tcp     work-cluster-lb
7517986056a4   kindest/node:v1.18.2           "/usr/local/bin/entr…"   46 hours ago     Up 46 hours     127.0.0.1:40518->6443/tcp              clusterapi-control-plane

[root@sdw1 clusterApiTest]# docker exec -it 5c2fd5864ce4 /bin/bash
root@testcluster-control-plane:/# …
root@testcluster-control-plane:/# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.6  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:06  txqueuelen 0  (Ethernet)
        RX packets 10490  bytes 15697524 (15.6 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8320  bytes 688158 (688.1 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
It looks like the latest version of kind definitely causes this issue.
Cluster API uses the `docker` command to spawn workload-cluster containers on the default `docker0` bridge, but the control plane created on bare metal by the `kind` command runs on the separate `kind` bridge network. That should be the root cause of the issue. https://github.com/kubernetes-sigs/cluster-api/blob/0b964b734167081a2a994f28983877140c58c50c/test/infrastructure/docker/docker/kind_manager.go#L256-L285

A short-term workaround is to use the `KIND_EXPERIMENTAL_DOCKER_NETWORK` variable to assign the bridge used by the control plane container.
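As a condensed sketch of that workaround (the cluster name `clusterapi` is just an example, and kind itself warns that this network override is experimental and unsupported):

```shell
# Put the kind management cluster on Docker's default "bridge" network so the
# workload-cluster containers created by CAPD (also on docker0) can reach it.
export KIND_EXPERIMENTAL_DOCKER_NETWORK=bridge

# Recreate the management cluster; guarded so the snippet is a no-op on hosts
# without kind installed.
if command -v kind >/dev/null 2>&1; then
  kind create cluster --name clusterapi
fi
echo "KIND_EXPERIMENTAL_DOCKER_NETWORK=$KIND_EXPERIMENTAL_DOCKER_NETWORK"
```

The variable must be exported before `kind create cluster` runs; it has no effect on clusters that already exist.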