kind: multi-node: Kubernetes cluster does not start after Docker re-assigns the nodes' IP addresses on Docker restart

A kind Kubernetes cluster does not survive a Docker restart. Docker appears to assign new IPs to the containers on each start-up, while the kind nodes keep the original IP addresses in their generated configuration files, leaving the Kubernetes components unable to talk to each other. The scheduler and controller-manager are affected the most.
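For illustration, a rough way to observe the re-assignment (container names assume the default cluster name kind and the three-worker config from the reproduction steps below):

# IPs Docker currently assigns to the kind node containers
docker inspect -f '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \
  kind-control-plane kind-worker kind-worker2 kind-worker3

# Restart Docker, then run the same command again: the addresses typically
# change, while the addresses recorded inside the nodes' configuration
# files (/kind, /etc/kubernetes) stay as they were at creation time.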

What happened:

Kubernetes starts in a broken state even though kubectl get pods -A reports otherwise (everything 1/1). The cluster is unable to start previously deployed pods (if they were deployed before the restart) and is unable to schedule anything new, because the scheduler is not connected to the apiserver.

What you expected to happen:

Kubernetes cluster continues working as expected even after Docker restart.

How to reproduce it (as minimally and precisely as possible):

  1. Install a kind cluster by issuing:
cat <<EOF | kind create cluster --name kind --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
  2. Restart Docker
  3. Deploy anything, e.g.: kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
  4. Check that the dnsutils pod is stuck in the Pending state (see the commands below)
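A few commands that can help confirm the symptom after the restart (pod names assume the control-plane node is called kind-control-plane):

kubectl get pods -A                # everything still reports 1/1 Running
kubectl get pod dnsutils -o wide   # stays Pending, no node assigned

# Scheduler and controller-manager logs show connection errors against the
# apiserver address that was recorded before the restart
kubectl -n kube-system logs kube-scheduler-kind-control-plane --tail=20
kubectl -n kube-system logs kube-controller-manager-kind-control-plane --tail=20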

Anything else we need to know?:

  1. Log files: kind-cluster-logs.tar.gz
  2. I tried to change the IP addresses in the /kind and /etc/kubernetes files, but then the services start complaining that the certificate was not issued for the new IP address (see the check below). Changing the IP addresses each time the cluster starts is therefore not a solution.
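A rough way to confirm the certificate mismatch (paths assume the standard kubeadm layout inside the node container, and that openssl is available in the node image; otherwise copy the file out with docker cp first):

# The apiserver serving certificate carries the original node IP in its SANs
docker exec kind-control-plane \
  openssl x509 -noout -text -in /etc/kubernetes/pki/apiserver.crt | \
  grep -A1 'Subject Alternative Name'

# Compare against the address Docker assigned after the restart
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' kind-control-plane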

Environment:

  • kind version: (use kind version):
kind v0.10.0 go1.15.7 darwin/amd64
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
Client:
 Version:           20.10.0
 API version:       1.41
 Go version:        go1.15.6
 Git commit:        03fa4b8
 Built:             Sat Dec 12 20:00:39 2020
 OS/Arch:           darwin/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:15:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • OS (e.g. from /etc/os-release): macOS Catalina 10.15.7 Intel


Most upvoted comments

Of course, just imagine

I start a course that lasts 2 days (when it could well be 5),

  • I created the cluster
  • I created pods, replica sets, deployments, services, etc., each with its own YAML configuration file
  • I uploaded custom images to the cluster
  • I got the cluster into a certain state … and yes, the day is finished and I turn off the laptop

So the next morning I have to recreate all of yesterday’s work to continue with today’s course, just because on reboot I lost connectivity to the cluster and can’t get back online. Even just having a workaround available would be fine. But for now, to teach I first have to explain VirtualBox (or something else) in order to use minikube. And yes, deleting everything with kind and rebuilding it again is good practice, but not for long courses where I often explain a feature across several days.

I think kind is almost perfect for teaching (and learning), but this issue continues to be a little headache

Best Regards

I hope I understand it better now. Let me summarize and please correct me if I’m wrong:

  1. kind installs the cluster with all its initialization and writes IP addresses into the /kind/ and /etc/kubernetes/ directories.
  2. Each time the cluster starts, the entrypoint tries to manage the IP addresses somehow.

What I’ve found is that the second step does something silly. Originally I made myself a shell script which re-configures the IP addresses to match the current state, but that fails as well, because all the security certificates generated in step 1 are based on the IP addresses in their CNs. This leads to a situation where the services are finally able to contact each other but reject the certificates, and we are back to square one.
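For context, the rewrite that such a script attempts looks roughly like this (a sketch only; OLD_IP/NEW_IP are hypothetical values, and it only gets the components talking again, the certificates are still rejected):

OLD_IP=172.18.0.2   # address recorded at cluster creation (hypothetical)
NEW_IP=172.18.0.5   # address Docker assigned after the restart (hypothetical)

# Rewrite every occurrence of the old address in the node's configuration
docker exec kind-control-plane sh -c \
  "grep -rl $OLD_IP /etc/kubernetes /kind | xargs sed -i s/$OLD_IP/$NEW_IP/g"

# ...but the certificates generated at creation time still embed OLD_IP,
# so the components reject each other's connections anyway.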

I see only two ways to solve this issue permanently:

  1. Use static IP addresses (which is problematic to do with Docker; see the example below), or
  2. use host names everywhere from the very beginning.
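For reference, plain Docker does support fixed addresses on a user-defined network, which is roughly what option 1 would require kind to do for its node containers (a generic Docker example, not something kind exposes):

# Create a network with a fixed subnet and pin a container's address;
# the container should keep this address across restarts.
docker network create --subnet 172.30.0.0/16 fixed-net
docker run -d --name pinned --network fixed-net --ip 172.30.0.10 nginx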

I hope I got the idea.

@boldandbusted Thanks for sharing your repo, but the idea is to avoid installing any VMs or spending time explaining other tools. The target audience includes many developers, architects, and sometimes decision makers, so I want to focus on Kubernetes and its benefits and not add noise from other tools or setups. For now I have a multi control-plane node config to use towards the end of the course, but it loses the charm of working hands-on from the beginning and discovering it for yourself.

Hi,

thanks for disabling the bot!

the use case is simple: you work on a project and need a stable multi-node env, but you have to rebuild the cluster each day or after each reboot.

The goal of kind is to simulate an env, but if one needs to rebuild it each time the host reboots, it is clearly a deal breaker for kind.

I stopped using kind and built a real cluster until the issue is solved.

thanks. regards.

(Thanks @tnqn !)

This should be fixed for most multi-node clusters in the latest sources at HEAD, and in the forthcoming v0.15.0 (TBD, we’ll want to wrap up some other things and make sure this is working widely before cutting a release).

#1689 remains for tracking clusters with multiple control-plane nodes (“HA”) which we haven’t dug into yet.