kind: Cluster doesn't restart when docker restarts

When docker restarts or stops/starts (for any reason), the kind node containers remain stopped and aren’t restarted properly. When I tried to run `docker restart <node container id>`, the cluster didn’t start either.

The only solution at this point seems to be to recreate the cluster.

/kind bug

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 84
  • Comments: 97 (58 by maintainers)

Most upvoted comments

👍 for the new restart cluster command!

I’ve been using kind locally (using Docker for Mac) and when docker reboots or stops, the cluster has to be deleted and recreated. I’m perfectly fine with it, just thought this might be something we should look into.

The use case was to keep the cluster around even after I reboot or shut down my machine / docker.

for the impatient, this seems to work for now after docker restarts:

docker start kind-1-control-plane && docker exec kind-1-control-plane sh -c 'mount -o remount,ro /sys; kill -USR1 1'

FixMounts also runs a few `mount --make-shared` commands; not sure if they are really required.

v0.8.0 will ship after follow-up for this; I’m re-targeting for Monday, ideally.

it’s coming! the next PR is out 😃

Hi, I am working on this, but I’ve had to spend the past week on call for the Kubernetes test-infra and handling a few high-impact Kubernetes testing bugs: #1248, #1331

Please use github’s native +1 mechanism to +1 so we can use the issue for discussion of the solution:

(screenshot: GitHub’s reaction picker)

What is the use case for this?

+1 to this question.

docker restart in this case acts like a power-grid restart on a bunch of bare-metal machines. So while those bare-metal machines might come back up, I’m not sure we want to support this for kind. For that to work, I think some sort of state has to be stored somewhere…

FTR: The latest releases should have clusters that come back up on docker restart, always, including multi node.

I will send a PR next week.

/lifecycle active thanks @tao12345666333

the “last” PR is now out. it needs some more cleanup and more validation, but the basic implementation is more or less good enough now and in an open PR.

this will be ready before we ship kind v0.8.0

The restart cluster command will make kind the top of its class. Without it, it’s a painful process to build test envs upon, since restarting the whole process means re-downloading all the docker images from scratch, a lengthy process.

Next batch of PRs will be going out shortly. I had some other disruptions again (especially with kubernetes v1.18 code freeze PR reviews…), but I believe I have a workable approach for docker based nodes (which all current users are using, won’t work with podman though!) inbound.

restart seems to fit well with the other create/delete cluster commands. What’s the idea you had? Wondering if it actually fits the word restart or it’s something more.

@tao12345666333 I think ephemeral clusters are good, but not in 100% of use cases. If you organise, for example, a workshop or a meetup, you would like to prepare everything in advance (some days before) and, at the moment of the event, just spin up the cluster and that’s it. Like I did many times with minikube. Another example would be doing experiments. If I’m working, for example, with Calico, Cilium, Istio or else, I don’t want to deploy them every time I need to run a simple test. It would be way easier to have many clusters at a time, spin up the one you need, and then stop it again. Do my samples make sense?

It should roughly be:

  • list the containers matching the cluster name
  • for each …
    • docker {re}start
    • run the pre-boot fixes (mounts)
    • signal the entrypoint to boot
  • optionally --wait for the control-plane like create

It’ll look similar to create but skip a lot of steps and swap creating the containers for list & {re}start

We can also eventually have a very similar command like kind restart node
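
A rough shell sketch of the steps above (the cluster label, the /sys remount, and the SIGUSR1 signal are taken from the workaround posted earlier in this thread, not from kind’s actual implementation, so treat this as an approximation):

```shell
# restart_kind_nodes: hypothetical sketch of the restart flow described above.
restart_kind_nodes() {
  cluster="${1:-kind}"
  # list the containers matching the cluster name
  for node in $(docker ps -aq --filter "label=io.x-k8s.kind.cluster=${cluster}"); do
    docker restart "$node" || return 1
    # pre-boot fix: remount /sys read-only inside the node
    docker exec "$node" sh -c 'mount -o remount,ro /sys'
    # signal the entrypoint (PID 1) to continue booting
    docker exec "$node" kill -USR1 1
  done
}
```

This skips the `--wait` step; waiting for the control-plane would need a readiness poll on the API server, as create does.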

As I understand it, the project has never supported restarting multi-node clusters (only single nodes), but the documentation should really clearly specify this, so that we aren’t spending a lot of time doing complex multi-node work only to find it doesn’t survive a reboot or restart of docker. https://github.com/kubernetes-sigs/kind/issues/1689#issuecomment-889607041

@BenTheElder – Many thanks! This will make our lives easier! I was troubleshooting a weird Azure issue for the last couple of weeks, so I had no time for anything else. But this is awesome news.

@BenTheElder Is this going to have only internal support for restarting the cluster if the docker daemon restarts, or is it also going to have some type of support from the CLI (e.g. kind stop cluster/kind start cluster or kind pause cluster/kind unpause cluster)?

As a partial workaround to speed up pod creation in a re-created cluster, I mount the containerd directory as a volume on the host machine; it survives cluster recreation, so docker images are not downloaded every time after a restart. E.g. I use the following config for cluster creation:

kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
  extraMounts:
  - containerPath: /var/lib/containerd
    hostPath: /home/me/.kind/cache/containerd
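
Assuming the config above is saved as `kind-cache.yaml` (a hypothetical filename), the cluster can then be created with kind’s `--config` flag; the host cache directory should exist first:

```shell
# Hypothetical helper: create a cluster that reuses the containerd cache.
# kind-cache.yaml is the config shown above; the paths are assumptions.
create_cached_cluster() {
  mkdir -p "$HOME/.kind/cache/containerd"
  kind create cluster --config kind-cache.yaml
}
```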

Running docker start minio-demo-control-plane && docker exec minio-demo-control-plane sh -c 'mount -o remount,ro /sys; kill -USR1 1' worked for me 👍

On a recent version (>= 0.3.0) it should just be docker start <node-name>. The rest is handled in the entrypoint.

Please add restart.

We’d like to, but it’s not quite this simple to do correctly. 🙃 That snippet doesn’t work for multi-node clusters (see previous discussion around IP allocation etc.). For single node clusters currently it would just be an alias to docker start $NODE_NAME. It’s being worked on but is a bit lower priority than some Kubernetes testing concerns, ephemeral clusters are still recommended.

We have needed to restart a cluster several times, so I spent some time writing a script to restart the cluster and update the kubeconfig accordingly:

#!/usr/bin/env bash
KIND_CLUSTER="test"
KIND_CTX="kind-${KIND_CLUSTER}"

for container in $(kind get nodes --name ${KIND_CLUSTER}); do
      [[ $(docker inspect -f '{{.State.Running}}' $container) == "true" ]] || docker start $container
done
sleep 1
docker exec ${KIND_CLUSTER}-control-plane sh -c 'mount -o remount,ro /sys; kill -USR1 1'
kubectl config set clusters.${KIND_CTX}.server $(kind get kubeconfig --name ${KIND_CLUSTER} -q | yq read -j - | jq -r '.clusters[].cluster.server')
kubectl config set clusters.${KIND_CTX}.certificate-authority-data $(kind get kubeconfig --name ${KIND_CLUSTER} -q | yq read -j - | jq -r '.clusters[].cluster."certificate-authority-data"')
kubectl config set users.${KIND_CTX}.client-certificate-data $(kind get kubeconfig --name ${KIND_CLUSTER} -q | yq read -j - | jq -r '.users[].user."client-certificate-data"')
kubectl config set users.${KIND_CTX}.client-key-data $(kind get kubeconfig --name ${KIND_CLUSTER} -q | yq read -j - | jq -r '.users[].user."client-key-data"')

The client-cert and client-key shouldn’t change, but since I was already updating the port, which changes whenever the control-plane is restarted, updating all of them was just a safety check.
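
On recent kind releases, a simpler route may be to regenerate the kubeconfig entry wholesale with `kind export kubeconfig` (listed in the help output elsewhere in this thread) rather than patching each field; this wrapper is a made-up convenience, not part of kind:

```shell
# Hypothetical wrapper: refresh the kubeconfig entry for a restarted cluster.
refresh_kind_kubeconfig() {
  kind export kubeconfig --name "${1:-test}"
}
```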

+1, let’s please get this done. We use KIND as a local SDK in a multi-node cluster that has been configured to match our higher environments in terms of setup and security. The process is phenomenal until a developer restarts and the entire cluster is rendered useless. I understand this use case isn’t exactly the one KIND is designed for, but shifting left with such low overhead afforded to us by KIND has been a game-changer, and we would hate to have to revert to a single minikube node.

Local storage is fixed, working on this one again. /assign /lifecycle active

@carlisia As Ben said, we still recommend ephemeral clusters.

#408 is processing the command to add restart command, but before that, we need to deal with some network related issues #484

Thanks for the data point @janwillies. This is definitely not actually supported properly (yet?) and would/will require a number of fixes, some of which are in progress. In the mean time we’ve continued to push to make it cheaper to create / delete and test with clean clusters. When 0.4 releases we expect kubernetes 1.14.X to start in ~20s if the image is warm locally.

#461 removed the SIGUSR1 and mount fix commands, docker start should ~work for single-node clusters, multi-node will require an updated #408 😅

tentatively tracking for 0.3

Was the “restart” functionality ever shipped? I am using version 0.14.0 and don’t see a “restart” option in the help message.

I can’t figure out a way to restart my cluster:

╰─λ kind --help
kind creates and manages local Kubernetes clusters using Docker container 'nodes'

Usage:
kind [command]

Available Commands:
build       Build one of [node-image]
completion  Output shell completion code for the specified shell (bash, zsh or fish)
create      Creates one of [cluster]
delete      Deletes one of [cluster]
export      Exports one of [kubeconfig, logs]
get         Gets one of [clusters, nodes, kubeconfig]
help        Help about any command
load        Loads images into nodes
version     Prints the kind CLI version

Flags:
-h, --help              help for kind
--loglevel string   DEPRECATED: see -v instead
-q, --quiet             silence all stderr output
-v, --verbosity int32   info log verbosity, higher value produces more output
--version           version for kind

Use "kind [command] --help" for more information about a command.

I am on Arch Linux.

@aojea Ok, then we await a better solution. 😃

This works:

$ docker ps -aq --filter 'label=io.x-k8s.kind.cluster' | awk '{print $1}' | xargs docker start
389fbc7f27c0
8234fdc273f5

That works ONLY if the container gets assigned the same IP it had before it was stopped. Docker uses the IPAM implemented in libnetwork, which doesn’t guarantee the container will get the same IP.

To start with I’m focusing solely on having it automatically restart correctly, but once those fixes are in place I expect stop / start / pause / unpause will make sense as a future step.

Just installed kind from the default branch, and a one-node kind cluster works well after a container restart. I have tried kill + start, and a docker daemon restart. Thank you!

This is on my radar, we’ve just had some other pressing changes to tackle (mostly around testing kubernetes, the stated #1 priority and original reason for the project) and nobody has proposed a maintainable solution to the network issues yet. I’ll look at this more this cycle.

The main problem is that the container is not guaranteed to take the same IP that was assigned before the reboot, and that will break the cluster.

However, one user reported a working method in the slack channel https://kubernetes.slack.com/archives/CEKK1KTN2/p1565109268365000

cscetbon 6:34 PM
@Gustavo Sousa what I use :
alias kpause='kind get nodes|xargs docker pause'
alias kunpause='kind get nodes|xargs docker unpause'
(edited)
