kind: HA clusters don't reboot properly
First reported in https://github.com/kubernetes-sigs/kind/issues/1685; now tracked in this updated bug.
Reproduce with:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
```

Then restart Docker.
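For concreteness, a minimal sketch of the reproduction, assuming the config above is saved as `ha.yaml`, a hypothetical cluster name `ha-test`, and a host where Docker is managed by systemd:

```bash
# Create a three-control-plane HA cluster from the config above
kind create cluster --name ha-test --config ha.yaml

# Sanity check: the API server answers through the load balancer
kubectl get nodes

# Restart the Docker daemon (adjust for non-systemd platforms)
sudo systemctl restart docker

# After the restart this typically hangs or errors, because the API
# server is no longer reachable through the load balancer
kubectl get nodes
```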
https://github.com/kubernetes-sigs/kind/issues/2045 (multi-node restart, not multi-control-plane restart) should be fixed at HEAD thanks to the patient work of @tnqn. I have not dug into the multi-control-plane case yet, and probably won't be able to immediately; perhaps next week.
Hi @BenTheElder, I know that using DNS names is the cleanest solution for issue #2045. However, I am using this script as a workaround: it assigns static IPs for node communication. I have restarted my cluster several times and it has worked fine so far.
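The script itself isn't included in the thread; the sketch below illustrates the static-IP idea only. The `kind` network name, the node-container label, and the two-phase save/restore flow are assumptions for illustration, not details from the comment.

```bash
#!/usr/bin/env bash
# Sketch: record each node's IP while the cluster is healthy, then
# after a Docker restart reattach every node container at its
# recorded address, so the IPs Kubernetes has on disk stay valid.
set -euo pipefail

NETWORK=kind                  # assumed default kind network name
STATE=/tmp/kind-node-ips.txt

save_ips() {                  # run BEFORE restarting Docker
  docker ps --filter 'label=io.x-k8s.kind.cluster' --format '{{.Names}}' |
    while read -r node; do
      ip=$(docker inspect -f \
        "{{(index .NetworkSettings.Networks \"$NETWORK\").IPAddress}}" "$node")
      echo "$node $ip"
    done > "$STATE"
}

restore_ips() {               # run AFTER restarting Docker
  while read -r node ip; do
    # Reconnect with an explicit --ip; this only works if no other
    # container has grabbed the address in the meantime.
    docker network disconnect "$NETWORK" "$node" || true
    docker network connect --ip "$ip" "$NETWORK" "$node"
  done < "$STATE"
}

"$@"  # usage: ./pin-ips.sh save_ips | ./pin-ips.sh restore_ips
```

The essential point is recording the addresses while the cluster is healthy; after a daemon restart Docker may already have handed the nodes different IPs, leaving nothing reliable to read back.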
This still needs root-causing, but there are multiple user reports. We should fix this.
We document this sort of thing at https://kind.sigs.k8s.io/docs/user/known-issues/, which the quick start links to prominently, but it seems this issue hasn't made it there yet. Earlier versions did not support host restart at all; it wasn't in scope early in the project.
This is not without its own drawbacks.
Multi-node clusters are a necessity for testing Kubernetes itself (where we expect clusters to be disposable over the course of developing some change to Kubernetes). For development of applications, we expect single node clusters to be most reasonable (and this is the case where it may make sense to persist them, though we’d still encourage regularly testing from a clean state).
The case described above seems rather rare, and I'm not sure it outweighs adding a broken partial solution that people will then depend on in the future, even if we later find some better design.
I’m not saying we definitely couldn’t do this, but I wouldn’t jump to doing it today.
That’s a neat script!
It’s unfortunately not super workable as an approach to a built-in solution though. Users creating clusters concurrently in CI (and potentially with a “remote” daemon due to containerized CI) are very important to us and this approach is not safe there.
Users may have multiple clusters, and that is hard to support. However, your script is great; I think it can also help with snapshotting HA clusters.
@velcrine that’s actually a variation on the issues in https://github.com/kubernetes-sigs/kind/issues/2045
HA has an additional problem: the load balancer breaks API reachability after a restart, in which case you wouldn't even be able to query the cluster to diagnose those problems.
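One way to observe that mismatch (a diagnostic sketch only: the cluster name, the load balancer container name, and the haproxy config path are assumptions based on kind's default naming, not details from this thread):

```bash
CLUSTER=ha-test  # hypothetical cluster name

# Backend addresses the load balancer was configured with
docker exec "${CLUSTER}-external-load-balancer" \
  cat /usr/local/etc/haproxy/haproxy.cfg | grep server

# The control-plane nodes' actual IPs after the restart; if these
# differ from the haproxy backends, the API server is unreachable
for node in "${CLUSTER}-control-plane" \
            "${CLUSTER}-control-plane2" \
            "${CLUSTER}-control-plane3"; do
  docker inspect -f \
    '{{(index .NetworkSettings.Networks "kind").IPAddress}}' "$node"
done
```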
FWIW, regarding HA: nobody is working on or using this feature much, and it's simplistic / not fully designed. This issue is unlikely to see work anytime soon (priority/backlog).
The other issue (https://github.com/kubernetes-sigs/kind/issues/2045) is one I'm sure someone would work on, except nobody has yet posited a good solution we can agree on or root-caused the problems.