kind: kind create cluster fails to remove control plane taint

What happened:

I tried to create a cluster with kind create cluster and received the error “failed to remove control plane taint”.

What you expected to happen:

The cluster is created successfully.

How to reproduce it (as minimally and precisely as possible):

Install kind version v0.14.0-arm64 and call kind create cluster

Anything else we need to know?:

I’m running kind inside a container with the host’s (macOS, M1 Max) docker socket mounted, and I’m able to run other containers with docker run.

Logs:

$ kind create cluster --loglevel=debug
WARNING: --loglevel is deprecated, please switch to -v and -q!
Creating cluster "kind" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.24.0@sha256:0866296e693efe1fed79d5e6c7af8df71fc73ae45e3679af05342239cdc5bc8e present locally
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼
 ✓ Preparing nodes 📦  
DEBUG: config/config.go:96] Using the following kubeadm config for node kind-control-plane:
apiServer:
  certSANs:
  - localhost
  - 127.0.0.1
  extraArgs:
    runtime-config: ""
apiVersion: kubeadm.k8s.io/v1beta3
clusterName: kind
controlPlaneEndpoint: kind-control-plane:6443
controllerManager:
  extraArgs:
    enable-hostpath-provisioner: "true"
kind: ClusterConfiguration
kubernetesVersion: v1.24.0
networking:
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/16
scheduler:
  extraArgs: null
---
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- token: abcdef.0123456789abcdef
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 172.19.0.2
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
  kubeletExtraArgs:
    node-ip: 172.19.0.2
    node-labels: ""
    provider-id: kind://docker/kind/kind-control-plane
---
apiVersion: kubeadm.k8s.io/v1beta3
controlPlane:
  localAPIEndpoint:
    advertiseAddress: 172.19.0.2
    bindPort: 6443
discovery:
  bootstrapToken:
    apiServerEndpoint: kind-control-plane:6443
    token: abcdef.0123456789abcdef
    unsafeSkipCAVerification: true
kind: JoinConfiguration
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
  kubeletExtraArgs:
    node-ip: 172.19.0.2
    node-labels: ""
    provider-id: kind://docker/kind/kind-control-plane
---
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
cgroupRoot: /kubelet
evictionHard:
  imagefs.available: 0%
  nodefs.available: 0%
  nodefs.inodesFree: 0%
failSwapOn: false
imageGCHighThresholdPercent: 100
kind: KubeletConfiguration
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
conntrack:
  maxPerCore: 0
iptables:
  minSyncPeriod: 1s
kind: KubeProxyConfiguration
mode: iptables
 ✓ Writing configuration 📜 
DEBUG: kubeadminit/init.go:82] I0808 18:27:28.895581     126 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
W0808 18:27:28.896451     126 initconfiguration.go:332] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.24.0
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0808 18:27:28.900057     126 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0808 18:27:29.115670     126 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kind-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.19.0.2 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0808 18:27:29.338086     126 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0808 18:27:29.421219     126 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0808 18:27:29.554232     126 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0808 18:27:29.615892     126 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0808 18:27:30.083897     126 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
I0808 18:27:30.124183     126 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
I0808 18:27:30.254718     126 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0808 18:27:30.362542     126 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0808 18:27:30.463815     126 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0808 18:27:30.698207     126 kubelet.go:65] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0808 18:27:30.784998     126 manifests.go:99] [control-plane] getting StaticPodSpecs
I0808 18:27:30.785329     126 certs.go:522] validating certificate period for CA certificate
I0808 18:27:30.785397     126 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0808 18:27:30.785414     126 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0808 18:27:30.785417     126 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0808 18:27:30.785420     126 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0808 18:27:30.785424     126 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I0808 18:27:30.786696     126 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0808 18:27:30.786710     126 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0808 18:27:30.786809     126 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0808 18:27:30.786818     126 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0808 18:27:30.786821     126 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0808 18:27:30.786823     126 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0808 18:27:30.786826     126 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0808 18:27:30.786828     126 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0808 18:27:30.786830     126 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I0808 18:27:30.787252     126 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0808 18:27:30.787273     126 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0808 18:27:30.787392     126 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0808 18:27:30.787617     126 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0808 18:27:30.787952     126 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0808 18:27:30.787989     126 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0808 18:27:30.788334     126 loader.go:372] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0808 18:27:30.790085     126 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
 ✗ Starting control-plane 🕹️ 
ERROR: failed to create cluster: failed to remove control plane taint: command "docker exec --privileged kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes --all node-role.kubernetes.io/control-plane- node-role.kubernetes.io/master-" failed with error: exit status 1
Command Output: The connection to the server kind-control-plane:6443 was refused - did you specify the right host or port?
Stack Trace: 
sigs.k8s.io/kind/pkg/errors.WithStack
        sigs.k8s.io/kind/pkg/errors/errors.go:59
sigs.k8s.io/kind/pkg/exec.(*LocalCmd).Run
        sigs.k8s.io/kind/pkg/exec/local.go:124
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.(*nodeCmd).Run
        sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/node.go:146
sigs.k8s.io/kind/pkg/cluster/internal/create/actions/kubeadminit.(*action).Execute
        sigs.k8s.io/kind/pkg/cluster/internal/create/actions/kubeadminit/init.go:140
sigs.k8s.io/kind/pkg/cluster/internal/create.Cluster
        sigs.k8s.io/kind/pkg/cluster/internal/create/create.go:135
sigs.k8s.io/kind/pkg/cluster.(*Provider).Create
        sigs.k8s.io/kind/pkg/cluster/provider.go:182
sigs.k8s.io/kind/pkg/cmd/kind/create/cluster.runE
        sigs.k8s.io/kind/pkg/cmd/kind/create/cluster/createcluster.go:80
sigs.k8s.io/kind/pkg/cmd/kind/create/cluster.NewCommand.func1
        sigs.k8s.io/kind/pkg/cmd/kind/create/cluster/createcluster.go:55
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.4.0/command.go:902
sigs.k8s.io/kind/cmd/kind/app.Run
        sigs.k8s.io/kind/cmd/kind/app/main.go:53
sigs.k8s.io/kind/cmd/kind/app.Main
        sigs.k8s.io/kind/cmd/kind/app/main.go:35
main.main
        sigs.k8s.io/kind/main.go:25
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_arm64.s:1263

Environment:

  • kind version: (use kind version): kind v0.14.0 go1.18.2 linux/arm64
  • Kubernetes version: (use kubectl version): Client Version: v1.24.3 Kustomize Version: v4.5.4
  • Docker version: (use docker info):
Client:
  Context:    default
  Debug Mode: false
  Plugins:
    buildx: Docker Buildx (Docker Inc., 0.8.2+azure-1)
    compose: Docker Compose (Docker Inc., 2.9.0+azure-1)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.104-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 5
 Total Memory: 14.62GiB
 Name: docker-desktop
 ID: DWAP:AOR6:N5DU:HCAK:GC35:RRZ6:4YMP:4JVL:UJ66:GKCY:N6RR:VAAL
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (7 by maintainers)

Most upvoted comments

Do you think it would make sense to add some retry mechanism to this taint call?

Tentatively, that’s just patching over one particular symptom of the networking being broken.

What’s happening now is that the whole creation gets rolled back after the cluster is already created, effectively due to networking.

Well yes, we can’t very well run a functional cluster with broken networking.

I’m wondering if it might make sense to add an exponential backoff retry (with a fairly low max) to make this more reliable.

I don’t think that’s reasonable after the API server is up; this is a local API call executed on one of the control plane nodes itself. We already have an exponential retry in kubeadm waiting for the API server to be ready. Perhaps one retry, but again, this should not flake. It should be a very cheap local call; if it’s failing, it’s a symptom of the cluster being in a bad state of some sort.

I suggested a possible solution above, but I’d like to understand what / why this is actually broken before I jump on making any changes.
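For illustration only, here is a minimal sketch of the bounded exponential-backoff idea in Go. It is not kind’s actual code; retryWithBackoff and the simulated failure are hypothetical stand-ins for wrapping the existing node.Command("kubectl", taintArgs...).Run() call:

package main

import (
	"fmt"
	"time"
)

// retryWithBackoff retries fn with exponential backoff, capped at a small
// number of attempts, and returns the last error if all attempts fail.
func retryWithBackoff(fn func() error, attempts int, initial time.Duration) error {
	var err error
	delay := initial
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2 // keep the total wait small, per the discussion above
	}
	return fmt.Errorf("failed after %d attempts: %w", attempts, err)
}

func main() {
	// Example usage with a stand-in for the taint command.
	err := retryWithBackoff(func() error {
		return fmt.Errorf("connection refused") // simulated failure
	}, 3, 500*time.Millisecond)
	fmt.Println(err)
}

As the maintainer notes above, such a retry should rarely be needed once the API server is healthy, since the taint call is local and cheap.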

There’s no reason resolving the container names should fail; docker is responsible for this and I’ve

Or put differently, it’s not docker exec --privileged that’s failing, but rather kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes … within the container (and it’s executed within that container regardless of how you called docker, I’d assume).

Yes, but it only seems to fail when the docker socket is mounted while using kind, and it seems to be related to DNS issues. That makes me think mounting the docker socket when creating the cluster is leading to somewhat broken DNS in the cluster, which doesn’t make sense given my understanding of how docker implements DNS, but then none of this makes sense … The DNS response for the node name should be local from docker and should be quick and reliable ™️

So far we’ve had no reports of this with a standard local docker socket, i.e. without containerizing kind itself or using docker over TCP, though I can’t fathom why those would be relevant.

Unfortunately, without a way to replicate this, I’m reliant on you all to identify why docker containers are not reliably able to resolve themselves or what else is making this call fail.

After some investigating: the problem is that name resolution in the control-plane container doesn’t work immediately after the container is started. Adding a sleep before running the remove-taint command works as a hacky fix.

diff --git a/pkg/cluster/internal/create/actions/kubeadminit/init.go b/pkg/cluster/internal/create/actions/kubeadminit/init.go
index cc587940..e9778ce9 100644
--- a/pkg/cluster/internal/create/actions/kubeadminit/init.go
+++ b/pkg/cluster/internal/create/actions/kubeadminit/init.go
@@ -19,6 +19,7 @@ package kubeadminit
 
 import (
        "strings"
+       "time"
 
        "sigs.k8s.io/kind/pkg/errors"
        "sigs.k8s.io/kind/pkg/exec"
@@ -135,6 +136,7 @@ func (a *action) Execute(ctx *actions.ActionContext) error {
                taintArgs := []string{"--kubeconfig=/etc/kubernetes/admin.conf", "taint", "nodes", "--all"}
                taintArgs = append(taintArgs, taints...)
 
+               time.Sleep(5 * time.Second)
                if err := node.Command(
                        "kubectl", taintArgs...,
                ).Run(); err != nil {

IDK why this is the case with our environments. We are both running docker-from-docker on Apple M1; not sure how much of that is a coincidence.

Thoughts @BenTheElder ?
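As a follow-up illustration, a hedged alternative to the fixed 5-second sleep would be to wait until the node name actually resolves before running the remove-taint command. The sketch below is hypothetical (waitForDNS is not kind code), and in practice the check would have to run inside the control-plane container, e.g. via docker exec, since that is where resolution appears to be failing:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForDNS polls until hostname resolves or the timeout expires.
func waitForDNS(hostname string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	resolver := &net.Resolver{}
	for {
		if _, err := resolver.LookupHost(ctx, hostname); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("name %q did not resolve within %s", hostname, timeout)
		case <-time.After(250 * time.Millisecond):
		}
	}
}

func main() {
	// Wait up to 30s for the control-plane node name to become resolvable
	// before proceeding (e.g. before the kubectl taint call).
	if err := waitForDNS("kind-control-plane", 30*time.Second); err != nil {
		fmt.Println(err)
	}
}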