kubeadm: kubeadm join on control plane node failing: timeout waiting for etcd

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:08:27Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: cluster-api / capz
  • OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="18.04.5 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.5 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a): Linux acse-test-capz-repro-c8cd6-control-plane-9kvrx 5.4.0-1041-azure #43~18.04.1-Ubuntu SMP Fri Feb 26 13:02:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Others: Cluster built using cluster-api from this capz example template:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/cluster-template.yaml

tl;dr 3 control plane nodes, 1 node pool w/ 1 worker node

What happened?

  1. The first control plane node comes online and becomes Ready (kubeadm init).
  2. The second control plane node bootstraps via kubeadm join but never comes online/becomes Ready.

From the cloud-init logs, kubeadm tells us that it timed out waiting for etcd:

[2021-04-16 22:09:39] [etcd] Announced new etcd member joining to the existing etcd cluster
[2021-04-16 22:09:39] [etcd] Creating static Pod manifest for "etcd"
[2021-04-16 22:09:39] [etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[2021-04-16 22:10:12] [kubelet-check] Initial timeout of 40s passed.
[2021-04-16 22:42:38] error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
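
For anyone hitting the same timeout, these are roughly the checks that narrow down where the join is stuck. This is a sketch only, assuming the standard kubeadm stacked-etcd layout with certificates under /etc/kubernetes/pki/etcd; note that etcdctl is not installed by kubeadm itself, so it has to be run inside the etcd container (e.g. via crictl exec) or installed separately:

# On the joining node: was the etcd static pod manifest written, and did the kubelet start it?
ls /etc/kubernetes/manifests/
crictl ps -a | grep etcd
journalctl -u kubelet --no-pager | tail -n 100

# On the first (healthy) control plane node: ask etcd which members it knows about.
# A member that was announced by the failed join but never started will show up as unstarted.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list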

What you expected to happen?

This does not repro in other Kubernetes versions. I’ve tested 1.19.7 specifically. I expected 1.20.5 to bootstrap as 1.19.7 does.

How to reproduce it (as minimally and precisely as possible)?

I have a repro script:

https://github.com/jackfrancis/cluster-api-provider-azure/blob/repro/repro.sh
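
For reference, the general shape of such a repro against the capz template is sketched below. This is a hedged outline, not a transcript of repro.sh: depending on the clusterctl version the command may be clusterctl config cluster or clusterctl generate cluster, and the template expects the usual AZURE_* credential environment variables to be exported first.

# Against a management cluster that already has the capz provider initialized.
clusterctl config cluster capz-repro \
  --kubernetes-version v1.20.5 \
  --control-plane-machine-count 3 \
  --worker-machine-count 1 \
  > capz-repro.yaml
kubectl apply -f capz-repro.yaml

# Watch the control plane machines; with v1.20.5 the second one never becomes Ready.
kubectl get kubeadmcontrolplane,machines --all-namespaces -w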

Anything else we need to know?

Most upvoted comments

@neolit123 I see your point that we want to reduce the likelihood of kubelet race conditions.

In the meantime we will continue to investigate how to produce a working 1.20+ kubeadm solution for folks.

I’ll follow the issue you linked and close this one for now, thanks!
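
For anyone who lands here with a node stuck in this state: because kubeadm announces the new etcd member before the local etcd pod ever starts (see the log above), a failed control-plane join can leave a stale, unstarted member behind in the existing cluster. The recovery sequence below is a rough sketch, assuming stacked etcd; <MEMBER_ID> and the join arguments are placeholders:

# On a healthy control plane node: list members, then remove the stale one left by the failed join.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member remove <MEMBER_ID>

# On the node whose join failed: wipe the partial state, then retry the join.
kubeadm reset -f
kubeadm join <LOAD_BALANCER_ENDPOINT>:6443 --control-plane \
  --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH> \
  --certificate-key <CERTIFICATE_KEY>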