cluster-api: During KCP rollout, etcd churn can prevent renewal of the bootstrap token causing KCP node registration to fail

/kind bug

Follow up from #2770

What steps did you take and what happened:

Deploy a 3-node control plane cluster using CAPV with kube-vip on Photon OS (for some reason, the problem is more likely to occur there), then set kcp.spec.upgradeAfter to trigger a rollout.

Due to a combination of https://github.com/kube-vip/kube-vip/issues/214 and the fact that etcd learner mode is not yet implemented in kubeadm or fully supported in Kubernetes, the etcd leader changes many times, and the kube-vip leader election rotates along with it. During this time, the CAPI controllers cannot fully reconcile, and kubelet cannot register nodes. Importantly, CABPK is also unable to renew the bootstrap token. etcd replication eventually completes, but only after the 15 minute bootstrap token timeout has passed. kubelet node registration then fails, and we end up with an orphaned control plane machine that is still a valid member of the etcd cluster, which complicates cleanup.

What did you expect to happen:

Rollout succeeds.

Anything else you would like to add:

For cases like this, it might be worth making the bootstrap token TTL configurable. Simply choosing a longer TTL cannot account for every type of environment, so we should make it configurable and let vendors and cluster operators decide, since there is a security trade-off.

Furthermore, VMware would like to request a backport to the v0.3.x branch.

Environment:

Cluster-api version: v0.3.22
Kubernetes version: (use kubectl version): v1.21
OS (e.g. from /etc/os-release): Photon 3

About this issue

State: open
Created 3 years ago
Comments: 22 (21 by maintainers)

Most upvoted comments

Any pointers on that would be handy.

Here is the KEP issue. It is still WIP, and note that it is for k8s LE: https://github.com/kubernetes/enhancements/issues/2835

neolit123 on Oct 22, 2021