cluster-api: KCP: upgrade gets stuck after removing etcd member

What happened

When we initiated a KCP upgrade on a self-hosted (“pivoted”) cluster, we observed the following sequence of events:

The KCP controller created a new machine, and it successfully joined the cluster (once we got rid of the smart quotes in the kubeadm config, but that’s a different story).

At this point, the KCP logs look pretty standard:

E0311 21:59:52.529099       1 kubeadm_control_plane_controller.go:548] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="there are 3 control plane nodes, but 4 etcd members" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
I0311 22:00:11.441365       1 kubeadm_control_plane_controller.go:269] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

But then something surprising happened (immediately following the above statement):

I0311 22:00:24.199072       1 leaderelection.go:288] failed to renew lease capi-kubeadm-control-plane-system/kubeadm-control-plane-manager-leader-election-capi: failed to tryAcquireOrRenew context deadline exceeded
E0311 22:00:24.199149       1 main.go:121] setup "msg"="problem running manager" "error"="leader election lost"

And when the pod restarted and came back up, the controller resumed the upgrade but the health check started failing in a new way:

I0311 22:00:43.533246       1 kubeadm_control_plane_controller.go:269] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
E0311 22:00:49.132408       1 kubeadm_control_plane_controller.go:548] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="[there are 4 control plane nodes, but 3 etcd members, node \"ip-10-0-0-1.ec2.internal\": failed to create etcd client: unable to create etcd client: context deadline exceeded, node \"ip-10-0-0-2.ec2.internal\": failed to create etcd client: unable to create etcd client: context deadline exceeded]" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

which repeats a few times for about a minute, complaining about connection issues to different etcd members (but always including ip-10-0-0-1.ec2.internal, the machine marked for removal). Then there’s a spate of this error:

W0311 22:01:55.293805       1 http.go:392] Error reading backend response: unexpected EOF
W0311 22:01:55.293822       1 http.go:392] Error reading backend response: unexpected EOF
W0311 22:01:55.294615       1 http.go:392] Error reading backend response: unexpected EOF
W0311 22:01:55.294743       1 http.go:392] Error reading backend response: unexpected EOF

and then things settle down, leaving the controller consistently stuck on:

I0311 22:02:02.347903       1 kubeadm_control_plane_controller.go:269] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
E0311 22:02:05.741965       1 kubeadm_control_plane_controller.go:548] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="[there are 4 control plane nodes, but 3 etcd members, node \"ip-10-0-0-1.ec2.internal\": failed to create etcd client: unable to create etcd client: context deadline exceeded]" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

It looks like things got stuck between removing the etcd member and marking the member as removed via the annotation, because once we manually added the kubeadm.controlplane.cluster.x-k8s.io/scale-down-etcd-member-removed annotation the upgrade was again able to proceed. It got stuck in the same way on the next node, but the one after that was smooth.
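
For reference, the manual workaround amounts to adding that annotation to the Machine KCP selected for scale-down. A minimal Go sketch of the same patch via the controller-runtime client follows; the Machine name/namespace, the v1alpha3 API version, and the empty annotation value are assumptions for illustration (only the presence of the annotation appears to matter). The same effect can of course be achieved with kubectl annotate.

package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// The Machine KCP selected for scale-down; name and namespace are placeholders.
	m := &unstructured.Unstructured{}
	m.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "cluster.x-k8s.io", Version: "v1alpha3", Kind: "Machine",
	})
	ctx := context.Background()
	if err := c.Get(ctx, types.NamespacedName{Namespace: "test", Name: "test-controlplane-abcde"}, m); err != nil {
		panic(err)
	}

	// Add the annotation KCP would normally set after removing the etcd member.
	// The value is assumed to be insignificant; presence is what matters.
	patch := client.MergeFrom(m.DeepCopy())
	anns := m.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	anns["kubeadm.controlplane.cluster.x-k8s.io/scale-down-etcd-member-removed"] = ""
	m.SetAnnotations(anns)
	if err := c.Patch(ctx, m, patch); err != nil {
		panic(err)
	}
}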

Summary

We noticed two things from this sequence of events:

  1. The KCP controller is not able to recover effectively from a failure in etcd membership management (the self-hosting is pretty reliably injecting a fault into this process, but is not essential to the issue)
  2. The upgrade process itself is fairly disruptive to API clients. We (the controller + the humans poking at things) observed a fair number of hung connections and refused connections during the upgrade itself

Solutioneering

Ideas for building some more resilience into the etcd membership process include:

  • Teaching the healthcheck to ignore this condition (4 nodes, 3 members, and the removed member can’t be communicated with) even if the machine hasn’t been annotated (see the sketch after this list)
  • Pre-marking the machine with an “about to remove” annotation, before we start the etcd member removal, as a declaration of intent
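
To make the first idea concrete, here is a rough sketch of what a more tolerant check could look like. This is not the actual KCP health-check code; the function and parameter names are hypothetical, and the only real input it mirrors is the node-count vs. member-count comparison that produces the errors quoted above.

package kcp

import "fmt"

// etcdHealthTolerable is a hypothetical, more forgiving version of the
// node-count vs. member-count check. It tolerates exactly one extra control
// plane node whose etcd member is already gone, but only when that node is
// the machine we intend to remove anyway.
func etcdHealthTolerable(nodes, members []string, unreachable map[string]error, markedForRemoval string) error {
	if len(nodes) == len(members) && len(unreachable) == 0 {
		return nil
	}
	if len(nodes) == len(members)+1 {
		for node, err := range unreachable {
			if node != markedForRemoval {
				return fmt.Errorf("node %q failed the etcd check and is not marked for removal: %v", node, err)
			}
		}
		// The only problem is the member we already removed; treat as healthy.
		return nil
	}
	return fmt.Errorf("there are %d control plane nodes, but %d etcd members", len(nodes), len(members))
}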

I’ll open a separate issue around the upgrade disruptiveness for discussion on that. In this case a less disruptive upgrade process would help in that it would make failures during the removal less likely.

Most upvoted comments

I’d be glad to help out with testing – this is my top goal today & tomorrow.

I am anti-testing. Please ship the code immediately.

-waits for this comment to show up in a performance review-

After chatting a little more with @chuckha, it seems we have not 1, not 2, but multiple issues in here! The one I posted above is the next step, after we fix the healthcheck / make it more flexible.

Let’s take it one bit at a time and make sure to test it after each iteration (because more issues might pop up).

Can we work out a solution that doesn’t require annotations on objects? I’d rather have ways to understand the status by querying the source

https://github.com/kubernetes-sigs/cluster-api/blob/d06be1c58073953bc21e3d91762bd98dd28290c9/controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go#L534-L548

This annotation can probably be removed today if we query etcd and check if the current node is in the list or not. wdyt?
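
Something along those lines could look like the sketch below, assuming direct access to the etcd endpoints and using the upstream clientv3 package (KCP actually reaches etcd through its own tunneled client, so the endpoint and TLS wiring here are stand-ins):

package kcp

import (
	"context"
	"crypto/tls"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// memberExists reports whether an etcd member with the given name is still in
// the cluster's member list. If it is not, the scale-down can proceed without
// consulting any annotation on the Machine.
func memberExists(ctx context.Context, endpoints []string, tlsCfg *tls.Config, name string) (bool, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
		TLS:         tlsCfg,
	})
	if err != nil {
		return false, err
	}
	defer cli.Close()

	resp, err := cli.MemberList(ctx)
	if err != nil {
		return false, err
	}
	for _, m := range resp.Members {
		if m.Name == name {
			return true, nil
		}
	}
	return false, nil
}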

I think improving the healthcheck would remove the need for that annotation, +1 here.

/milestone v0.3.2
/assign
/priority important-soon
/lifecycle active

Had a quick chat with Seth, Mytu, and @vincepri. We think something that could help here would be to abandon the kubeadm.controlplane.cluster.x-k8s.io/scale-down-etcd-member-removed annotation and instead always check whether the etcd member for the selected node exists, then remove the member and delete the Machine if it does.
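
A rough sketch of that annotation-free flow is below. The etcdClient interface and memberName parameter are hypothetical stand-ins for whatever KCP's workload-cluster clients actually expose, and the Machine is deleted unconditionally once the member is known to be gone, which is one way to make the retry path idempotent; the real ordering in KCP may differ. The point is that the etcd check is redone on every reconcile, so a crash between the member removal and the Machine deletion simply retries.

package kcp

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// etcdClient is a hypothetical interface standing in for KCP's etcd access.
type etcdClient interface {
	MemberExists(ctx context.Context, name string) (bool, error)
	RemoveMember(ctx context.Context, name string) error
}

// scaleDownMachine removes the node's etcd member if it is still present, then
// deletes the Machine. No annotation is needed: if the controller crashes
// after the member removal, the next reconcile finds the member already gone
// and goes straight to the deletion.
func scaleDownMachine(ctx context.Context, etcd etcdClient, c client.Client, machine *unstructured.Unstructured, memberName string) error {
	exists, err := etcd.MemberExists(ctx, memberName)
	if err != nil {
		return err
	}
	if exists {
		if err := etcd.RemoveMember(ctx, memberName); err != nil {
			return err
		}
	}
	return c.Delete(ctx, machine)
}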