cluster-api: [KCP] Fails to upgrade single node control plane

What steps did you take and what happened: On a single-node control plane, I upgraded the Kubernetes version and the etcd and CoreDNS image tags by modifying the KCP object. A new node with the upgraded Kubernetes version is created and the old node is physically deleted, but the old Node object and all pods that were on the old node are left dangling.

root@cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr:/# kubectl get nodes -A
NAME                                                  STATUS     ROLES    AGE     VERSION
cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj     NotReady   master   7m11s   v1.17.0
cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr     Ready      master   6m20s   v1.17.2
cluster-tvnb5a-cluster-tvnb5a-md-0-66496845f6-kvh9v   Ready      <none>   6m51s   v1.17.0
root@cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr:/# kubectl get pods -A
NAMESPACE     NAME                                                                        READY   STATUS             RESTARTS   AGE
kube-system   coredns-6955765f44-dpwvr                                                    1/1     Terminating        0          7m12s
kube-system   coredns-6955765f44-g4df6                                                    1/1     Terminating        0          7m12s
kube-system   coredns-7987b8d68f-f2d78                                                    1/1     Running            0          4m24s
kube-system   coredns-7987b8d68f-rftvl                                                    1/1     Running            0          4m23s
kube-system   etcd-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj                      0/1     CrashLoopBackOff   3          7m16s
kube-system   etcd-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr                      1/1     Running            0          6m19s
kube-system   kindnet-s6rvq                                                               1/1     Running            0          7m8s
kube-system   kindnet-wmxqx                                                               1/1     Running            0          6m37s
kube-system   kindnet-xvdks                                                               1/1     Running            0          7m12s
kube-system   kube-apiserver-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj            1/1     Running            0          7m16s
kube-system   kube-apiserver-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr            1/1     Running            0          6m36s
kube-system   kube-controller-manager-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj   1/1     Running            1          7m16s
kube-system   kube-controller-manager-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr   1/1     Running            0          6m35s
kube-system   kube-proxy-gfghs                                                            1/1     Running            0          4m22s
kube-system   kube-proxy-lj9lv                                                            1/1     Terminating        0          7m12s
kube-system   kube-proxy-qlgpw                                                            1/1     Running            0          3m57s
kube-system   kube-scheduler-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj            1/1     Running            1          7m16s
kube-system   kube-scheduler-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr            1/1     Running            0          6m35s
root@cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr:/#

What did you expect to happen: I expected KCP to remove the Node object and clean up the resources that were on that node.

/kind bug
/area control-plane

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 26 (12 by maintainers)

Most upvoted comments

Yes, in the scale down. It is failing to remove the etcd member, hence the machine is never deleted.

[manager] E0416 21:46:02.455746 8 scale.go:117] "msg"="Failed to remove etcd member for machine" "error"="failed to create etcd client: unable to create etcd client: context deadline exceeded" "cluster-nanme"="sedef" "name"="sedef" "namespace"="default"

I don’t understand why it happens only sometimes though.

I have seen this consistently in metal3-dev-env during the Kubernetes version upgrade process, both during scale up and during scale down, once the upgrade gets far enough to reach the scale down phase.

@sedefsavas Do you have any update on this issue, or a rough timeline?

Can you run a test locally with a custom build, after adding a check that the NodeRef is set for the leaderCandidate, and see if that fixes it?

@vincepri @detiber Sorry for the confusion. I am not scaling down; I changed the hardcoded control plane replica count in the test from 3 to 1.
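The NodeRef check suggested above could look roughly like this. The types here are hypothetical, trimmed-down stand-ins, not the actual Cluster API structs (the real Machine type lives in sigs.k8s.io/cluster-api):

```go
package main

import "fmt"

// NodeRef and Machine are simplified stand-ins for the Cluster API types.
type NodeRef struct{ Name string }

type Machine struct {
	Name    string
	NodeRef *NodeRef // nil until the controller links the Machine to a Node
}

// pickLeaderCandidate returns the first machine that already has a NodeRef,
// i.e. one whose Node object exists and can take over etcd leadership.
// Skipping machines without a NodeRef is the check suggested above: without
// it, scale down may target a machine whose Node is gone and then fail to
// move etcd leadership or remove the etcd member.
func pickLeaderCandidate(machines []Machine) (*Machine, error) {
	for i := range machines {
		if machines[i].NodeRef != nil {
			return &machines[i], nil
		}
	}
	return nil, fmt.Errorf("no machine with a NodeRef available to become etcd leader")
}

func main() {
	machines := []Machine{
		{Name: "control-plane-9jqkj"},                                   // old machine, Node object dangling
		{Name: "control-plane-cdrtr", NodeRef: &NodeRef{Name: "cdrtr"}}, // new, ready machine
	}
	m, err := pickLeaderCandidate(machines)
	fmt.Println(m.Name, err) // control-plane-cdrtr <nil>
}
```

With a guard like this, the scale down path would only forward etcd leadership to a machine whose Node actually exists, instead of timing out against the one being removed.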