kops: Upgrading from 1.21.X to 1.22.2 crashes api-server/etcd

/kind bug

**1. What kops version are you running?** 1.22.1

**2. What Kubernetes version are you running?** 1.21.4

**3. What cloud provider are you using?** AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**

kops-122-1 toolbox template --template cluster.tmpl.yaml --values common-values.yaml --values staging-values.yaml --snippets ./snippets > staging-cluster.yaml

kops-122-1 replace -f staging-cluster.yaml

kops-122-1 update cluster --yes

kops-122-1 rolling-update cluster --yes
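
For anyone reproducing this: before the rolling-update, you can check whether an etcd version is pinned in the rendered cluster spec (if none is set, kops falls back to its default for the Kubernetes version). The cluster name and state store below are placeholders, not the actual values from this setup.

# placeholder names; inspect the etcdClusters section of the rendered cluster spec
kops get cluster --name staging.example.com --state s3://example-kops-state -o yaml | grep -A 6 etcdClusters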

**5. What happened after the commands executed?** When it tried to upgrade the first of the 3 control-plane nodes, it crashed all 3 and we’re unable to communicate with the api server. The error we see in the etcd-main, etcd-cilium, and etcd-events containers is:

W1102 16:02:31.960891 4234 controller.go:163] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:"etcd-a" endpoints:"10.0.0.1:3996" }": rpc error: code = Unknown desc = concurrent prepare in progress "hoevtUMUC44rbjBGIz9p_Q"

**6. What did you expect to happen?** An in-place etcd upgrade from 3.4 to 3.5 without crashing all 3 control-plane nodes.
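
For reference, the error above was taken from the etcd-manager logs on a control-plane node. Something along these lines can be used to look at them over SSH; the log paths are the usual kops defaults, and the user/hostname are placeholders.

# ssh to a control-plane node (user/hostname are placeholders)
ssh ubuntu@<control-plane-node>

# etcd-manager writes per-cluster logs under /var/log on kops control-plane nodes
sudo tail -n 200 /var/log/etcd.log          # main etcd cluster
sudo tail -n 200 /var/log/etcd-events.log   # events etcd cluster
# (a similar log exists for the cilium etcd cluster when it is enabled)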


Most upvoted comments

I updated the title to better reflect the issue. We need some eyes on this because upgrading from 1.21 to 1.22 effectively brings down the cluster.

Thank you @chrism417 for sharing your steps. I basically did the same procedure and got the control plane and apiserver accessible again.

Initial deployment attempt

  • rolling-update of the masters; only one master was updated, it then failed to validate, and cluster apiserver access was lost

Recovery procedure I followed

  • ssh to the updated master and check the logs; everything looked fine other than /var/etcd*.log complaining about the version mismatch
  • rolling-update of the two not-yet-updated masters with --cloudonly

At this point several system-critical pods (weave, ebs-csi-controller, etc.) were still not starting or running properly, but eventually the etcd upgrade did complete and the apiserver was available again. Another rolling update of the masters / control plane fixed the remaining pod startup issues; a rough sketch of the commands is below.
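
For what it’s worth, the above corresponds roughly to the following commands; the cluster name and instance-group names are placeholders, and the exact flags may differ slightly between kops versions.

# inspect the already-updated master first (user/hostname are placeholders)
ssh ubuntu@<updated-master>   # check /var/etcd*.log for the version-mismatch messages

# force-roll the two masters still on the old version, skipping cluster validation
kops rolling-update cluster --name staging.example.com --instance-group master-us-east-1b,master-us-east-1c --cloudonly --yes

# once etcd finishes the 3.4 -> 3.5 upgrade and the apiserver is reachable again, validate
kops validate cluster --name staging.example.com --wait 10m

# a final roll of the control plane cleared the remaining stuck system pods
kops rolling-update cluster --name staging.example.com --instance-group-roles master --yes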

Given that I am not a kubernetes/kops/etcd committer, what is the proper way to recover from this failed upgrade? I have 1 of 3 masters on an integration test cluster upgraded to 1.22, but the api server is not accessible at all, including from the remaining 2 of 3 masters still running 1.21.5.

Should I power ahead with the upgrade – will having all masters on 1.22 resolve this? Or roll back to 1.21 somehow? I’ve never used kops to downgrade and I’m not sure that it’s even a thing.

I would like to report that I observed the same issues and logs from etcd when upgrading from 1.21 to 1.22. I was only able to recover and proceed by killing the entire control plane (effectively following the etcd restore process, sketched roughly below) and then cleaning up the extra master IPs from etcd.
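
For anyone else who ends up in the same place, the etcd restore path referenced above is roughly the following with etcd-manager-ctl; the state-store bucket, cluster name, and backup name are placeholders, and the kops etcd backup/restore docs describe the full procedure.

# list available backups for the main etcd cluster (store path is a placeholder)
etcd-manager-ctl --backup-store=s3://<kops-state-store>/<cluster-name>/backups/etcd/main list-backups

# ask etcd-manager to restore a chosen backup on its next reconciliation
etcd-manager-ctl --backup-store=s3://<kops-state-store>/<cluster-name>/backups/etcd/main restore-backup <backup-name>

# repeat for .../backups/etcd/events (and cilium, if used), then replace the control-plane nodes;
# stale members can afterwards be removed with etcdctl member list / member remove using the etcd client certs on the node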