kops: Upgrading from 1.21.X to 1.22.2 crashes api-server/etcd

/kind bug

**1. What kops version are you running?** 1.22.1

**2. What Kubernetes version are you running?** 1.21.4

**3. What cloud provider are you using?** AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**

kops-122-1 toolbox template --template cluster.tmpl.yaml --values common-values.yaml --values staging-values.yaml --snippets ./snippets > staging-cluster.yaml

kops-122-1 replace -f staging-cluster.yaml

kops-122-1 update cluster --yes

kops-122-1 rolling-update cluster --yes
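
For anyone reproducing this: before the rolling-update, you can check whether an etcd version is pinned in the rendered cluster spec (if none is set, kops falls back to its default for the Kubernetes version). The cluster name and state store below are placeholders, not the actual values from this setup.

# placeholder names; inspect the etcdClusters section of the rendered cluster spec
kops get cluster --name staging.example.com --state s3://example-kops-state -o yaml | grep -A 6 etcdClusters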

**5. What happened after the commands executed?** When it tried to upgrade the first of the 3 control-plane nodes, it crashed all 3 and we’re unable to communicate with the api server. The error we see in the etcd-main, etcd-cilium, and etcd-events containers is:

W1102 16:02:31.960891 4234 controller.go:163] unexpected error running etcd cluster reconciliation loop: error from JoinClusterRequest (prepare) from peer "peer{id:"etcd-a" endpoints:"10.0.0.1:3996" }": rpc error: code = Unknown desc = concurrent prepare in progress "hoevtUMUC44rbjBGIz9p_Q"

**6. What did you expect to happen?** An in-place etcd upgrade from 3.4 to 3.5 without crashing all 3 control-plane nodes.
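
For reference, the error above was taken from the etcd-manager logs on a control-plane node. Something along these lines can be used to look at them over SSH; the log paths are the usual kops defaults, and the user/hostname are placeholders.

# ssh to a control-plane node (user/hostname are placeholders)
ssh ubuntu@<control-plane-node>

# etcd-manager writes per-cluster logs under /var/log on kops control-plane nodes
sudo tail -n 200 /var/log/etcd.log          # main etcd cluster
sudo tail -n 200 /var/log/etcd-events.log   # events etcd cluster
# (a similar log exists for the cilium etcd cluster when it is enabled)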


Most upvoted comments

I updated the title to better reflect the issue. We need some eyes on this because upgrading from 1.21 to 1.22 effectively brings down the cluster.

Thank you @chrism417 for sharing your steps. I basically did the same procedure and got the control plane and apiserver accessible again.

Initial deployment attempt

  • rolling-update of the masters; only one master was updated, it then failed to validate, and cluster apiserver access was lost

Recovery procedure I followed

  • ssh to the updated master and check the logs; everything looked fine other than /var/etcd*.log complaining about the version mismatch
  • rolling-update of the two not-yet-updated masters with --cloudonly

At this point several system-critical pods (weave, ebs-csi-controller, etc.) were still not starting or running properly, but eventually the etcd upgrade did complete and the apiserver was available again. Another rolling update of the masters / control plane fixed the remaining pod startup issues; a rough sketch of the commands is below.
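
For what it’s worth, the above corresponds roughly to the following commands; the cluster name and instance-group names are placeholders, and the exact flags may differ slightly between kops versions.

# inspect the already-updated master first (user/hostname are placeholders)
ssh ubuntu@<updated-master>   # check /var/etcd*.log for the version-mismatch messages

# force-roll the two masters still on the old version, skipping cluster validation
kops rolling-update cluster --name staging.example.com --instance-group master-us-east-1b,master-us-east-1c --cloudonly --yes

# once etcd finishes the 3.4 -> 3.5 upgrade and the apiserver is reachable again, validate
kops validate cluster --name staging.example.com --wait 10m

# a final roll of the control plane cleared the remaining stuck system pods
kops rolling-update cluster --name staging.example.com --instance-group-roles master --yes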

Given that I am not a kubernetes/kops/etcd committer, what is the proper way to recover from this failed upgrade? I have 1 of 3 masters on an integration test cluster upgraded to 1.22, but the api server is not accessible at all, including from the remaining 2 of 3 masters still running 1.21.5.

Should I power ahead with the upgrade – will having all masters on 1.22 resolve this? Or roll back to 1.21 somehow? I’ve never used kops to downgrade and I’m not sure that it’s even a thing.

I would like to report that I observed the same issues and logs from etcd when upgrading from 1.21 to 1.22. I was only able to recover and proceed by killing the entire control plane (effectively following the etcd restore process, sketched roughly below) and then cleaning up the extra master IPs from etcd.
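
For anyone else who ends up in the same place, the etcd restore path referenced above is roughly the following with etcd-manager-ctl; the state-store bucket, cluster name, and backup name are placeholders, and the kops etcd backup/restore docs describe the full procedure.

# list available backups for the main etcd cluster (store path is a placeholder)
etcd-manager-ctl --backup-store=s3://<kops-state-store>/<cluster-name>/backups/etcd/main list-backups

# ask etcd-manager to restore a chosen backup on its next reconciliation
etcd-manager-ctl --backup-store=s3://<kops-state-store>/<cluster-name>/backups/etcd/main restore-backup <backup-name>

# repeat for .../backups/etcd/events (and cilium, if used), then replace the control-plane nodes;
# stale members can afterwards be removed with etcdctl member list / member remove using the etcd client certs on the node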