kops: etcd-manager seems to pick up incorrect etcdv3 version

1. What kops version are you running? The command kops version will display this information.

kops 1.12

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.12.7

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

First created a cluster with etcd 3.1.11. Then, per the upgrade docs, migrated to 3.2.18 and confirmed the etcd pods were indeed running 3.2.18. Rolled the masters fast, following the guide.
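For reference, the migration followed the standard kops flow; the commands below are a sketch rather than the exact ones used, the cluster name and state store are placeholders, and the version field is the one described in the etcd3 migration guide:

```sh
# Sketch of the upgrade flow (placeholder cluster name / state store).
export KOPS_STATE_STORE=s3://example-kops-state-store
export CLUSTER=example.cluster.k8s.local

# In the editor, set `version: "3.2.18"` under both the "main" and "events"
# entries of `etcdClusters` in the cluster spec.
kops edit cluster $CLUSTER

# Apply the change, then roll the masters quickly as the guide describes.
kops update cluster $CLUSTER --yes
kops rolling-update cluster $CLUSTER --instance-group-roles=Master --cloudonly --yes
```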

5. What happened after the commands executed?

The masters did not come online. In /var/log/etcd.log I see entries like this:

I0515 10:04:21.595477    3021 etcdserver.go:553] overriding clientURLs with [http://etcd-a.internal.foo.bar:4001] (state had [http://0.0.0.0:4001])
I0515 10:04:21.595491    3021 etcdserver.go:557] overriding quarantinedClientURLs with [http://etcd-a.internal.foo.bar:3994] (state had [http://0.0.0.0:3994])
W0515 10:04:21.595499    3021 pki.go:46] not generating peer keypair as peers-ca not set
W0515 10:04:21.595504    3021 pki.go:84] not generating client keypair as clients-ca not set
W0515 10:04:21.595527    3021 etcdserver.go:92] error running etcd: unknown etcd version v3.1.11: not found in [/opt/etcd-v3.1.11-linux-amd64]

Again, I confirm that the previously running version was 3.2.18; this is also mentioned in older etcd.log files.

The masters come back online when switching the etcd provider from etcd-manager back to legacy.
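For anyone reproducing this, the workaround amounts to opting the etcd clusters out of etcd-manager in the cluster spec; a sketch, assuming the `provider` field of the kops etcdClusters spec and a placeholder cluster name:

```sh
# Workaround sketch: switch the etcd provider back to Legacy (placeholder cluster name).
kops edit cluster example.cluster.k8s.local
# In the editor, for both the "main" and "events" entries:
#   etcdClusters:
#   - name: main
#     provider: Legacy
#     ...
kops update cluster example.cluster.k8s.local --yes
kops rolling-update cluster example.cluster.k8s.local --yes
```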

6. What did you expect to happen?

I expected whatever component does this detection to recognize that 3.2.18 is in use and to complete the migration to etcd-manager.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 33 (19 by maintainers)

Most upvoted comments

I have the same problem.

  • Kops 1.13.1, cloud-provider AWS
  • Kubernetes 1.12.10
  • etcd upgrade from 2.2.1 to 3.2.4

Interestingly, we upgraded around 10 small pre-production clusters without a single issue. When we got around to upgrading clusters in production, we hit this problem on the first one.

We had EBS snapshots and were able to roll back, and on a second attempt it worked.
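The EBS snapshots are what made the rollback possible, so taking them before touching etcd is worth spelling out; a minimal sketch with the AWS CLI, where the cluster name and tag filters are assumptions about how kops tags its master/etcd volumes:

```sh
# Snapshot the etcd volumes before the upgrade (placeholder cluster name;
# the tag filters are assumptions, adjust to however your volumes are tagged).
for vol in $(aws ec2 describe-volumes \
    --filters "Name=tag:KubernetesCluster,Values=example.cluster.k8s.local" \
              "Name=tag:k8s.io/role/master,Values=1" \
    --query 'Volumes[].VolumeId' --output text); do
  aws ec2 create-snapshot --volume-id "$vol" --description "pre-etcd-upgrade $vol"
done
```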

Answering my own question: once you have upgraded all master nodes simultaneously, all you need to do is the following on each master (a consolidated sketch follows this list):

  1. ssh to the master node
  2. cd into each of the two mounted etcd volumes (one for main, another for events): cd /mnt/master-vol-02e4f7fb71a78b634/
  3. delete the state file, which has the wrong etcd version in it, and the “trash” directory: rm -rf state data-trashcan
  4. reboot the node: reboot
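A consolidated sketch of those steps, assuming the etcd volumes are mounted under /mnt/master-vol-* as in the example path above:

```sh
# On each master: remove etcd-manager's stale state file and the data-trashcan
# directory from both mounted etcd volumes, then reboot so the correct etcd
# version is picked up again.
for mount in /mnt/master-vol-*; do
  sudo rm -rf "$mount/state" "$mount/data-trashcan"
done
sudo reboot
```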

> I have the same problem.
>
>   • Kops 1.13.1, cloud-provider AWS
>   • Kubernetes 1.12.10
>   • etcd upgrade from 2.2.1 to 3.2.4
>
> Interestingly, we upgraded around 10 small pre-production clusters without a single issue. When we got around to upgrading clusters in production, we hit this problem on the first one.
>
> We had EBS snapshots and were able to roll back, and on a second attempt it worked.

Issue

  • K8s api logs
$ sudo tail -f kube-apiserver.log
I0323 22:43:47.796604       1 server.go:681] external host was not specified, using 172.18.64.111
I0323 22:43:47.797870       1 server.go:705] Initializing deserialization cache size based on 0MB limit
I0323 22:43:47.798003       1 server.go:724] Initializing cache sizes based on 0MB limit
I0323 22:43:47.798293       1 server.go:152] Version: v1.12.10
W0323 22:43:48.166697       1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0323 22:43:48.167373       1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook.
I0323 22:43:48.167392       1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
W0323 22:43:48.167954       1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0323 22:43:48.168352       1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook.
I0323 22:43:48.168366       1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
F0323 22:44:08.171850       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry [https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt true true 1000 0xc420b06000 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:4001: connect: connection refused)
  • The k8s apiserver couldn’t talk to etcd. -> err (dial tcp 127.0.0.1:4001: connect: connection refused)
$ curl 127.0.0.1:4001
curl: (7) Failed to connect to 127.0.0.1 port 4001: Connection refused
  • etcd log
$ tail -f etcd.log
    etcdClusterPeerInfo{peer=peer{id:"etcd-c" endpoints:"172.18.68.147:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-c" peer_urls:"https://etcd-c.internal.cluster-kops-1.k8s.devstg.demo.aws:2380" client_urls:"https://etcd-c.internal.cluster-kops-1.k8s.devstg.demo.aws:4001" quarantined_client_urls:"https://etcd-c.internal.cluster-kops-1.k8s.devstg.demo.aws:3994" > }
I0323 22:46:31.854692    3232 controller.go:277] etcd cluster members: map[]
I0323 22:46:31.854704    3232 controller.go:615] sending member map to all peers:
I0323 22:46:31.856202    3232 commands.go:22] not refreshing commands - TTL not hit
I0323 22:46:31.856227    3232 s3fs.go:220] Reading file "s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main/control/etcd-cluster-created"
I0323 22:46:31.888074    3232 controller.go:369] spec member_count:3 etcd_version:"3.3.13"
I0323 22:46:31.888137    3232 commands.go:25] refreshing commands
I0323 22:46:31.909745    3232 vfs.go:104] listed commands in s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main/control: 0 commands
I0323 22:46:31.909779    3232 s3fs.go:220] Reading file "s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main/control/etcd-cluster-spec"
W0323 22:46:31.919561    3232 controller.go:149] unexpected error running etcd cluster reconciliation loop: etcd has 0 members registered; must issue restore-backup command to proceed
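That last warning refers to etcd-manager’s restore-backup command. It was not the fix ultimately used here (see Solution below), but for reference it can be issued with etcd-manager-ctl against the backup store shown in the log; the backup name below is a placeholder:

```sh
# List the backups etcd-manager knows about (store path taken from the log above).
etcd-manager-ctl --backup-store=s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main list-backups

# Queue a restore of a chosen backup (placeholder name); etcd-manager picks the
# command up from the control store on its next reconciliation loop.
etcd-manager-ctl --backup-store=s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main restore-backup 2020-03-23T00:00:00Z-000001
```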

Solution

Same as quoted at the beginning, but with the following versions:

  • kops 1.12.3
  • k8s 1.12.10
  • etcd 3.3.13

👉 on a second attempt it worked ✔️ 🏁

Considerations

  • It’s pretty disappointing that the problem appears to be caused by some kind of timing issue.
  • Our guess is that if one step happens before another it works, and otherwise it doesn’t, so some kind of ordering/dependency check is not being enforced.
  • As stated in the comments above, the problem seems to be with the kops etcd-manager (etcd3 support in kops from version >= 1.12.x).
  • We happen to be on a combination of kops, Kubernetes, and etcd3 versions which is prone to this error.
  • We haven’t seen this behaviour in later releases (>= 1.13.x); most probably it has already been solved.

Ref-Link: https://github.com/kubernetes/kops/blob/master/docs/etcd3-migration.md

CC: @diego-ojeda-binbash @gdmlnx