k3s: Unable to add a node after removing a failed node using embedded etcd

Environmental Info: K3s Version: k3s version v1.19.5+k3s1 (b11612e2)

Node(s) CPU architecture, OS, and Version: Linux master-02 5.4.0-56-generic #62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: Originally 3 masters; one died and was removed, so the cluster is now running 2 masters, but a 3rd master cannot be re-added.

Describe the bug: When trying to add the 3rd master, I get the following error:

Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.903306802Z" level=info msg="Adding https://192.168.0.13:2380 to etcd cluster [master-01-9d17c397=https://192.168.0.11:2380 master-03-e2ec81cc=https://192.168.0.13:2380 master-02-8a1215a5=https://192.168.0.12:2380]"
Dec 21 15:00:45 master-03 k3s[1952]: {"level":"warn","ts":"2020-12-21T15:00:45.905Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-247cfa63-67b2-477d-8202-8891774f3d4f/192.168.0.11:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.906496636Z" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"

Strangely, it looks like the cluster is still trying to communicate with the dead node, even though it no longer appears under kubectl get node. Presumably the failed node's member entry was never removed from etcd when the Kubernetes node was deleted, and with that stale member unreachable, etcd refuses to add a new member and reports the cluster as unhealthy:

Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.711Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.712Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://192.168.0.13:2380/version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
root@master-02:/home/jrote1# kubectl get node
NAME        STATUS   ROLES         AGE     VERSION
master-01   Ready    etcd,master   22d     v1.19.5+k3s1
master-02   Ready    etcd,master   6d19h   v1.19.5+k3s1
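
To confirm the mismatch, compare the Kubernetes view above with etcd's own member list. A minimal sketch, assuming an etcdctl v3 binary is available on one of the surviving servers (K3s does not bundle one) and using the K3s-managed certificate paths:

ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list

If master-03-e2ec81cc still appears in this list while kubectl get node shows only two nodes, the stale etcd membership is what is blocking the join.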

Steps To Reproduce:

  • Install K3s
  • Add 3 master nodes
  • Master 3 dies, so remove it from the cluster's nodes (see the sketch after this list)
  • Attempt to re-add master 3
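
The removal in step 3 was presumably a plain Kubernetes node delete, which on this K3s version does not remove the matching etcd member; a sketch of the sequence, assuming kubectl was used:

# Deleting the Node object removes master-03 from `kubectl get node`...
kubectl delete node master-03
# ...but the surviving servers' etcd member list still contains master-03,
# so a rebuilt master-03 is rejected with "etcdserver: unhealthy cluster".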

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 22 (14 by maintainers)

Most upvoted comments

I have the same issue but in a different environment:

  • arch: arm64
  • version: v1.19.4+k3s

I used the rancher/coreos-etcd image from Docker Hub: https://hub.docker.com/r/rancher/coreos-etcd/tags?page=1&ordering=last_updated

The latest tag for my arch is currently v3.4.13-arm64.

I added a nodeSelector rule so the pod is deployed on a working node with the etcd role:

kubectl run --rm --tty --stdin --image docker.io/rancher/coreos-etcd:v3.4.13-arm64 etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/sh"],"image":"docker.io/rancher/coreos-etcd:v3.4.13-arm64","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}],"nodeSelector":{"node-role.kubernetes.io/etcd":"true"}}}'

You can list the etcd members:

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list
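
The output looks roughly like the following (the master-03 ID comes from the prober warnings above; the other two IDs are illustrative placeholders):

<master-01-id>, started, master-01-9d17c397, https://192.168.0.11:2380, https://192.168.0.11:2379
<master-02-id>, started, master-02-8a1215a5, https://192.168.0.12:2380, https://192.168.0.12:2379
787a9a69bcefecea, started, master-03-e2ec81cc, https://192.168.0.13:2380, https://192.168.0.13:2379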

You can then remove the failed member by its ID (the first column in the list above):

etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove 1234567890ABCDEF
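
After the removal, re-running member list should show only the two healthy members, and the "failed to reach the peer URL" warnings on the surviving servers should stop.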

Uninstall k3s on the failed node:

k3s-uninstall.sh

and add it to your cluster again. It works!
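
For reference, the re-join can be done with the standard install script; a sketch, assuming the default install method (the token is in /var/lib/rancher/k3s/server/node-token on an existing server, and the --server address is any surviving master):

# Re-join the cleaned node to the cluster as a server (control-plane/etcd) node.
curl -sfL https://get.k3s.io | K3S_TOKEN=<cluster-token> sh -s - server --server https://192.168.0.11:6443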

As a temporary workaround, you should be able to do the following on one of the working nodes:

kubectl run --rm --tty --stdin --image docker.io/bitnami/etcd:latest etcdctl --overrides='{"apiVersion":"v1","kind":"Pod","spec":{"hostNetwork":true,"restartPolicy":"Never","securityContext":{"runAsUser":0,"runAsGroup":0},"containers":[{"command":["/bin/bash"],"image":"docker.io/bitnami/etcd:latest","name":"etcdctl","stdin":true,"stdinOnce":true,"tty":true,"volumeMounts":[{"mountPath":"/var/lib/rancher","name":"var-lib-rancher"}]}],"volumes":[{"name":"var-lib-rancher","hostPath":{"path":"/var/lib/rancher","type":"Directory"}}]}}'

In the resulting shell, run member list to find the ID of the failed member, then remove it by ID (member remove takes the hex member ID, not the node name; in the logs above, the failed master-03 member's ID is 787a9a69bcefecea):

./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member list
./bin/etcdctl --key /var/lib/rancher/k3s/server/tls/etcd/client.key --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt member remove 787a9a69bcefecea

Waiting on another PR to land and then I’m going to rework etcd cluster membership cleanup.