k3s: Unable to add a node after removing a failed node using embedded etcd
Environmental Info: K3s Version: k3s version v1.19.5+k3s1 (b11612e2)
Node(s) CPU architecture, OS, and Version: Linux master-02 5.4.0-56-generic #62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: Originally 3 masters; one died and was removed, so the cluster is now running 2 masters, but a 3rd master cannot be re-added.
Describe the bug: When trying to add the 3rd master, I get the following error:
Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.903306802Z" level=info msg="Adding https://192.168.0.13:2380 to etcd cluster [master-01-9d17c397=https://192.168.0.11:2380 master-03-e2ec81cc=https://192.168.0.13:2380 master-02-8a1215a5=https://192.168.0.12:2380]"
Dec 21 15:00:45 master-03 k3s[1952]: {"level":"warn","ts":"2020-12-21T15:00:45.905Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-247cfa63-67b2-477d-8202-8891774f3d4f/192.168.0.11:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Dec 21 15:00:45 master-03 k3s[1952]: time="2020-12-21T15:00:45.906496636Z" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"
Strangely, it looks like the cluster is still trying to communicate with the dead node, even though it no longer appears under the nodes:
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.711Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.712Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"787a9a69bcefecea","rtt":"0s","error":"dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://192.168.0.13:2380/version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
Dec 21 15:09:45 master-02 k3s[2448291]: {"level":"warn","ts":"2020-12-21T15:09:45.727Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"787a9a69bcefecea","error":"Get \"https://192.168.0.13:2380/version\": dial tcp 192.168.0.13:2380: connect: connection refused"}
root@master-02:/home/jrote1# kubectl get node
NAME        STATUS   ROLES         AGE     VERSION
master-01   Ready    etcd,master   22d     v1.19.5+k3s1
master-02   Ready    etcd,master   6d19h   v1.19.5+k3s1
Steps To Reproduce:
- Installed K3s
- Added 3 master nodes
- Master 3 died, so removed it from the cluster's nodes (see the note after this list)
- Attempted to re-add master 3
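Note: the removal in the third step was presumably done with kubectl (the exact command isn't shown in the report). In this version, deleting the Node object does not appear to remove the corresponding member from the embedded etcd cluster, which would explain why the remaining servers keep probing 192.168.0.13 and why the join fails with "etcdserver: unhealthy cluster".

```sh
# Hypothetical removal step (not shown in the original report): this only
# deletes the Kubernetes Node object; the stale etcd member for master-03
# stays in the embedded etcd cluster.
kubectl delete node master-03
```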
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 22 (14 by maintainers)
I have the same issue but in a different environment:
I used the rancher/coreos-etcd image from Docker Hub: https://hub.docker.com/r/rancher/coreos-etcd/tags?page=1&ordering=last_updated
The latest tag for my arch is currently v3.4.13-arm64.
I added a nodeSelector rule so the pod is deployed on a working node with the etcd role.
you can list etcd members:
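(The original snippet isn't preserved in this extract; below is a sketch that assumes etcdctl is available on a server node and that the default k3s etcd client cert paths are in place.)

```sh
# Sketch: list the members of the embedded etcd cluster (default k3s paths assumed).
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert   /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key    /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list
```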
you can remove the failed member:
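(Again a sketch; the member ID comes from the `member list` output, e.g. 787a9a69bcefecea for the unreachable peer in the logs above.)

```sh
# Sketch: remove the stale member so the cluster stops probing the dead peer.
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert   /var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key    /var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member remove 787a9a69bcefecea
```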
uninstall k3s on the failed node:
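(On the node being rebuilt; the script below is the uninstall script created by the standard k3s install script.)

```sh
# Remove the old k3s installation and its data from the failed node.
/usr/local/bin/k3s-uninstall.sh
```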
and add it back into your cluster. It works!
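(A sketch of the re-join, following the documented install-script flow for an additional embedded-etcd server; the token is a placeholder and 192.168.0.11 is master-01 from the logs above.)

```sh
# Re-join the rebuilt node as an additional server in the embedded etcd cluster.
curl -sfL https://get.k3s.io | K3S_TOKEN=<cluster-token> sh -s - server \
  --server https://192.168.0.11:6443
```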
As a temporary workaround, you should be able to do the following on one of the working nodes:
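(The exact commands from this comment aren't preserved in the extract. A rough sketch using the rancher/coreos-etcd image mentioned above, run via the ctr CLI that ships with k3s, with the k3s etcd certs bind-mounted; the image tag/arch, the presence of a shell in the image, and the paths are assumptions.)

```sh
# Sketch: start a throwaway container with etcdctl on the host network,
# mounting the k3s etcd TLS directory read-only at /certs.
ctr image pull docker.io/rancher/coreos-etcd:v3.4.13-amd64
ctr run --rm -t --net-host \
  --mount type=bind,src=/var/lib/rancher/k3s/server/tls/etcd,dst=/certs,options=rbind:ro \
  docker.io/rancher/coreos-etcd:v3.4.13-amd64 etcd-utility sh
```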
In the resulting shell, run the following command:
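(Also a sketch; the member ID is the one reported for the dead peer by `member list`, e.g. 787a9a69bcefecea in the logs above.)

```sh
# Sketch: inside the container, remove the stale member using the mounted certs.
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /certs/server-ca.crt \
  --cert   /certs/server-client.crt \
  --key    /certs/server-client.key \
  member remove 787a9a69bcefecea
```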
Waiting on another PR to land and then I’m going to rework etcd cluster membership cleanup.