rancher: Provisioning - etcd node exited in the middle of cluster provisioning, leaving the cluster stuck in the "updating" state forever.
Rancher server version - v2.1.0-rc2
Steps to reproduce the problem: Created a custom cluster with the following node configuration: 2 control plane nodes, 3 etcd nodes, 3 worker nodes.
The nodes were added to the cluster with some interval between them. The cluster gets stuck in the "provisioning" state forever.
[network] Host [18.224.52.227] is not able to connect to the following ports: [172.31.14.214:2379, 172.31.14.214:2380]. Please check network policies and firewall rules
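The connectivity check above can be reproduced by hand. A minimal sketch, assuming a bash shell with /dev/tcp support and coreutils `timeout` on the probing host (the IP is the etcd node from this report; substitute your own):

```shell
# Probe the etcd client (2379) and peer (2380) ports from another node.
# Uses bash's /dev/tcp virtual device; "closed" here means the TCP
# connection could not be established within the timeout.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "</dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open ${host}:${port}"
  else
    echo "closed ${host}:${port}"
  fi
}

for port in 2379 2380; do
  check_port 172.31.14.214 "$port"
done
```

In this issue both ports report closed not because of a firewall rule, but because the etcd container itself has exited, as shown below.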
The etcd container on the etcd node that is in "active" state is in a stopped state.
ubuntu@ip-172-31-14-214:~/.ssh$ docker ps -a | grep etcd
a1a4ca1ecc5d rancher/coreos-etcd:v3.2.18 "/usr/local/bin/et..." 34 minutes ago Exited (137) 30 minutes ago etcd
Etcd container logs:
2018-09-18 21:18:55.381036 I | etcdmain: etcd Version: 3.2.18
2018-09-18 21:18:55.381104 I | etcdmain: Git SHA: eddf599c6
2018-09-18 21:18:55.381109 I | etcdmain: Go Version: go1.8.7
2018-09-18 21:18:55.381115 I | etcdmain: Go OS/Arch: linux/amd64
2018-09-18 21:18:55.381120 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2018-09-18 21:18:55.381223 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-172-31-14-214.pem, key = /etc/kubernetes/ssl/kube-etcd-172-31-14-214-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2018-09-18 21:18:55.381845 I | embed: listening for peers on https://172.31.14.214:2380
2018-09-18 21:18:55.381916 I | embed: listening for client requests on 172.31.14.214:2379
2018-09-18 21:18:55.385099 I | etcdserver: name = etcd-ip-172-31-14-214
2018-09-18 21:18:55.385116 I | etcdserver: data dir = /var/lib/rancher/etcd/
2018-09-18 21:18:55.385122 I | etcdserver: member dir = /var/lib/rancher/etcd/member
2018-09-18 21:18:55.385126 I | etcdserver: heartbeat = 500ms
2018-09-18 21:18:55.385129 I | etcdserver: election = 5000ms
2018-09-18 21:18:55.385136 I | etcdserver: snapshot count = 100000
2018-09-18 21:18:55.385164 I | etcdserver: advertise client URLs = https://172.31.14.214:2379,https://172.31.14.214:4001
2018-09-18 21:18:55.385177 I | etcdserver: initial advertise peer URLs = https://172.31.14.214:2380
2018-09-18 21:18:55.385184 I | etcdserver: initial cluster = etcd-ip-172-31-14-214=https://172.31.14.214:2380
2018-09-18 21:18:55.388226 I | etcdserver: starting member a0531b4f9f600059 in cluster 17df64ae9ad121c3
2018-09-18 21:18:55.388259 I | raft: a0531b4f9f600059 became follower at term 0
2018-09-18 21:18:55.388283 I | raft: newRaft a0531b4f9f600059 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-09-18 21:18:55.388290 I | raft: a0531b4f9f600059 became follower at term 1
2018-09-18 21:18:55.394667 W | auth: simple token is not cryptographically signed
2018-09-18 21:18:55.397742 I | etcdserver: starting server... [version: 3.2.18, cluster version: to_be_decided]
2018-09-18 21:18:55.398510 I | etcdserver: a0531b4f9f600059 as single-node; fast-forwarding 9 ticks (election ticks 10)
2018-09-18 21:18:55.398688 I | etcdserver/membership: added member a0531b4f9f600059 [https://172.31.14.214:2380] to cluster 17df64ae9ad121c3
2018-09-18 21:18:55.398821 I | embed: ClientTLS: cert = /etc/kubernetes/ssl/kube-etcd-172-31-14-214.pem, key = /etc/kubernetes/ssl/kube-etcd-172-31-14-214-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2018-09-18 21:19:00.388562 I | raft: a0531b4f9f600059 is starting a new election at term 1
2018-09-18 21:19:00.388632 I | raft: a0531b4f9f600059 became candidate at term 2
2018-09-18 21:19:00.388649 I | raft: a0531b4f9f600059 received MsgVoteResp from a0531b4f9f600059 at term 2
2018-09-18 21:19:00.388741 I | raft: a0531b4f9f600059 became leader at term 2
2018-09-18 21:19:00.388757 I | raft: raft.node: a0531b4f9f600059 elected leader a0531b4f9f600059 at term 2
2018-09-18 21:19:00.389351 I | etcdserver: published {Name:etcd-ip-172-31-14-214 ClientURLs:[https://172.31.14.214:2379 https://172.31.14.214:4001]} to cluster 17df64ae9ad121c3
2018-09-18 21:19:00.389530 I | embed: ready to serve client requests
2018-09-18 21:19:00.389814 I | etcdserver: setting up the initial cluster version to 3.2
2018-09-18 21:19:00.389924 I | embed: serving client requests on 172.31.14.214:2379
2018-09-18 21:19:00.390468 N | etcdserver/membership: set the initial cluster version to 3.2
2018-09-18 21:19:00.390555 I | etcdserver/api: enabled capabilities for version 3.2
2018-09-18 21:19:57.409032 W | wal: sync duration of 1.022211112s, expected less than 1s
2018-09-18 21:19:58.138730 W | etcdserver: apply entries took too long [663.319249ms for 2 entries]
2018-09-18 21:19:58.138824 W | etcdserver: avoid queries with large range/delete range!
2018-09-18 21:21:46.115779 I | etcdmain: rejected connection from "172.31.10.255:37195" (error "EOF", ServerName "")
2018-09-18 21:21:46.123413 I | etcdmain: rejected connection from "172.31.14.214:41592" (error "EOF", ServerName "")
2018-09-18 21:21:53.415960 I | etcdserver/membership: added member ac5cf115267ce64f [https://172.31.10.255:2380] to cluster 17df64ae9ad121c3
2018-09-18 21:21:53.415989 I | rafthttp: starting peer ac5cf115267ce64f...
2018-09-18 21:21:53.416004 I | rafthttp: started HTTP pipelining with peer ac5cf115267ce64f
2018-09-18 21:21:53.416394 I | rafthttp: started streaming with peer ac5cf115267ce64f (writer)
2018-09-18 21:21:53.416643 I | rafthttp: started streaming with peer ac5cf115267ce64f (writer)
2018-09-18 21:21:53.416881 I | rafthttp: started peer ac5cf115267ce64f
2018-09-18 21:21:53.416902 I | rafthttp: added peer ac5cf115267ce64f
2018-09-18 21:21:53.416968 I | rafthttp: started streaming with peer ac5cf115267ce64f (stream Message reader)
2018-09-18 21:21:53.417073 I | rafthttp: started streaming with peer ac5cf115267ce64f (stream MsgApp v2 reader)
2018-09-18 21:21:55.391807 I | rafthttp: peer ac5cf115267ce64f became active
2018-09-18 21:21:55.391830 I | rafthttp: established a TCP streaming connection with peer ac5cf115267ce64f (stream Message writer)
2018-09-18 21:21:55.392179 I | rafthttp: established a TCP streaming connection with peer ac5cf115267ce64f (stream MsgApp v2 writer)
2018-09-18 21:21:55.451090 I | rafthttp: established a TCP streaming connection with peer ac5cf115267ce64f (stream MsgApp v2 reader)
2018-09-18 21:21:55.452205 I | rafthttp: established a TCP streaming connection with peer ac5cf115267ce64f (stream Message reader)
2018-09-18 21:21:56.027556 N | pkg/osutil: received terminated signal, shutting down...
WARNING: 2018/09/18 21:21:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 172.31.14.214:2379: getsockopt: connection refused"; Reconnecting to {172.31.14.214:2379 0 <nil>}
WARNING: 2018/09/18 21:21:57 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 172.31.14.214:2379: getsockopt: connection refused"; Reconnecting to {172.31.14.214:2379 0 <nil>}
WARNING: 2018/09/18 21:21:58 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 172.31.14.214:2379: getsockopt: connection refused"; Reconnecting to {172.31.14.214:2379 0 <nil>}
WARNING: 2018/09/18 21:22:01 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 172.31.14.214:2379: getsockopt: connection refused"; Reconnecting to {172.31.14.214:2379 0 <nil>}
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 18 (10 by maintainers)
@alena1108 @sangeethah I think I found the reason etcd is broken. We follow these steps to add a member:
etcdctl member add
Sometimes, if the new member is started first, it needs time to catch up with the cluster. If the new node starts and the other nodes are restarted quickly afterwards, it breaks the cluster and causes a split brain. I put in a fix that changes the order in which etcd nodes are reloaded, so that the new node is restarted last, giving it a chance to catch up with the cluster.
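The restart-ordering fix described above can be sketched as follows. This is an illustrative outline, not Rancher's actual provisioning code; the function and node names are hypothetical:

```shell
# Sketch of the fix: when reloading etcd nodes after "etcdctl member add",
# restart the pre-existing members first and the newly added member last,
# so the new member gets a chance to catch up before its peers cycle.
reload_etcd_nodes() {
  local new_node=$1; shift
  local node

  for node in "$@"; do
    [ "$node" = "$new_node" ] && continue  # defer the new member
    echo "restarting $node"                # placeholder for the real restart
  done

  echo "restarting $new_node"              # new member is restarted last
}

# Example: etcd-new was just added alongside existing members etcd-1, etcd-2.
reload_etcd_nodes etcd-new etcd-1 etcd-2 etcd-new
```

Restarting the new member last matters because a freshly added member starts with an empty log and must replicate the cluster state; if its peers restart while it is still syncing, the cluster can lose quorum.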