etcd: Unable to recover etcd cluster after loss of nodes
I am running the CoreOS alpha channel (349.0) on EC2 (non-VPC) with no modifications to the CloudFormation cloud-config section for etcd. I updated my CloudFormation stack, which requires the EC2 instances to be terminated and recreated; the update is performed as a rolling update with MinInstancesInService=3, so it should keep the etcd cluster intact.
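For context, the relevant part of the stack is the auto scaling group's UpdatePolicy, roughly as below; MinInstancesInService is the value I actually use, while the other fields are paraphrased from memory and may differ:

"UpdatePolicy": {
  "AutoScalingRollingUpdate": {
    "MinInstancesInService": "3",
    "MaxBatchSize": "1",
    "PauseTime": "PT5M"
  }
}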
However, after the CloudFormation stack update, the etcd cluster was left in a bad state. One node (call it A) at least appears to have the snapshot:
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.079 INFO | 3195460fb2fe42bd92439e261c84b859 attempted to join via 10.220.1.118:7001 failed: fail checking join version: Client Internal Error (Get http://10.220.1.118:7001/version: dial tcp 10.220.1.118:7001: i/o timeout)
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.452 INFO | Send Join Request to http://10.41.9.209:7001/join
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.454 INFO | 3195460fb2fe42bd92439e261c84b859 attempted to join via 10.41.9.209:7001 failed: fail on join request: Raft Internal Error ()
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.470 WARNING | 3195460fb2fe42bd92439e261c84b859 cannot connect to previous cluster [10.238.41.24:7001 10.41.3.254:7001 10.220.27.188:7001 10.41.7.202:7001 10.41.3.254:7001 10.41.3.28:7001 10.41.11.168:7001 10.220.27.188:7001 10.220.145.240:7001 10.41.3.250:7001 10.41.3.28:7001 10.217.200.55:7001 10.238.39.228:7001 10.238.12.241:7001 10.238.41.24:7001 10.231.156.16:7001 10.220.146.186:7001 10.220.1.118:7001 10.41.9.209:7001]: fail joining the cluster via given peers after 1 retries
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | etcd server [name 3195460fb2fe42bd92439e261c84b859, listen on :4001, advertised url http://10.234.59.141:4001]
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | peer server [name 3195460fb2fe42bd92439e261c84b859, listen on :7001, advertised url http://10.234.59.141:7001]
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | 3195460fb2fe42bd92439e261c84b859 starting in peer mode
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | 3195460fb2fe42bd92439e261c84b859: state changed from 'initialized' to 'follower'.
Jun 20 01:00:18 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:18.029 INFO | 3195460fb2fe42bd92439e261c84b859: state changed from 'follower' to 'candidate'.
Jun 20 01:00:19 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:19.472 INFO | 3195460fb2fe42bd92439e261c84b859: snapshot of 179979 events at index 179979 completed
But the other two nodes (call them B and C) continually fail with 'Raft Internal Error ()' when attempting to join the cluster formed by A:
Jun 20 01:27:36 ip-10-41-9-209 etcd[1912]: [etcd] Jun 20 01:27:36.097 INFO | 39e1d196a13d42dd8cccfb8d3cc3040f attempted to join via 10.234.59.141:7001 failed: fail on join request: Raft Internal Error ()
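A quick way to rule out plain connectivity problems is to hit A's peer and client ports directly from B or C (the join path fetches /version on the peer port, as in the earlier timeout); the IP and ports below are the ones from the logs above:

# From B or C: check A's peer port, which the join request goes through
curl http://10.234.59.141:7001/version

# And A's client port
curl http://10.234.59.141:4001/version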
@philips walked me through a recovery process that was unsuccessful and asked me to post this issue here. The recovery process consisted of:
1. Finding previous machine IDs by running curl 127.0.0.1:4001/v2/keys/_etcd/machines on A
2. On B and C, wiping the /var/lib/etcd dirs and changing their etcd -name= to machine IDs that existed in the machines JSON obtained in (1), by editing the file /var/run/systemd/system/etcd.service.d/20-cloudinit.conf
3. Restarting B and C (a shell sketch of these steps follows below)
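For anyone following along, this is roughly what those steps look like as commands; the drop-in path and ports are the ones from my setup, and the -name value is whichever old ID you pick from the machines JSON, so treat this as a sketch rather than an endorsed procedure:

# Step 1, on A: list the machine IDs the previous cluster knew about
curl 127.0.0.1:4001/v2/keys/_etcd/machines

# Steps 2 and 3, on B and C: wipe the data dir, reuse an old ID, restart
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd/*
# edit the drop-in so the etcd invocation passes -name=<old machine id>
sudo vi /var/run/systemd/system/etcd.service.d/20-cloudinit.conf
sudo systemctl daemon-reload
sudo systemctl restart etcd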
After performing these steps and verifying that the other two nodes' logs show the changed names, one of them fails with the same 'Raft Internal Error ()' and the other fails with '<machine id I changed it to> is not allowed to join the cluster with existing URL http://10.41.9.209:7001' (its own IP). Node A did not emit any more log lines during this whole process.
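For what it's worth, that 'existing URL' message suggests A's registry still maps the reused machine ID to the old peer URL. Assuming the standard v2 keys API, that mapping can be inspected via the same internal key as in step (1), and the drop-in can be double-checked at the same time:

# On A: show the peer URL registered for each machine ID
curl -L 'http://127.0.0.1:4001/v2/keys/_etcd/machines?recursive=true'

# On B/C: confirm which -name the rewritten drop-in actually passes to etcd
grep -- '-name' /var/run/systemd/system/etcd.service.d/20-cloudinit.conf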
The data dir on node A contains various development credentials. I can revoke most of them but some of them I’m not sure about. Folks from CoreOS or other maintainers can email me at sqs@sourcegraph.com if you want me to send it to you (but I don’t want to post it on the web).
About this issue
- State: closed
- Created 10 years ago
- Comments: 41 (10 by maintainers)
+1, I am having similar issues.
I run a cluster of 3 etcd servers. I want to update the machine image backing them (we are in AWS, so a new AMI) so I intend to do a rolling restart of the cluster. But eventually I always end up in a situation where some number of nodes are serving client requests, but refusing to let new cluster members join:
Soon thereafter the existing nodes started to have serious troubles too: