etcd: Unable to recover etcd cluster after loss of nodes
I am running the CoreOS alpha channel (349.0) on EC2 (non-VPC) with no modifications to the CloudFormation cloud-config section for etcd. I updated my CloudFormation stack, which requires the EC2 instances to be terminated and recreated; the update is performed as a rolling update with MinInstancesInService=3, so it should keep the etcd cluster intact.
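For context, the relevant part of the stack is the auto scaling group's UpdatePolicy, roughly as below; MinInstancesInService is the value I actually use, while the other fields are paraphrased from memory and may differ:

"UpdatePolicy": {
  "AutoScalingRollingUpdate": {
    "MinInstancesInService": "3",
    "MaxBatchSize": "1",
    "PauseTime": "PT5M"
  }
}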
However, after the CloudFormation stack update, the etcd cluster was left in a bad state. One node (call it A) at least appears to have the snapshot:
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.079 INFO | 3195460fb2fe42bd92439e261c84b859 attempted to join via 10.220.1.118:7001 failed: fail checking join version: Client Internal Error (Get http://10.220.1.118:7001/version: dial tcp 10.220.1.118:7001: i/o timeout)
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.452 INFO | Send Join Request to http://10.41.9.209:7001/join
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.454 INFO | 3195460fb2fe42bd92439e261c84b859 attempted to join via 10.41.9.209:7001 failed: fail on join request: Raft Internal Error ()
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.470 WARNING | 3195460fb2fe42bd92439e261c84b859 cannot connect to previous cluster [10.238.41.24:7001 10.41.3.254:7001 10.220.27.188:7001 10.41.7.202:7001 10.41.3.254:7001 10.41.3.28:7001 10.41.11.168:7001 10.220.27.188:7001 10.220.145.240:7001 10.41.3.250:7001 10.41.3.28:7001 10.217.200.55:7001 10.238.39.228:7001 10.238.12.241:7001 10.238.41.24:7001 10.231.156.16:7001 10.220.146.186:7001 10.220.1.118:7001 10.41.9.209:7001]: fail joining the cluster via given peers after 1 retries
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | etcd server [name 3195460fb2fe42bd92439e261c84b859, listen on :4001, advertised url http://10.234.59.141:4001]
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | peer server [name 3195460fb2fe42bd92439e261c84b859, listen on :7001, advertised url http://10.234.59.141:7001]
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | 3195460fb2fe42bd92439e261c84b859 starting in peer mode
Jun 20 01:00:16 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:16.471 INFO | 3195460fb2fe42bd92439e261c84b859: state changed from 'initialized' to 'follower'.
Jun 20 01:00:18 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:18.029 INFO | 3195460fb2fe42bd92439e261c84b859: state changed from 'follower' to 'candidate'.
Jun 20 01:00:19 ip-10-234-59-141 etcd[568]: [etcd] Jun 20 01:00:19.472 INFO | 3195460fb2fe42bd92439e261c84b859: snapshot of 179979 events at index 179979 completed
But the other two nodes (call them B and C) continually fail with 'Raft Internal Error ()' when attempting to join the cluster formed by A:
Jun 20 01:27:36 ip-10-41-9-209 etcd[1912]: [etcd] Jun 20 01:27:36.097 INFO | 39e1d196a13d42dd8cccfb8d3cc3040f attempted to join via 10.234.59.141:7001 failed: fail on join request: Raft Internal Error ()
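A quick way to rule out plain connectivity problems is to hit A's peer and client ports directly from B or C (the join path fetches /version on the peer port, as in the earlier timeout); the IP and ports below are the ones from the logs above:

# From B or C: check A's peer port, which the join request goes through
curl http://10.234.59.141:7001/version

# And A's client port
curl http://10.234.59.141:4001/version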
@philips walked me through a recovery process that was unsuccessful and asked me to post this issue here. The recovery process consisted of:
1. Finding previous machine IDs by running curl 127.0.0.1:4001/v2/keys/_etcd/machines on A
2. On B and C, wiping the /var/lib/etcd dirs and changing their etcd -name= to machine IDs that existed in the machines JSON obtained in (1), by editing the file /var/run/systemd/system/etcd.service.d/20-cloudinit.conf
3. Restarting B and C (a shell sketch of these steps follows below)
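For anyone following along, this is roughly what those steps look like as commands; the drop-in path and ports are the ones from my setup, and the -name value is whichever old ID you pick from the machines JSON, so treat this as a sketch rather than an endorsed procedure:

# Step 1, on A: list the machine IDs the previous cluster knew about
curl 127.0.0.1:4001/v2/keys/_etcd/machines

# Steps 2 and 3, on B and C: wipe the data dir, reuse an old ID, restart
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd/*
# edit the drop-in so the etcd invocation passes -name=<old machine id>
sudo vi /var/run/systemd/system/etcd.service.d/20-cloudinit.conf
sudo systemctl daemon-reload
sudo systemctl restart etcd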
After performing these steps and verifying that the other two nodes' logs show the changed names, one of them fails with the same 'Raft Internal Error ()' and the other fails with '<machine id I changed it to> is not allowed to join the cluster with existing URL http://10.41.9.209:7001' (its own IP). Node A did not emit any more log lines during this whole process.
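For what it's worth, that 'existing URL' message suggests A's registry still maps the reused machine ID to the old peer URL. Assuming the standard v2 keys API, that mapping can be inspected via the same internal key as in step (1), and the drop-in can be double-checked at the same time:

# On A: show the peer URL registered for each machine ID
curl -L 'http://127.0.0.1:4001/v2/keys/_etcd/machines?recursive=true'

# On B/C: confirm which -name the rewritten drop-in actually passes to etcd
grep -- '-name' /var/run/systemd/system/etcd.service.d/20-cloudinit.conf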
The data dir on node A contains various development credentials. I can revoke most of them but some of them I’m not sure about. Folks from CoreOS or other maintainers can email me at sqs@sourcegraph.com if you want me to send it to you (but I don’t want to post it on the web).
About this issue
- State: closed
- Created 10 years ago
- Comments: 41 (10 by maintainers)
+1, I am having similar issues.
I run a cluster of 3 etcd servers. I want to update the machine image backing them (we are in AWS, so a new AMI) so I intend to do a rolling restart of the cluster. But eventually I always end up in a situation where some number of nodes are serving client requests, but refusing to let new cluster members join:
Soon thereafter the existing nodes started to have serious troubles too: