etcd: Cannot restart after shutdown / Ctrl+C (etcdmain: database file does not match with snapshot)
I have a cluster of 3 and I'm testing failover scenarios right now. I simulate power outages (shutting off a VM) and service failures (Ctrl+C on etcd). As of now, shutting off an instance has terrible consequences, as it cannot recover by simply rebooting the machine / restarting the service:
$ etcd
// ...runs...
<Press Ctrl+C>
$ etcd
2016-07-03 02:05:21.753663 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://rly4:2379
2016-07-03 02:05:21.754254 I | flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd/data
2016-07-03 02:05:21.754550 I | flags: recognized and used environment variable ETCD_DEBUG=true
2016-07-03 02:05:21.754805 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://rly4:2380
2016-07-03 02:05:21.755066 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=rly4=http://rly4:2380,rly1=http://rly1:2380,rly5=http://rly5:2380
2016-07-03 02:05:21.755317 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2016-07-03 02:05:21.755585 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=rly-cluster-1
2016-07-03 02:05:21.755833 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://rly4:2379,http://127.0.0.1:2379
2016-07-03 02:05:21.756038 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://rly4:2380
2016-07-03 02:05:21.756276 I | flags: recognized and used environment variable ETCD_NAME=rly4
2016-07-03 02:05:21.756525 I | flags: recognized and used environment variable ETCD_WAL_DIR=/var/lib/etcd/wal
2016-07-03 02:05:21.756784 I | etcdmain: etcd Version: 3.0.0
2016-07-03 02:05:21.757038 I | etcdmain: Git SHA: 6f48bda
2016-07-03 02:05:21.757261 I | etcdmain: Go Version: go1.6.2
2016-07-03 02:05:21.757478 I | etcdmain: Go OS/Arch: linux/amd64
2016-07-03 02:05:21.757696 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2016-07-03 02:05:21.757950 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2016-07-03 02:05:21.761023 I | etcdmain: listening for peers on http://rly4:2380
2016-07-03 02:05:21.761065 I | etcdmain: listening for client requests on 127.0.0.1:2379
2016-07-03 02:05:21.763682 I | etcdmain: listening for client requests on rly4:2379
2016-07-03 02:05:21.804445 I | etcdserver: recovered store from snapshot at index 141365
2016-07-03 02:05:21.804844 I | etcdserver: name = rly4
2016-07-03 02:05:21.805118 I | etcdserver: data dir = /var/lib/etcd/data
2016-07-03 02:05:21.805389 I | etcdserver: member dir = /var/lib/etcd/data/member
2016-07-03 02:05:21.805661 I | etcdserver: dedicated WAL dir = /var/lib/etcd/wal
2016-07-03 02:05:21.805891 I | etcdserver: heartbeat = 100ms
2016-07-03 02:05:21.806169 I | etcdserver: election = 1000ms
2016-07-03 02:05:21.806428 I | etcdserver: snapshot count = 10000
2016-07-03 02:05:21.806696 I | etcdserver: advertise client URLs = http://rly4:2379
2016-07-03 02:05:21.807377 I | etcdserver: restarting member 89889c482441ec6e in cluster d0ce954dc5f082d0 at commit index 141549
2016-07-03 02:05:21.807753 I | raft: 89889c482441ec6e became follower at term 1966
2016-07-03 02:05:21.808041 I | raft: newRaft 89889c482441ec6e [peers: [89889c482441ec6e,d27323ab9b295f50,e790e6b697b3c219], term: 1966, commit: 141549, applied: 141365, lastindex: 141549, lastterm: 1966]
2016-07-03 02:05:21.808406 I | membership: added member d27323ab9b295f50 [http://rly5:2380] to cluster d0ce954dc5f082d0 from store
2016-07-03 02:05:21.808751 I | membership: added member e790e6b697b3c219 [http://rly1:2380] to cluster d0ce954dc5f082d0 from store
2016-07-03 02:05:21.809021 I | membership: added member 89889c482441ec6e [http://rly4:2380] to cluster d0ce954dc5f082d0 from store
2016-07-03 02:05:21.809295 I | membership: set the cluster version to 3.0 from store
2016-07-03 02:05:21.818080 I | etcdmain: stopping listening for client requests on rly4:2379
2016-07-03 02:05:21.818511 I | etcdmain: stopping listening for client requests on rly4:2379
2016-07-03 02:05:21.818707 I | etcdmain: stopping listening for peers on http://rly4:2380
2016-07-03 02:05:21.818894 C | etcdmain: database file (/var/lib/etcd/data/member/snap/db index 140014) does not match with snapshot (index 141365).
$ find
./data
./data/member
./data/member/snap
./data/member/snap/00000000000007ae-0000000000022835.snap
./data/member/snap/db
./wal
./wal/0000000000000000-0000000000000000.wal
./wal/0.tmp
The only way to recover from this right now is to delete the data and WAL dirs, remove the member from the cluster, re-add it, and then restart the daemon. However, I fear that if all members fail at the same time (power failure), I will lose all data.
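For reference, the two indices in the fatal log line can be cross-checked against the file listing above. etcd names snapshot files as two 16-digit hexadecimal numbers (Raft term, then index), so the `.snap` file name can be decoded by hand. The following is a minimal sketch of that decoding; the helper name `parse_snap_name` is hypothetical and not part of etcd, and the `db_index` value is simply copied from the log line above.

```python
import re

# Snapshot files are named "<term>-<index>.snap", each part 16 hex digits.
SNAP_NAME = re.compile(r"^([0-9a-f]{16})-([0-9a-f]{16})\.snap$")


def parse_snap_name(name):
    """Return (term, index) decoded from a snapshot file name.

    Hypothetical helper for illustration only; not an etcd API.
    """
    m = SNAP_NAME.match(name)
    if m is None:
        raise ValueError(f"not a snapshot file name: {name!r}")
    return int(m.group(1), 16), int(m.group(2), 16)


# File name taken from the `find` output above.
term, index = parse_snap_name("00000000000007ae-0000000000022835.snap")

# Index that the fatal log line reports for data/member/snap/db.
db_index = 140014

print(term, index, index - db_index)  # prints "1966 141365 1351"
```

The decoded term (1966) and index (141365) match the `raft` and `recovered store from snapshot` log lines, so the snapshot file itself looks intact; it is the `db` file that is 1351 entries behind it.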
About this issue
- State: closed
- Created 8 years ago
- Comments: 17 (8 by maintainers)
Commits related to this issue
- README: add more instructions on building etcd Address - https://github.com/coreos/etcd/issues/5857#issuecomment-230174840 — committed to gyuho/etcd by gyuho 8 years ago
- Documentation: add instruction on vendoring, build Addressing https://github.com/coreos/etcd/issues/5857#issuecomment-230174840. — committed to gyuho/etcd by gyuho 8 years ago
@gyuho I finally got it to build. Turns out the PPA version of go is needed. The one in Ubuntu 14.04 upstream does not work. Here are my instructions (if you want to add them somewhere):
@xiang90 I can confirm that your fix worked.
Just wanted to say thanks. You guys have been great! Blazing fast responses, awesome tool!! Much appreciated.