etcd: panic: tocommit(138789) is out of range [lastIndex(0)]
On 2.2.0, in a clustered setup with two static members, I initialised the second member using the ETCD_INITIAL_CLUSTER_STATE="existing" and ETCD_INITIAL_CLUSTER variables. The second member was successfully restarted several times. Then, to simulate a data-loss scenario, I stopped the second member and deleted its data files. The attempted (re-)start ended in a crash and a long backtrace:
2015-09-13 00:45:30.785618 I | rafthttp: the connection with dd3b0cb0cf32bcef became active
2015-09-13 00:45:30.786353 I | raft: c8a988b91c9c29a1 [term: 1] received a MsgApp message with higher term from dd3b0cb0cf32bcef [term: 1676]
2015-09-13 00:45:30.786421 I | raft: c8a988b91c9c29a1 became follower at term 1676
2015-09-13 00:45:30.786491 I | raft: raft.node: c8a988b91c9c29a1 elected leader dd3b0cb0cf32bcef at term 1676
2015-09-13 00:45:30.786509 C | raft: tocommit(138789) is out of range [lastIndex(0)]
panic: tocommit(138789) is out of range [lastIndex(0)]
[...]
I expected the second member to be able to recover and re-load its data (just as it initialised/loaded its data from the first member initially).
Even if such a scenario is not intended to work, exiting with a meaningful error is better than crashing.
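To illustrate what the panic message is checking, here is a minimal Go sketch of a commit-index bounds check. It is a simplified model, not etcd's actual code: the followerLog type and its fields are hypothetical names chosen for this example, and the check returns an error instead of panicking, as requested above. One plausible reading of the log, stated here as an assumption, is that the leader still associates this member ID with its old progress and so sends a commit index that the wiped, empty log cannot satisfy.

```go
package main

import "fmt"

// followerLog is a hypothetical, stripped-down stand-in for a Raft
// follower's log. It is not etcd's real data structure.
type followerLog struct {
	lastIndex uint64 // index of the last entry stored locally
	committed uint64 // highest index known to be committed
}

// commitTo advances the commit index, refusing to move it past the
// end of the local log.
func (l *followerLog) commitTo(tocommit uint64) error {
	if tocommit <= l.committed {
		return nil // the commit index never decreases
	}
	if tocommit > l.lastIndex {
		return fmt.Errorf("tocommit(%d) is out of range [lastIndex(%d)]",
			tocommit, l.lastIndex)
	}
	l.committed = tocommit
	return nil
}

func main() {
	// A member whose data files were deleted restarts with an empty log.
	wiped := &followerLog{}
	if err := wiped.commitTo(138789); err != nil {
		fmt.Println("refusing to commit:", err)
	}
}
```

With lastIndex at 0, any positive commit index from the leader fails the check, which matches the values shown in the log above.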
Also, I can’t recover from this situation: the cluster is now unhealthy, and an attempt to etcdctl member remove the unavailable (crashing) member fails with: Recieved an error trying to remove member c8a988b91c9c29a1: client: etcd cluster is unavailable or misconfigured
It feels fragile, as the loss of either of the two members leaves the cluster without an obvious means to recover or to switch to a non-clustered (single daemon) setup.
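The failed member removal is consistent with Raft's majority rule: a membership change is itself a replicated write, so it needs a quorum. The sketch below only does the arithmetic; quorum and canCommit are hypothetical helper names for this example, not etcd functions.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

// canCommit reports whether a cluster of n members can still commit
// writes when only `healthy` of them are reachable.
func canCommit(n, healthy int) bool { return healthy >= quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), n-quorum(n))
	}
	// The situation in this report: two members, one of them crashing.
	fmt.Println("two members, one down, can commit:", canCommit(2, 1))
}
```

A two-member cluster has a quorum of two, so it tolerates no failures at all, whereas a three-member cluster can lose one member and still accept writes such as a member removal.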
About this issue
- State: closed
- Created 9 years ago
- Comments: 18 (17 by maintainers)
Commits related to this issue
- raft: improve panic error message Give a human being some insight into how we might have gotten to this state based on feedback from #3504. — committed to philips/etcd by deleted user 9 years ago
- raft: improve panic error message Give a human being some insight into how we might have gotten to this state based on feedback from #3504. — committed to mwitkow/etcd by deleted user 9 years ago
- raft: improve panic error message Give a human being some insight into how we might have gotten to this state based on feedback from #3504. — committed to yichengq/etcd by deleted user 9 years ago
With all due respect, you should pay more attention to bug reports. I was testing a plausible situation in which the data folder is gone (which can happen for various reasons) while the configuration is not. I did not say anything about production, and frankly the problems I highlighted do not help build confidence in etcd technology…