etcd: panic: tocommit(138789) is out of range [lastIndex(0)]

On 2.2.0, in a clustered setup with two static members, I initialised the second member using the ETCD_INITIAL_CLUSTER_STATE="existing" and ETCD_INITIAL_CLUSTER variables. The second member was then restarted successfully several times. To simulate a data-loss scenario, I stopped the second member and deleted its data files. The attempted (re-)start ended in a crash and a long backtrace:

2015-09-13 00:45:30.785618 I | rafthttp: the connection with dd3b0cb0cf32bcef became active
2015-09-13 00:45:30.786353 I | raft: c8a988b91c9c29a1 [term: 1] received a MsgApp message with higher term from dd3b0cb0cf32bcef [term: 1676]
2015-09-13 00:45:30.786421 I | raft: c8a988b91c9c29a1 became follower at term 1676
2015-09-13 00:45:30.786491 I | raft: raft.node: c8a988b91c9c29a1 elected leader dd3b0cb0cf32bcef at term 1676
2015-09-13 00:45:30.786509 C | raft: tocommit(138789) is out of range [lastIndex(0)]
panic: tocommit(138789) is out of range [lastIndex(0)]
[...]
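
The last log line reads like a bounds check: the restarted member has an empty log (lastIndex 0), while the message from the leader carries a commit index of 138789. Below is a toy model of such a check. The names `ToyLog`, `commit_to` and `try_commit` are hypothetical and chosen for illustration; this is a sketch of the idea, not etcd's actual code.

```python
class ToyLog:
    """Toy follower log; names are hypothetical, not etcd's implementation."""

    def __init__(self, last_index=0, committed=0):
        self.last_index = last_index  # index of the last entry held locally
        self.committed = committed    # highest index known to be committed

    def commit_to(self, tocommit):
        # A follower cannot mark entries as committed that it does not hold.
        if tocommit > self.last_index:
            raise RuntimeError(
                "tocommit(%d) is out of range [lastIndex(%d)]"
                % (tocommit, self.last_index)
            )
        # The commit index only ever moves forward.
        self.committed = max(self.committed, tocommit)


def try_commit(log, tocommit):
    """Return the error message, or None when the commit index is accepted."""
    try:
        log.commit_to(tocommit)
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # An empty log (data files deleted) asked to commit index 138789.
    print(try_commit(ToyLog(), 138789))
```

In this model a member that has lost its data but kept its identity is asked to commit entries it no longer holds, which matches the situation shown in the log above.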

I expected the second member to be able to recover and re-load its data (just as it loaded its data from the first member initially). Even if such a scenario is not intended to work, exiting with a meaningful error is better than crashing. Also, I can’t recover from this situation: the cluster is now unhealthy, and an attempt to etcdctl member remove the unavailable (crashing) member fails with Recieved an error trying to remove member c8a988b91c9c29a1: client: etcd cluster is unavailable or misconfigured. It feels fragile, as the loss of either of the two members leaves the cluster without obvious means to recover or to switch to a non-clustered (single daemon) setup.
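
The fragility follows from majority-quorum arithmetic in Raft-style clusters: a two-member cluster needs both members to form a majority, so it tolerates no failures. A minimal sketch of that arithmetic is below; the function names are illustrative and not part of any etcd API.

```python
def quorum(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, quorum(n), failures_tolerated(n))
```

With two members the quorum is two, so losing either one makes the cluster unable to accept changes, including the membership change needed to remove the failed member.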

About this issue

  • State: closed
  • Created 9 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

You should read the docs before you try to build your etcd cluster and put it into production. I think our docs cover most of your questions.

With all due respect, you should be paying more attention to bug reports. I’ve been testing a plausible situation where the data folder is gone (it can happen for various reasons) while the configuration is not. I did not say anything about production, and frankly, the presence of the problems I highlighted does not help build confidence in etcd technology…