raft-rs: re-joining after a simulated node crash panics on trying to re-add an already existing node.
Describe the bug
Re-joining a former leader to the cluster crashes on the attempt to re-add an already known node to the progress state.
Oct 31 15:49:52.769 ERRO e: The node 2 already exists in the voters set., raft_id: 1
0: backtrace::backtrace::libunwind::trace
at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/libunwind.rs:88
1: backtrace::backtrace::trace_unsynchronized
at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/mod.rs:66
2: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:47
3: std::sys_common::backtrace::print
at src/libstd/sys_common/backtrace.rs:36
4: std::panicking::default_hook::{{closure}}
at src/libstd/panicking.rs:200
5: std::panicking::default_hook
at src/libstd/panicking.rs:214
6: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:477
7: std::panicking::continue_panic_fmt
at src/libstd/panicking.rs:384
8: rust_begin_unwind
at src/libstd/panicking.rs:311
9: core::panicking::panic_fmt
at src/libcore/panicking.rs:85
10: core::result::unwrap_failed
at src/libcore/result.rs:1084
11: core::result::Result<T,E>::unwrap
at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/libcore/result.rs:852
12: uring::raft_node::RaftNode::on_ready
at src/raft_node.rs:325
13: uring::loopy_thing
at src/main.rs:635
14: uring::main::{{closure}}
at src/main.rs:693
To Reproduce
I set up a mini demo to replicate the issue:
- clone https://github.com/wayfair-incubator/uring/tree/2416031ac34759f002a9a1539b5a2a54bbd84946
- start node 1:
cargo run -- -e 127.0.0.1:8081 -i 1
- wait for it to elect itself leader
- start node 2:
cargo run -- -e 127.0.0.1:8082 -i 2 -p 127.0.0.1:8081
- wait for it to join the cluster
- start node 3:
cargo run -- -e 127.0.0.1:8083 -i 3 -p 127.0.0.1:8081
- wait for it to join the cluster.
- terminate node 1 (the leader):
CTRL+C
- node 2 or 3 will become leader
- restart node 1 and let it re-join the cluster
cargo run -- -e 127.0.0.1:8081 -i 1 -p 127.0.0.1:8082
Expected behavior
Stopping and starting nodes in a cluster should be handled gracefully.
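Graceful handling could, for instance, mean tolerating the duplicate conf change instead of unwrapping it. A sketch of that idea, assuming the raft-rs 0.6-era tuple form of Error::Exists (later releases changed its shape); this is a possible mitigation, not uring's current behavior:

use raft::eraftpb::{ConfChange, ConfState};
use raft::{storage::MemStorage, Error, RawNode};

// Hypothetical tolerant variant of the apply step sketched above:
// a conf change that re-adds an existing voter is logged and skipped
// instead of panicking the process.
fn apply_conf_change_tolerant(
    node: &mut RawNode<MemStorage>,
    cc: &ConfChange,
) -> raft::Result<Option<ConfState>> {
    match node.apply_conf_change(cc) {
        Ok(cs) => Ok(Some(cs)),
        Err(Error::Exists(id, set)) => {
            eprintln!("node {} already in the {} set; ignoring conf change", id, set);
            Ok(None)
        }
        Err(e) => Err(e),
    }
}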
System information (probably not relevant)
- CPU architecture: x86
- Distribution and kernel version: OS X 10.14.6
- SELinux on?: No
- Any other system details we should know?: no
Additional context
The shared repo is a minimal demo app trying to put raft-rs into a usable state for a raft cluster.
About this issue
- State: open
- Created 5 years ago
- Comments: 41 (18 by maintainers)
@Licenser If you don’t call
curl -X POST 127.0.0.1:8081/node/1
in your reproduction case, it seems to work without panicking… At least raft is trying to send the correct messages, but it seems to be unable to.
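For reference, that POST /node/1 presumably boils down to proposing an AddNode conf change on whichever node receives it; in raft-rs terms, roughly the sketch below. The HTTP wiring and the add_node helper are assumptions for illustration, not uring's actual code:

use raft::eraftpb::{ConfChange, ConfChangeType};
use raft::{storage::MemStorage, RawNode};

// Hypothetical body of POST /node/<id>: propose adding <id> as a voter.
fn add_node(node: &mut RawNode<MemStorage>, id: u64) -> raft::Result<()> {
    let mut cc = ConfChange::new();
    cc.set_node_id(id);
    cc.set_change_type(ConfChangeType::AddNode);
    // Proposing AddNode for an id that is already a voter is what later
    // trips Error::Exists when the committed entry is applied.
    node.propose_conf_change(vec![], cc)
}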