raft-rs: re-joining after a simulated node crash panics when trying to re-add an already existing node.

Describe the bug

Re-joining a former leader to a cluster crashes when it attempts to re-add an already-known node to the progress set.

Oct 31 15:49:52.769 ERRO e: The node 2 already exists in the voters set., raft_id: 1
   0: backtrace::backtrace::libunwind::trace
             at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:47
   3: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:36
   4: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:200
   5: std::panicking::default_hook
             at src/libstd/panicking.rs:214
   6: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:477
   7: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:384
   8: rust_begin_unwind
             at src/libstd/panicking.rs:311
   9: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  10: core::result::unwrap_failed
             at src/libcore/result.rs:1084
  11: core::result::Result<T,E>::unwrap
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/libcore/result.rs:852
  12: uring::raft_node::RaftNode::on_ready
             at src/raft_node.rs:325
  13: uring::loopy_thing
             at src/main.rs:635
  14: uring::main::{{closure}}
             at src/main.rs:693

To Reproduce

I set up a mini demo to replicate:

  1. clone https://github.com/wayfair-incubator/uring/tree/2416031ac34759f002a9a1539b5a2a54bbd84946
  2. start node 1: cargo run -- -e 127.0.0.1:8081 -i 1
  3. wait for it to elect itself leader
  4. start node 2: cargo run -- -e 127.0.0.1:8082 -i 2 -p 127.0.0.1:8081
  5. wait for it to join the cluster
  6. start node 3: cargo run -- -e 127.0.0.1:8083 -i 3 -p 127.0.0.1:8081
  7. wait for it to join the cluster
  8. terminate node 1 (the leader) with CTRL+C
  9. node 2 or 3 will become the leader
  10. restart node 1 and let it re-join the cluster: cargo run -- -e 127.0.0.1:8081 -i 1 -p 127.0.0.1:8082 (a possible guard for this step is sketched after the list)
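
One way to make step 10 idempotent on the proposing side is to check membership before proposing the conf change. A minimal sketch, assuming the current voters are reachable through raw_node.raft.prs().voter_ids() (the exact accessor name varies across raft-rs versions):

    fn propose_add_node(
        raw_node: &mut raft::RawNode<raft::storage::MemStorage>,
        id: u64,
    ) -> raft::Result<()> {
        // Skip the proposal entirely if the id is already a voter, so a
        // re-joining node does not trigger a duplicate AddNode entry.
        if raw_node.raft.prs().voter_ids().contains(&id) {
            return Ok(());
        }
        let mut cc = raft::eraftpb::ConfChange::default();
        cc.set_change_type(raft::eraftpb::ConfChangeType::AddNode);
        cc.set_node_id(id);
        raw_node.propose_conf_change(vec![], cc)
    }

Note this only guards new proposals; AddNode entries already committed to the log are still replayed on restart, which is why the handler sketched under the backtrace also needs to tolerate Exists.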

Expected behavior

Stopping and starting nodes in a cluster should be handled gracefully.

System information (probably not relevant)

  • CPU architecture: x86
  • Distribution and kernel version: OS X 10.14.6
  • SELinux on?: No
  • Any other system details we should know?: no

Additional context

The shared repo is a minimal demo app trying to put raft-rs into a usable state for a raft cluster.

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 41 (18 by maintainers)

Most upvoted comments

@Licenser If you don’t call curl -X POST 127.0.0.1:8081/node/1 in your reproduction case, it seems to work without panicking… At least raft is trying to send the correct messages, but it seems to be unable to.
