raft-rs: re-joining after a simulated node crash panics when trying to re-add an already existing node.

Describe the bug

Re-joining a former leader to a cluster crashes when it attempts to re-add an already-known node to the progress set.

Oct 31 15:49:52.769 ERRO e: The node 2 already exists in the voters set., raft_id: 1
   0: backtrace::backtrace::libunwind::trace
             at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /Users/vsts/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:47
   3: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:36
   4: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:200
   5: std::panicking::default_hook
             at src/libstd/panicking.rs:214
   6: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:477
   7: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:384
   8: rust_begin_unwind
             at src/libstd/panicking.rs:311
   9: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  10: core::result::unwrap_failed
             at src/libcore/result.rs:1084
  11: core::result::Result<T,E>::unwrap
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/libcore/result.rs:852
  12: uring::raft_node::RaftNode::on_ready
             at src/raft_node.rs:325
  13: uring::loopy_thing
             at src/main.rs:635
  14: uring::main::{{closure}}
             at src/main.rs:693

To Reproduce

I set up a mini demo to replicate:

  1. clone https://github.com/wayfair-incubator/uring/tree/2416031ac34759f002a9a1539b5a2a54bbd84946
  2. start node 1: cargo run -- -e 127.0.0.1:8081 -i 1
  3. wait for it to elect itself leader
  4. start node 2: cargo run -- -e 127.0.0.1:8082 -i 2 -p 127.0.0.1:8081
  5. wait for it to join the cluster
  6. start node 3: cargo run -- -e 127.0.0.1:8083 -i 3 -p 127.0.0.1:8081
  7. wait for it to join the cluster
  8. terminate node 1 (the leader) with CTRL+C
  9. node 2 or 3 will become the leader
  10. restart node 1 and let it re-join the cluster: cargo run -- -e 127.0.0.1:8081 -i 1 -p 127.0.0.1:8082 (a possible guard for this step is sketched after the list)
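
One way to make step 10 idempotent on the proposing side is to check membership before proposing the conf change. A minimal sketch, assuming the current voters are reachable through raw_node.raft.prs().voter_ids() (the exact accessor name varies across raft-rs versions):

    fn propose_add_node(
        raw_node: &mut raft::RawNode<raft::storage::MemStorage>,
        id: u64,
    ) -> raft::Result<()> {
        // Skip the proposal entirely if the id is already a voter, so a
        // re-joining node does not trigger a duplicate AddNode entry.
        if raw_node.raft.prs().voter_ids().contains(&id) {
            return Ok(());
        }
        let mut cc = raft::eraftpb::ConfChange::default();
        cc.set_change_type(raft::eraftpb::ConfChangeType::AddNode);
        cc.set_node_id(id);
        raw_node.propose_conf_change(vec![], cc)
    }

Note this only guards new proposals; AddNode entries already committed to the log are still replayed on restart, which is why the handler sketched under the backtrace also needs to tolerate Exists.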

Expected behavior

Stopping and starting nodes in a cluster should be handled gracefully.

System information (probably not relevant)

  • CPU architecture: x86
  • Distribution and kernel version: OS X 10.14.6
  • SELinux on?: No
  • Any other system details we should know?: no

Additional context

The shared repo is a minimal demo app trying to put raft-rs into a usable state for a raft cluster.

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 41 (18 by maintainers)

Most upvoted comments

@Licenser If you don’t call curl -X POST 127.0.0.1:8081/node/1 in your reproduction case, it seems to work without panicking… At least raft is trying to send the correct messages, but it seems to be unable to.
