openraft: How to handle and remove unreachable nodes

It seems that currently, if a node disconnects abruptly and was not explicitly removed beforehand, the leader “spins” forever, retrying the request repeatedly.

I tried returning both RPCError::Timeout and RPCError::Network from my implementation, but there seems to be no upper limit on retries at the moment, so the request is simply retried forever. This means that the leader cannot call remove_member after the fact (even then, the call fails the preflight is_leader() check).
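To make the expectation concrete, here is a minimal sketch of the kind of upper limit I had in mind. It does not use openraft's API: `SketchRpcError`, `SendOutcome`, and `send_with_retry_limit` are hypothetical names invented for this illustration, and only the variant names `Timeout` and `Network` echo the errors mentioned above.

```rust
/// Hypothetical error type for this sketch. It only borrows the variant
/// names mentioned above and is NOT openraft's `RPCError`.
#[derive(Debug, Clone, PartialEq)]
pub enum SketchRpcError {
    Timeout,
    Network(String),
}

/// Result of a bounded retry loop (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum SendOutcome<T> {
    /// The request eventually succeeded.
    Delivered(T),
    /// Every attempt failed; the caller may now treat the target as unreachable.
    Unreachable {
        attempts: u32,
        last_error: SketchRpcError,
    },
}

/// Calls `send` until it succeeds or `max_attempts` tries have failed.
/// The closure receives the 1-based attempt number.
/// At least one attempt is always made, even if `max_attempts` is 0.
pub fn send_with_retry_limit<T, F>(max_attempts: u32, mut send: F) -> SendOutcome<T>
where
    F: FnMut(u32) -> Result<T, SketchRpcError>,
{
    let mut attempt = 1;
    loop {
        match send(attempt) {
            Ok(reply) => return SendOutcome::Delivered(reply),
            // Give up once the limit is reached and report the last error.
            Err(err) if attempt >= max_attempts => {
                return SendOutcome::Unreachable {
                    attempts: attempt,
                    last_error: err,
                };
            }
            // Otherwise try again with the next attempt number.
            Err(_) => attempt += 1,
        }
    }
}
```

With something like this, a caller that gets `Unreachable` back would have a clear point at which to stop retrying and start removing the node, instead of spinning indefinitely.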

I tried to return RPCError::NodeNotFound, which resulted in a panic.

Am I doing something wrong? Is there a proper way to remove nodes that can no longer be reached?

(Side note: I am on the current main branch, if that is of any significance.)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

I just tested the fix, and it works! Thanks

I have compiled a full log here: https://gist.github.com/indietyp/f899ef6b4e0ba10c4d4be987d6f12692 (it is quite long)

I will try to upload a complete log via gist (I hope that’s ok) as soon as possible.