openraft: Improve behaviour during network failure / downed nodes
I noticed a few things about the behaviour during network issues that could simplify / improve the system in such situations, and I wanted to share them to start a discussion on the topic.
The first thing I noticed is that `RaftNetworkFactory::connect` returns `Network`, not `Result<Network>`. This prevents the runtime from gracefully handling downed nodes.
The second thing I noticed is that there is no backoff when network errors occur, so the system just keeps hammering the other side relentlessly.
This gave me the idea to treat the networking layer somewhat like Erlang treats servers: all calls remain available, and errors lead to eventual reinitialization and backoff.
This would allow moving the downed-node behaviour into the runtime, saving people from having to implement reconnects etc. themselves.
So as a first thought:
If we make `RaftNetworkFactory::connect` fallible, the raft engine can keep track of successful and unsuccessful connections. Unsuccessful connections can be re-initialized using a backoff strategy (exponential backoff with a limit, perhaps eventually a trait that could be passed in to fine-tune this).
All errors in the network lead to the network state being set to downed, forcing a reconnect. (I'm not sure about this; it could also lead to problems. Perhaps some escalating grace period per error, or marking a node downed only after X errors within N time, would be better.)
With those two things, the network implementation becomes quite a bit easier to write, and a lot of the more complicated (and more error-prone) logic could be hoisted into the engine.
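As a purely hypothetical sketch of what the backoff part of this proposal could look like (the names and signatures below are made up for illustration and are not openraft's API), the engine could track failed connections and ask a pluggable policy how long to wait before the next attempt:

```rust
// Purely illustrative; these names are not openraft's API. The idea is that
// the engine tracks failed connections and asks a pluggable policy how long
// to wait before the next reconnect attempt.
use std::time::Duration;

trait BackoffPolicy {
    /// Delay before the given (0-based) retry attempt; `None` means give up.
    fn next_delay(&self, attempt: u32) -> Option<Duration>;
}

/// Exponential backoff with an upper limit, as suggested above.
struct ExponentialBackoff {
    base: Duration,
    max: Duration,
}

impl BackoffPolicy for ExponentialBackoff {
    fn next_delay(&self, attempt: u32) -> Option<Duration> {
        // base * 2^attempt, capped at `max`; saturate instead of overflowing.
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        Some((self.base * factor).min(self.max))
    }
}
```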
About this issue
- State: closed
- Created 2 years ago
- Comments: 21 (17 by maintainers)
Commits related to this issue
- Feature: add backoff strategy for unreachable nodes Implements a backoff strategy for temporarily or permanently unreachable nodes. If the `Network` implementation returns `Unreachable` error, Openra... — committed to drmingdrmer/openraft by drmingdrmer a year ago
@Licenser You’re correct, I hadn’t considered the possibility of error message flooding. In this case, the replication mod should be able to handle nodes that are unreachable.
Such an error will be returned to openraft in `RPCError`: https://github.com/drmingdrmer/openraft/blob/17a658f41ba193609ee3ac626149ef50592d937c/openraft/src/error.rs#L245-L254

My initial proposal is to allow an implementation of `RaftNetwork::send_append_entries()` to return a designated error (`Unreachable(NodeId)`) that indicates a delay in retrying the operation. And let openraft enable users to define their own backoff algorithm or use a default one. Opinion?
That sounds like a great solution: that way the implementation can return `Unreachable(NodeId)` if it is in “backoff mode” and then try to re-connect when it gets the next error! Brilliant 😄 I love it!

We’ve implemented something similar to this:
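(For illustration only: the sketch below is not the commenter's actual implementation. The `BackoffNetwork` name is taken from the discussion that follows; the simplified `Network` trait, `RpcError` type, and durations are assumptions standing in for openraft's real generic traits.)

```rust
// Hypothetical, simplified sketch: a wrapper that reports `Unreachable`
// while a backoff window is active, instead of hitting the remote node again.
use std::time::{Duration, Instant};

enum RpcError {
    /// Tells the engine to back off before retrying this node.
    Unreachable(u64 /* node id */),
    Network(String),
}

trait Network {
    fn send_append_entries(&mut self, payload: Vec<u8>) -> Result<(), RpcError>;
}

struct BackoffNetwork<N> {
    target: u64,
    inner: N,
    /// While set and not yet elapsed, we are in "backoff mode".
    retry_after: Option<Instant>,
    backoff: Duration,
}

impl<N: Network> Network for BackoffNetwork<N> {
    fn send_append_entries(&mut self, payload: Vec<u8>) -> Result<(), RpcError> {
        // Still inside the backoff window: don't touch the wire at all.
        if let Some(until) = self.retry_after {
            if Instant::now() < until {
                return Err(RpcError::Unreachable(self.target));
            }
            self.retry_after = None;
        }

        match self.inner.send_append_entries(payload) {
            Ok(resp) => {
                // Reset to an initial delay on success (value is arbitrary here).
                self.backoff = Duration::from_millis(50);
                Ok(resp)
            }
            Err(_) => {
                // Open a new backoff window and grow it, capped at some maximum.
                self.retry_after = Some(Instant::now() + self.backoff);
                self.backoff = (self.backoff * 2).min(Duration::from_secs(5));
                Err(RpcError::Unreachable(self.target))
            }
        }
    }
}
```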
We still hit the 50ms heartbeat timeout. Why would this wrapper let the app sleep? Doesn't `BackoffNetwork`'s `send_append_entries` have its own 50ms timeout to deal with, since it also implements the `RaftNetwork` trait?
Sorry for the late reply; the last few days have been hectic. That's a really neat idea: I never thought about that kind of layering, and I really like it! It would still give timeout errors in the logs, but it's a good stop-gap 😃.
I’m following your discussion, since we have a similar problem in our project - we don’t want to send unnecessary messages (and we don’t even want to send empty ping messages, see below).
We have a different method for keeping "liveness" info about a link between two nodes. As long as the link is deemed "live", any consensus domain communicating via this link (we host replicas of multiple/many consensus domains on the same node) can expect its messages to ultimately go through. As soon as the "liveness" property is no longer guaranteed, we can inform all affected consensus domains and thus, for example, let them start a new election after the timeout. I.e., instead of empty ping messages, we have an external liveness check.
I haven't looked into the openraft implementation yet, but this external liveness check could likely be integrated fairly simply in place of the traditional heartbeat. Then, as soon as the link is deemed down by the external agent (which can, for example, do its own heartbeat on behalf of multiple consensus domains), we can tell all affected local openraft replicas, so they can start their "usual" handling of a replication timeout (and start an election on a follower).
@Licenser Maybe you could live with something like this as well? Not sure about your use case. But, as I wrote, this is in a concept phase and isn't implemented yet; we wanted to start with simple openraft and initially use UDP directly to transport messages.
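(For illustration only, a rough and entirely hypothetical sketch of the external-liveness idea described above; none of these types exist in openraft, and the actual election handling is not shown.)

```rust
// One agent watches link state and fans "link down" events out to every local
// consensus domain, which can then fall back to its usual timeout/election
// handling. All names here are made up for illustration.
use std::collections::HashMap;
use std::sync::mpsc::Sender;

type LinkId = (u64, u64); // (local node, remote node)

enum LivenessEvent {
    LinkDown(LinkId),
    LinkUp(LinkId),
}

struct LivenessAgent {
    // Each consensus domain hosted on this node registers the links it uses
    // and a channel on which it wants to hear about liveness changes.
    subscribers: HashMap<LinkId, Vec<Sender<LivenessEvent>>>,
}

impl LivenessAgent {
    fn subscribe(&mut self, link: LinkId, tx: Sender<LivenessEvent>) {
        self.subscribers.entry(link).or_default().push(tx);
    }

    // Called by whatever does the real per-link heartbeat (one heartbeat per
    // link, regardless of how many consensus domains share it).
    fn report_link_down(&self, link: LinkId) {
        if let Some(domains) = self.subscribers.get(&link) {
            for tx in domains {
                // A receiving replica would treat this like a replication
                // timeout, e.g. a follower could start an election.
                let _ = tx.send(LivenessEvent::LinkDown(link));
            }
        }
    }
}
```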
After a night, it feels like `RaftNetworkFactory::connect()` returning a `Result` would be better. As you said, there may be misconfigured node addresses being used; in such cases, the raft has a chance to discover it early. But still, the raft can do nothing about such an error, except print an error log.

Using `Result` instead of `panic()` will let the cluster keep working even when there is only a minority of misconfigured nodes.
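(To make the contrast concrete, here is a heavily simplified, hypothetical factory trait; openraft's real `RaftNetworkFactory` is an async, generic trait, so treat this only as an illustration of the fallible signature.)

```rust
// Hypothetical, heavily simplified factory trait; not openraft's real API.
#[derive(Debug)]
struct ConnectError(String);

trait NetworkFactory {
    type Network;

    // Fallible: a misconfigured address is reported to the caller, which can
    // log it and schedule a retry with backoff, so a minority of bad nodes
    // doesn't take the whole process down the way a panic would.
    fn connect(&mut self, target: u64, addr: &str) -> Result<Self::Network, ConnectError>;
}
```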