openraft: Improve behaviour during network failure / downed nodes
I noticed a few things about the behaviour during network issues that could simplify / improve the system in such situations, and I wanted to share them to start a discussion on the topic.
The first thing I noticed is that `RaftNetworkFactory::connect` returns `Network`, not `Result<Network>`. This prevents the runtime from gracefully handling downed nodes.
The second thing I noticed is that there is no backoff when network errors occur, so the system just keeps hammering the other side relentlessly.
This gave me the idea to treat the networking layer somewhat like Erlang treats servers: all calls remain available, and errors lead to eventual reinitialization and backoff.
This would allow moving the downed-node behaviour into the runtime, saving people from having to implement reconnects etc. themselves.
So as a first thought:
If we make `RaftNetworkFactory::connect` fallible, the raft engine can keep track of successful and unsuccessful connections. Unsuccessful connections can be re-initialized using a backoff strategy (exponential backoff with a limit, perhaps eventually a trait that could be passed in to fine-tune this).
All errors in the network lead to the network state being set to downed, forcing a reconnect. (I'm not sure about this; it could also lead to problems. Perhaps some escalating grace period per error, or marking a node downed only after X errors within N time, would be better.)
With those two things, the network implementation becomes quite a bit easier to write, and a lot of the more complicated (and more error-prone) logic could be hoisted into the engine.
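As a purely hypothetical sketch of what the backoff part of this proposal could look like (the names and signatures below are made up for illustration and are not openraft's API), the engine could track failed connections and ask a pluggable policy how long to wait before the next attempt:

```rust
// Purely illustrative; these names are not openraft's API. The idea is that
// the engine tracks failed connections and asks a pluggable policy how long
// to wait before the next reconnect attempt.
use std::time::Duration;

trait BackoffPolicy {
    /// Delay before the given (0-based) retry attempt; `None` means give up.
    fn next_delay(&self, attempt: u32) -> Option<Duration>;
}

/// Exponential backoff with an upper limit, as suggested above.
struct ExponentialBackoff {
    base: Duration,
    max: Duration,
}

impl BackoffPolicy for ExponentialBackoff {
    fn next_delay(&self, attempt: u32) -> Option<Duration> {
        // base * 2^attempt, capped at `max`; saturate instead of overflowing.
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        Some((self.base * factor).min(self.max))
    }
}
```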
About this issue
- State: closed
- Created 2 years ago
- Comments: 21 (17 by maintainers)
Commits related to this issue
- Feature: add backoff strategy for unreachable nodes Implements a backoff strategy for temporarily or permanently unreachable nodes. If the `Network` implementation returns `Unreachable` error, Openra... — committed to drmingdrmer/openraft by drmingdrmer a year ago
@Licenser You’re correct, I hadn’t considered the possibility of error message flooding. In this case, the replication mod should be able to handle nodes that are unreachable.
Such an error will be returned to openraft in `RPCError`: https://github.com/drmingdrmer/openraft/blob/17a658f41ba193609ee3ac626149ef50592d937c/openraft/src/error.rs#L245-L254

My initial proposal is to allow an implementation of `RaftNetwork::send_append_entries()` to return a designated error (`Unreachable(NodeId)`) that indicates a delay in retrying the operation. And let openraft enable users to define their own backoff algorithm or use a default one. Opinion?
That sounds like a great solution: that way the implementation can return `Unreachable(NodeId)` if it is in “backoff mode” and then try to re-connect when it gets the next error! Brilliant 😄 I love it!

We’ve implemented something similar to this:
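(For illustration only: the sketch below is not the commenter's actual implementation. The `BackoffNetwork` name is taken from the discussion that follows; the simplified `Network` trait, `RpcError` type, and durations are assumptions standing in for openraft's real generic traits.)

```rust
// Hypothetical, simplified sketch: a wrapper that reports `Unreachable`
// while a backoff window is active, instead of hitting the remote node again.
use std::time::{Duration, Instant};

enum RpcError {
    /// Tells the engine to back off before retrying this node.
    Unreachable(u64 /* node id */),
    Network(String),
}

trait Network {
    fn send_append_entries(&mut self, payload: Vec<u8>) -> Result<(), RpcError>;
}

struct BackoffNetwork<N> {
    target: u64,
    inner: N,
    /// While set and not yet elapsed, we are in "backoff mode".
    retry_after: Option<Instant>,
    backoff: Duration,
}

impl<N: Network> Network for BackoffNetwork<N> {
    fn send_append_entries(&mut self, payload: Vec<u8>) -> Result<(), RpcError> {
        // Still inside the backoff window: don't touch the wire at all.
        if let Some(until) = self.retry_after {
            if Instant::now() < until {
                return Err(RpcError::Unreachable(self.target));
            }
            self.retry_after = None;
        }

        match self.inner.send_append_entries(payload) {
            Ok(resp) => {
                // Reset to an initial delay on success (value is arbitrary here).
                self.backoff = Duration::from_millis(50);
                Ok(resp)
            }
            Err(_) => {
                // Open a new backoff window and grow it, capped at some maximum.
                self.retry_after = Some(Instant::now() + self.backoff);
                self.backoff = (self.backoff * 2).min(Duration::from_secs(5));
                Err(RpcError::Unreachable(self.target))
            }
        }
    }
}
```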
We still hit the 50ms heartbeat timeout. Why would this wrapper let the app sleep? Doesn't `BackoffNetwork`'s `send_append_entries` have its own 50ms timeout to deal with, since it also implements the `RaftNetwork` trait?
Sorry for the late reply; the last few days have been hectic. That's a really neat idea: I never thought about that kind of layering, and I really like it! It would still give timeout errors in the logs, but it's a good stop-gap 😃.
I’m following your discussion, since we have a similar problem in our project - we don’t want to send unnecessary messages (and we don’t even want to send empty ping messages, see below).
We have a different method for keeping "liveness" info about a link between two nodes. As long as the link is deemed "live", any consensus domain communicating via this link (we host replicas of multiple/many consensus domains on the same node) can expect its messages to ultimately go through. As soon as the "liveness" property is no longer guaranteed, we can inform all affected consensus domains and thus, for example, let them start a new election after the timeout. I.e., instead of empty ping messages, we have an external liveness check.
I haven't looked into the openraft implementation yet, but this external liveness check could likely be integrated fairly simply in place of the traditional heartbeat. Then, as soon as the link is deemed down by the external agent (which can, for example, do its own heartbeat on behalf of multiple consensus domains), we can tell all affected local openraft replicas, so they can start their "usual" handling of a replication timeout (and start an election on a follower).
@Licenser Maybe you could live with something like this as well? Not sure about your use case. But, as I wrote, this is in a concept phase and isn't implemented yet; we wanted to start with simple openraft and initially use UDP directly to transport messages.
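(For illustration only, a rough and entirely hypothetical sketch of the external-liveness idea described above; none of these types exist in openraft, and the actual election handling is not shown.)

```rust
// One agent watches link state and fans "link down" events out to every local
// consensus domain, which can then fall back to its usual timeout/election
// handling. All names here are made up for illustration.
use std::collections::HashMap;
use std::sync::mpsc::Sender;

type LinkId = (u64, u64); // (local node, remote node)

enum LivenessEvent {
    LinkDown(LinkId),
    LinkUp(LinkId),
}

struct LivenessAgent {
    // Each consensus domain hosted on this node registers the links it uses
    // and a channel on which it wants to hear about liveness changes.
    subscribers: HashMap<LinkId, Vec<Sender<LivenessEvent>>>,
}

impl LivenessAgent {
    fn subscribe(&mut self, link: LinkId, tx: Sender<LivenessEvent>) {
        self.subscribers.entry(link).or_default().push(tx);
    }

    // Called by whatever does the real per-link heartbeat (one heartbeat per
    // link, regardless of how many consensus domains share it).
    fn report_link_down(&self, link: LinkId) {
        if let Some(domains) = self.subscribers.get(&link) {
            for tx in domains {
                // A receiving replica would treat this like a replication
                // timeout, e.g. a follower could start an election.
                let _ = tx.send(LivenessEvent::LinkDown(link));
            }
        }
    }
}
```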
After a night, it feels like `RaftNetworkFactory::connect()` returning a `Result` would be better. As you said, there may be misconfigured node addresses being used; in such cases, the raft has a chance to discover it early. But still, the raft can do nothing about such an error, except print an error log.

Using `Result` instead of `panic()` will let the cluster keep working even when there is only a minority of misconfigured nodes.
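(To make the contrast concrete, here is a heavily simplified, hypothetical factory trait; openraft's real `RaftNetworkFactory` is an async, generic trait, so treat this only as an illustration of the fallible signature.)

```rust
// Hypothetical, heavily simplified factory trait; not openraft's real API.
#[derive(Debug)]
struct ConnectError(String);

trait NetworkFactory {
    type Network;

    // Fallible: a misconfigured address is reported to the caller, which can
    // log it and schedule a retry with backoff, so a minority of bad nodes
    // doesn't take the whole process down the way a panic would.
    fn connect(&mut self, target: u64, addr: &str) -> Result<Self::Network, ConnectError>;
}
```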