openraft: Troubleshoot a Higher Vote Error
I recently upgraded from an earlier alpha to v0.6.4. I worked out most of the kinks, but I’m seeing an error like this in my integration tests after a new node joined the cluster.
116.603607178s ERROR ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
116.603669397s WARN ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
116.603761513s INFO ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: ddx_core::membership: close time.busy=28.8ms time.idle=2.34s
116.603844639s INFO ThreadId(03) Node{id=0}:run_raft{node_id=0}: openraft::core: leader recv from replication_rx: "RevertToFollower: target: 2, vote: vote:2-2"
Do you have a rough idea of what could be causing this, to help me troubleshoot? I want to configure the cluster to be as stable as possible (i.e., err on the side of no re-election unless the leader goes offline for longer than the timeout).
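For context, this is roughly the direction I'm leaning for the timing settings — a minimal sketch assuming openraft's Config exposes heartbeat_interval and election_timeout_min/max as in the current docs (the exact fields or builder may differ in the 0.6.x line). The idea is to keep the heartbeat interval several times smaller than the minimum election timeout so a single delayed heartbeat doesn't trigger a campaign:

```rust
use openraft::Config;

/// Timing knobs sketched to favor stability: heartbeats arrive well within
/// the election-timeout window, so followers should not start an election
/// while the leader is healthy. Field names follow current openraft docs
/// and may differ in 0.6.x.
fn raft_config() -> Config {
    Config {
        heartbeat_interval: 250,      // ms between leader heartbeats
        election_timeout_min: 1_500,  // ms a follower waits before campaigning
        election_timeout_max: 3_000,
        ..Default::default()
    }
    .validate()
    .expect("invalid raft config")
}
```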
@fredfortier A crashed follower coming back online is totally OK. But a blocking RaftStorage is a different story: the follower started a timer when it began running, then blocked on RaftStorage; by the time it noticed the timer had fired, the follower believed the leader was gone.
Maybe there is a solution to this problem: when an election-timeout event and an append-entries event are both triggered, a follower should only handle the latter. Let me see.
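To illustrate the point above, here is a minimal sketch of keeping a slow synchronous write off the async runtime with tokio's spawn_blocking, so the RaftCore task can keep servicing its timers. The LogStore type and write_sync method are hypothetical stand-ins, not part of openraft's RaftStorage trait:

```rust
use std::{io, sync::Arc};

/// Hypothetical stand-in for whatever performs synchronous, fsync-heavy
/// disk writes inside a RaftStorage implementation.
struct LogStore;

impl LogStore {
    fn write_sync(&self, _bytes: Vec<u8>) -> io::Result<()> {
        // Imagine tens of milliseconds of blocking disk I/O here.
        Ok(())
    }
}

/// Called from an async storage method: moving the blocking write onto
/// tokio's blocking thread pool lets the RaftCore task keep polling its
/// election/heartbeat timers instead of stalling until the write returns.
async fn append_entry(store: Arc<LogStore>, bytes: Vec<u8>) -> io::Result<()> {
    tokio::task::spawn_blocking(move || store.write_sync(bytes))
        .await
        .expect("blocking write task panicked")
}
```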
Yes, I mean when installing the snapshot on the follower/learner side. The leader will send the snapshot data in another tokio task and will never block RaftCore.
It seems like node-2 entered the Candidate state and raised its term to elect itself.
Debug-level logs will show more detail about what happened, e.g., something like "timeout to recv a event, change to CandidateState". Do you have a debug-level log to find out why this happened?
@ppamorim had a similar issue to yours: the replication to one node paused for some reason (too large a payload and skewed clock times on different nodes), with the latest branch main.
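For capturing those debug-level logs in the integration tests, a minimal sketch using tracing_subscriber — assuming the tests already go through the tracing ecosystem, which the span-annotated output above suggests:

```rust
use tracing_subscriber::EnvFilter;

/// Enable debug-level openraft output in the test binary.
/// Requires tracing_subscriber's "env-filter" feature.
fn init_tracing() {
    tracing_subscriber::fmt()
        .with_env_filter(
            // Honor RUST_LOG if set, otherwise default to debug for openraft only.
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info,openraft=debug")),
        )
        .init();
}
```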