openraft: Troubleshoot a Higher Vote Error

I recently upgraded from an earlier alpha to v0.6.4. I worked out most of the kinks, but I’m seeing an error like this in my integration tests after a new node joins the cluster.

 116.603607178s ERROR ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
 116.603669397s  WARN ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
 116.603761513s  INFO ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: ddx_core::membership: close time.busy=28.8ms time.idle=2.34s
 116.603844639s  INFO ThreadId(03) Node{id=0}:run_raft{node_id=0}: openraft::core: leader recv from replication_rx: "RevertToFollower: target: 2, vote: vote:2-2"

Do you have a rough idea of what could be causing this to help me troubleshoot? I want to configure the cluster to be as stable as possible (err on the side of no re-election unless the leader goes offline for longer than the timeout).
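
For context, here is roughly how I am thinking of tuning the timeouts. This is only a sketch: I'm assuming the Config fields heartbeat_interval, election_timeout_min, and election_timeout_max (all in milliseconds), and the exact way of constructing/validating Config may differ between releases.

```rust
use openraft::Config;

// Sketch only: field names assume openraft v0.6.x's Config.
// Keep the election timeout window well above the heartbeat interval,
// so a follower only starts an election after missing several heartbeats.
let config = Config {
    heartbeat_interval: 250,    // leader sends append-entries heartbeats every 250ms
    election_timeout_min: 1500, // follower tolerates at least 1.5s of silence...
    election_timeout_max: 3000, // ...and at most 3s before becoming a candidate
    ..Default::default()
};
```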

Most upvoted comments

In this case, the append-entries heartbeats sent by the leader will fail. I expect the leader to catch the follower up when it comes back online, but not for the other followers to trigger a re-election. Please let me know if I’m misunderstanding the basic premise.

@fredfortier A crashed follower coming back online is totally OK. But a blocking RaftStorage is a different story: the follower starts an election timer when it begins running, then blocks on RaftStorage; by the time it resumes, the timer has already fired, and at that point the follower believes the leader is gone.

Maybe there is a solution to this problem: when an election-timeout event and an append-entries event are both pending, a follower should only handle the latter. Let me see.
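
In the meantime, one way to avoid starving the election timer is to keep your RaftStorage methods from blocking the runtime thread, e.g. by pushing blocking I/O onto tokio's blocking pool. A rough sketch, using a hypothetical write_blob_sync helper that is not part of openraft:

```rust
use tokio::task;

// Hypothetical helper, not part of openraft: synchronous, potentially slow I/O.
fn write_blob_sync(path: &std::path::Path, bytes: &[u8]) -> std::io::Result<()> {
    std::fs::write(path, bytes)
}

// Inside an async storage method, avoid calling write_blob_sync directly;
// that blocks the runtime thread and can let the election timer fire.
// Instead, move the blocking call onto tokio's blocking thread pool:
async fn persist(path: std::path::PathBuf, bytes: Vec<u8>) -> std::io::Result<()> {
    task::spawn_blocking(move || write_blob_sync(&path, &bytes))
        .await
        .expect("blocking task panicked")
}
```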

Do you mean RaftCore in the learner/follower node catching up, or RaftCore in the leader node serving the snapshot? I assume the former, but the latter would definitely be an issue.

Yes, I mean when installing the snapshot on the follower/learner side. The leader will send the snapshot data in another tokio task and will never block RaftCore.

It seems like node-2 entered the Candidate state and raised its term to elect itself.

Debug-level logs will show more detail about what happened, e.g., something like "timeout to recv a event, change to CandidateState". Do you have debug-level logs to find out why this happened?
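
If you initialize logging with tracing-subscriber, something like the following should surface them. This assumes the tracing-subscriber crate with its "env-filter" feature; setting the RUST_LOG environment variable works as well, depending on how your subscriber is set up.

```rust
use tracing_subscriber::EnvFilter;

// Enable debug-level output for openraft only; adjust the filter string as needed.
tracing_subscriber::fmt()
    .with_env_filter(EnvFilter::new("openraft=debug"))
    .init();
```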

@ppamorim had a similar issue to yours on the latest main branch: replication to one node paused for some reason (too large a payload and skewed clocks on the different nodes).
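
If an oversized payload turns out to be part of the trigger here as well, capping how many log entries go into each replication request is one knob to try. A minimal sketch, assuming your version of Config exposes a max_payload_entries field (the field name and construction API may differ between releases):

```rust
use openraft::Config;

// Sketch only: assumes a max_payload_entries field on openraft's Config.
let config = Config {
    max_payload_entries: 64, // replicate at most 64 entries per append-entries request
    ..Default::default()
};
```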