openraft: Troubleshoot a Higher Vote Error
I recently upgraded from an earlier alpha to v0.6.4. I worked out most of the kinks, but I’m seeing an error like this in my integration tests after a new node joined the cluster.
116.603607178s ERROR ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
116.603669397s WARN ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: openraft::replication: error replication to target=2 error=seen a higher vote: vote:2-2 GT mine: vote:1-0
116.603761513s INFO ThreadId(03) Node{id=0}:spawn{service_name=MembershipManager}:apply_membership: ddx_core::membership: close time.busy=28.8ms time.idle=2.34s
116.603844639s INFO ThreadId(03) Node{id=0}:run_raft{node_id=0}: openraft::core: leader recv from replication_rx: "RevertToFollower: target: 2, vote: vote:2-2"
Do you have a rough idea of what could be causing this, to help me troubleshoot? I want to configure the cluster to be as stable as possible (i.e., err on the side of no re-election unless the leader goes offline for longer than the timeout).
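For context, this is roughly the direction I'm leaning for the timing settings — a minimal sketch assuming openraft's Config exposes heartbeat_interval and election_timeout_min/max as in the current docs (the exact fields or builder may differ in the 0.6.x line). The idea is to keep the heartbeat interval several times smaller than the minimum election timeout so a single delayed heartbeat doesn't trigger a campaign:

```rust
use openraft::Config;

/// Timing knobs sketched to favor stability: heartbeats arrive well within
/// the election-timeout window, so followers should not start an election
/// while the leader is healthy. Field names follow current openraft docs
/// and may differ in 0.6.x.
fn raft_config() -> Config {
    Config {
        heartbeat_interval: 250,      // ms between leader heartbeats
        election_timeout_min: 1_500,  // ms a follower waits before campaigning
        election_timeout_max: 3_000,
        ..Default::default()
    }
    .validate()
    .expect("invalid raft config")
}
```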
@fredfortier A crashed follower coming back online is totally OK. But a blocking RaftStorage is a different story: the follower started a timer when it began running, then blocked on RaftStorage; by the time it noticed the timer had fired, the follower believed the leader was gone.
Maybe there is a solution to this problem: when an election-timeout event and an append-entries event are both triggered, a follower should only handle the latter. Let me see.
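To illustrate the point above, here is a minimal sketch of keeping a slow synchronous write off the async runtime with tokio's spawn_blocking, so the RaftCore task can keep servicing its timers. The LogStore type and write_sync method are hypothetical stand-ins, not part of openraft's RaftStorage trait:

```rust
use std::{io, sync::Arc};

/// Hypothetical stand-in for whatever performs synchronous, fsync-heavy
/// disk writes inside a RaftStorage implementation.
struct LogStore;

impl LogStore {
    fn write_sync(&self, _bytes: Vec<u8>) -> io::Result<()> {
        // Imagine tens of milliseconds of blocking disk I/O here.
        Ok(())
    }
}

/// Called from an async storage method: moving the blocking write onto
/// tokio's blocking thread pool lets the RaftCore task keep polling its
/// election/heartbeat timers instead of stalling until the write returns.
async fn append_entry(store: Arc<LogStore>, bytes: Vec<u8>) -> io::Result<()> {
    tokio::task::spawn_blocking(move || store.write_sync(bytes))
        .await
        .expect("blocking write task panicked")
}
```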
Yes, I mean when installing the snapshot on the follower/learner side. The leader will send the snapshot data in another tokio task and will never block RaftCore.
It seems like node-2 entered the Candidate state and raised its term to elect itself.
Debug-level logs will show more detail about what happened, e.g., something like "timeout to recv a event, change to CandidateState". Do you have a debug-level log to find out why this happened?
@ppamorim had a similar issue to yours: the replication to one node paused for some reason (too large a payload and skewed clock times on different nodes), with the latest branch main.
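For capturing those debug-level logs in the integration tests, a minimal sketch using tracing_subscriber — assuming the tests already go through the tracing ecosystem, which the span-annotated output above suggests:

```rust
use tracing_subscriber::EnvFilter;

/// Enable debug-level openraft output in the test binary.
/// Requires tracing_subscriber's "env-filter" feature.
fn init_tracing() {
    tracing_subscriber::fmt()
        .with_env_filter(
            // Honor RUST_LOG if set, otherwise default to debug for openraft only.
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info,openraft=debug")),
        )
        .init();
}
```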