habitat: Elections can hang when electorate grows

tl;dr

Elections in habitat require a quorum of alive members in order to complete. When members join a service group, they drive the size of the total electorate higher, meaning you need more alive members to reach quorum. The result is elections that fail to complete for lack of quorum, even when they “should” complete from a human perspective.

tl;dr - the fix

There is no automatic way to resolve this situation. We propose implementing a command that lets an administrator inform a service group that it should mark all of its confirmed members as no longer part of the electorate.

The details

Quorum is calculated from two inputs - the electorate (the total number of members who have the service group running) and the alive members (the number of members who have the service group running and are also marked as ‘alive’).

To understand the problem, let’s start with the minimum valid electorate size of 3:

Member Status Leader
1 Alive N
2 Alive N
3 Alive Y

Now, let’s say member 3 goes down:

Member Status Leader
1 Alive N
2 Alive N
3 Confirmed NA

Under the hood, we have an electorate of 3 and an alive-member count of 2. The formula for a quorum is alive_members >= ((electorate/2)+1), using integer division - so 2 >= ((3/2)+1), or 2 >= 2, quorum holds, and we wind up with an election. The state is now:

Member Status Leader
1 Alive N
2 Alive Y
3 Confirmed NA
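To make the arithmetic concrete, here is a minimal sketch of that quorum check in Rust (the function name is ours, not habitat’s actual implementation):

```rust
// Quorum requires strictly more than half of the electorate to be alive.
// Integer division means an electorate of 3 needs (3/2)+1 = 2 alive members.
fn has_quorum(alive_members: usize, electorate: usize) -> bool {
    alive_members >= (electorate / 2) + 1
}

fn main() {
    // The state above: electorate of 3, members 1 and 2 alive.
    assert!(has_quorum(2, 3)); // 2 >= 2, so the election completes
}
```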

Now, time goes by - and our administration team helpfully launches another member to take the place of dear-departed member 3.

Member Status Leader
1 Alive N
2 Alive Y
3 Confirmed NA
4 Alive N

Suddenly, the difficulty of being a leader amongst the ruffians in the service group gets to be too much for member 2:

Member Status Leader
1 Alive N
2 Confirmed NA
3 Confirmed NA
4 Alive N

So we now have 2 >= ((4/2)+1), or 2 >= 3, and we fail to achieve quorum - there is no way to know whether we are simply on one side of a 2x2 net split. So we add a few more members, because we don’t want to be in this situation again:

Member Status Leader
1 Alive Y
2 Confirmed NA
3 Confirmed NA
4 Alive N
5 Alive N
6 Alive N
7 Alive N

Everything is fine - 5 >= 4, so we have quorum. Suddenly, we lose 4 members:

Member Status Leader
1 Confirmed NA
2 Confirmed NA
3 Confirmed NA
4 Confirmed NA
5 Confirmed NA
6 Confirmed NA
7 Alive N

Thinking we know what to do, we add 4 more:

Member Status Leader
1 Confirmed NA
2 Confirmed NA
3 Confirmed NA
4 Confirmed NA
5 Confirmed NA
6 Confirmed NA
7 Alive N
8 Alive N
9 Alive N
10 Alive N
11 Alive N

And arithmetic rears its ugly head - 5 >= 6 is false, and we fail again. And so it goes. Obviously, if we had kept trying to maintain a cluster size of 3, this would have happened much sooner.
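A toy loop using the same quorum formula shows how quickly the replacement cycle collapses for a 3-member group (all names here are ours):

```rust
fn has_quorum(alive: usize, electorate: usize) -> bool {
    alive >= (electorate / 2) + 1
}

fn main() {
    // Simulate the cycle above: a member dies permanently, and we launch
    // a replacement. Alive membership stays flat, but every replacement
    // inflates the electorate by one more Confirmed (dead) member.
    let mut alive: usize = 3;
    let mut electorate: usize = 3;
    for cycle in 1.. {
        alive -= 1; // a member dies and is eventually marked Confirmed
        if !has_quorum(alive, electorate) {
            println!(
                "cycle {}: quorum fails ({} alive, {} required)",
                cycle, alive, (electorate / 2) + 1
            );
            break;
        }
        alive += 1;      // we launch a replacement member...
        electorate += 1; // ...but the dead member still counts
    }
    // Prints: cycle 2: quorum fails (2 alive, 3 required)
}
```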

Mitigations we discussed but dismissed

Persistent Supervisors

Making supervisors persistent helps here, in that it’s much harder to get into this situation if members actually come back. They’ll pick up where they left off, mark themselves alive again, and the electorate doesn’t inflate arbitrarily. It doesn’t solve the problem, though - as members die permanently, you’ll eventually have to raise the size of the service group to keep up.

Garbage Collection

We can garbage collect confirmed members. This is probably a good idea, but the question is - after how long? Any aggressive setting runs the risk of creating a scenario where short-lived net splits protect you from a split brain, but long-lived ones (longer than whatever your threshold is) do not. That is unacceptable, since a split-brain scenario is the worst case. I think we probably do implement garbage collection, but with a threshold somewhere in the 72-96 hour range.
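As a sketch of what such a threshold check might look like (the types, names, and 72-hour constant are our assumptions, not habitat’s):

```rust
use std::time::{Duration, SystemTime};

// Assumed threshold from the discussion above: collect Confirmed members
// only after 72 hours, so ordinary net splits still count against quorum.
const GC_THRESHOLD: Duration = Duration::from_secs(72 * 60 * 60);

// Hypothetical member record: `confirmed_at` is set when the member is
// marked Confirmed (i.e. believed dead).
struct Member {
    confirmed_at: Option<SystemTime>,
}

fn is_collectable(member: &Member, now: SystemTime) -> bool {
    match member.confirmed_at {
        Some(when) => now
            .duration_since(when)
            .map(|age| age >= GC_THRESHOLD)
            .unwrap_or(false),
        None => false, // still alive - never collect
    }
}

fn main() {
    let now = SystemTime::now();
    let long_dead = Member { confirmed_at: Some(now - Duration::from_secs(96 * 3600)) };
    let alive = Member { confirmed_at: None };
    assert!(is_collectable(&long_dead, now));
    assert!(!is_collectable(&alive, now));
}
```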

The proposed fix

We can add a mechanism for informing a service group that it should consider all currently confirmed members as no longer valid parts of the electorate. Upon receipt of the rumor, a member would mark the service entries of all of its confirmed members as no longer part of the electorate.

hab sup mark confirmed redis.default
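A rough sketch of what applying that rumor might look like on a member (the Departed state and all names here are our assumptions, anticipating the ‘departed’ field discussed in the comments below):

```rust
#[derive(Debug, PartialEq)]
enum Health {
    Alive,
    Confirmed,
    Departed,
}

struct Member {
    id: u32,
    health: Health,
}

// On receipt of the hypothetical "mark confirmed" rumor for its service
// group, a member flags every Confirmed peer as Departed, removing it
// from the electorate that quorum is computed against.
fn apply_mark_confirmed(members: &mut [Member]) {
    for m in members.iter_mut() {
        if m.health == Health::Confirmed {
            m.health = Health::Departed;
        }
    }
}

fn main() {
    let mut members = vec![
        Member { id: 1, health: Health::Alive },
        Member { id: 2, health: Health::Confirmed },
        Member { id: 3, health: Health::Confirmed },
    ];
    apply_mark_confirmed(&mut members);
    for m in &members {
        println!("member {} is {:?}", m.id, m.health);
    }
    let electorate = members.iter().filter(|m| m.health != Health::Departed).count();
    assert_eq!(electorate, 1); // only the Alive member still counts
}
```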

In the meantime

We recommend using persistent supervisors in concert with the leader topology, and working hard to bring downed members back from the dead.

Thanks

This issue was first recognized by @moretea, who helpfully created a video and a gist on how to reproduce:

https://gist.github.com/moretea/f69f71b342a1d10d7ee05e52dafffd6a
https://www.youtube.com/watch?v=cVIH87B-ZO0

Thanks!

Most upvoted comments

For any future implementors:

  1. Members need to have a ‘departed’ field added.
  2. On graceful shutdown, we set our member rumor to departed and share the rumor aggressively. This will delay shutdown somewhat.
  3. Any member, departed or not, should be garbage collected after 72 hours.
  4. Any member can be forced to depart with hab-butterfly member depart MEMBERID
  5. When a quorum fails to be reached, we should print a message telling you which members you could force to depart to restore quorum (see the sketch below).
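For point 5, a small sketch of the arithmetic behind that message, using the quorum formula from the writeup (the function name is ours):

```rust
// How many Confirmed members must be departed before the remaining alive
// members satisfy alive >= (electorate / 2) + 1 again?
fn members_to_depart(alive: usize, electorate: usize) -> usize {
    let mut shrunk = electorate;
    while shrunk > alive && alive < (shrunk / 2) + 1 {
        shrunk -= 1; // depart one more Confirmed member
    }
    electorate - shrunk
}

fn main() {
    // The 11-member example above: 5 alive, quorum needs 6. Departing 2 of
    // the 6 Confirmed members shrinks the electorate to 9, and
    // 5 >= (9/2)+1 = 5 holds again.
    assert_eq!(members_to_depart(5, 11), 2);
    // If quorum already holds, nothing needs to be departed.
    assert_eq!(members_to_depart(2, 3), 0);
}
```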