bookkeeper: Deferred failure handling can cause data loss
The bookkeeper client has a feature where if you have a ledger with a write quorum(Qw) larger than an ack quorum(Qa), such as 3:3:2, if (Qw-Qa) bookies return an error, after the entry add has completed, the erroring bookie will be replaced in the ensemble in the background.
This can cause data loss.
Consider a 3:3:2 ledger. Assume zookeeper is not accepting writes.
Start ensemble is b1,b2,b3
- e0 is written to b1,b2,b3
- b2 & b3 acknowledge, e0 request completes
- b1 responds with and error and is added to the delayed ensemble change list
- e1 is written, delayed error handling kicks off, b1 is replaced, so e1 is written to b4,b2,b3 while the client tries to update the metadata in zookeeper. zookeeper is stalled so fails to respond.
- repeat this sequence for each of b2 and b3.
As each bookie fails, it will be replaced and writes will be acknowledged. Eventually the ensembles will look something like
0: b1,b2,b3
1: b4,b2,b3
3: b4,b5,b3
4: b4,b5,b6
How ever, this is only local, zookeeper still only has the initial ensemble. So, even though all entries from 4 onwards are acknowledged successfully to the client, if another client comes to read the ledger, they will not see them. If the other client recovers the ledger, the data is lost. TOAB violation.
Here’s a test case which triggers the issue: https://github.com/ivankelly/bookkeeper/blob/15bc251d46d5cd5fcceef130c0046eeacbe446cc/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/TestDeferredFailure.java
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (14 by maintainers)
Commits related to this issue
- Delayed write ensemble change may cause dataloss Descriptions of the changes in this PR: The Original intent of this change is to do a best-effort ensemble change. But this is not possible until the... — committed to apache/bookkeeper by jvrao 6 years ago
- Delayed write ensemble change may cause dataloss Descriptions of the changes in this PR: The Original intent of this change is to do a best-effort ensemble change. But this is not possible until the... — committed to apache/bookkeeper by jvrao 6 years ago
- Dataloss w 5344681 (#149) * Avoid releasing sent buffer to early in BookieClient mock If the buffer was sent to more than one bookie with the mock, it would be released after being sent to the fi... — committed to reddycharan/bookkeeper by jvrao 6 years ago
I think a quick fix is handleDelayedWriteBookieFailure should increment
blockAddCompletions
to block completions.