google-cloud-node: bigtable: retry partially failed reads and writes

My calls to Bigtable getRows() fail intermittently, so I have had to wrap all of these calls in retry blocks (roughly the sketch below). I was wondering:

  1. Is this flakiness expected?
  2. Have you considered adding a retry mechanism to these RPCs?
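For reference, each call currently ends up wrapped in something like this (a simplified sketch; the helper name and backoff values are just illustrative, not part of the library):

```js
// Generic retry wrapper with exponential backoff. It retries the *entire*
// getRows() call on any error, which is the workaround described above.
async function withRetries(fn, attempts = 5, baseDelayMs = 100) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Back off before the next attempt: 100ms, 200ms, 400ms, ...
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}

// Usage, assuming a promise-returning getRows() on a Table instance:
// const [rows] = await withRetries(() => table.getRows(options));
```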

Thanks!

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 22 (21 by maintainers)

Most upvoted comments

FWIW, this behavior (partial failure in batch operations) is expected from the Bigtable perspective. A bulk read or write operation can affect many rows, and some of the reads or writes may succeed while others fail, because different parts of the bulk request may go to different backing Bigtable servers, some of which may be busy, unavailable, or simply time out. Bigtable does not provide atomicity guarantees across multiple rows, so any single operation within the batch can succeed or fail independently of the others.

However, these are typically not permanent errors, so they should be retried. As an optimization, rather than retrying the entire batch request, the client library needs to iterate over the per-entry response statuses and retry only the ones that were marked as having failed or timed out (roughly as sketched below). This is precisely what we do in other Bigtable client libraries.
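A minimal sketch of that per-entry retry loop, assuming a generic `sendBatch(entries)` that resolves to per-entry statuses of the shape `{index, code}` (these names are illustrative and not the actual client API):

```js
// gRPC codes commonly treated as transient, retryable failures.
const RETRYABLE_CODES = new Set([
  4,  // DEADLINE_EXCEEDED
  14, // UNAVAILABLE
]);

async function sendWithPartialRetries(sendBatch, entries, maxAttempts = 3) {
  let pending = entries;
  for (let attempt = 0; attempt < maxAttempts && pending.length > 0; attempt++) {
    const statuses = await sendBatch(pending); // one {index, code} per entry
    const failed = statuses.filter(s => s.code !== 0); // 0 = OK
    const permanent = failed.filter(s => !RETRYABLE_CODES.has(s.code));
    if (permanent.length > 0) {
      throw new Error(`${permanent.length} entries failed with non-retryable errors`);
    }
    // Re-send only the entries whose status was a retryable failure.
    pending = failed.map(s => pending[s.index]);
    if (pending.length > 0) {
      // Simple exponential backoff between attempts.
      await new Promise(resolve => setTimeout(resolve, 100 * 2 ** attempt));
    }
  }
  if (pending.length > 0) {
    throw new Error(`${pending.length} entries still failing after ${maxAttempts} attempts`);
  }
}
```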

The upside is that even with the occasional retries, the overall performance is much higher than with a single read or write operation per API call.

/cc: @garye, @sduskis

@arbesfeld sorry, we’ve been pretty busy with other items, but I’m going to try and get on this within the next week or so.

You are free to implement this in your application, but it’s something we will eventually support in this library.