bazel: remote/performance: If remote cache is inaccessible, fall back to building without the cache, rather than failing the build

Description of the problem / feature request / question:

Feature request:

I was able to get caching to work with --spawn_strategy=remote --rest_cache_url=.... It works well, but if the cache is inaccessible for any reason (e.g., I have gone offline and am working while commuting, or the server has gone down), my builds fail.

Of course, I can change the options I use to launch Bazel, but that isn't always practical. For one thing, my company has quite a lot of developers, and I would prefer that they not all have to learn this workaround. Secondly, in our automated Jenkins builds, launching with different command-line arguments isn't an option.

What I have hacked together for our own use is some changes to Bazel so that:

  • Each time an error occurs trying to read or write the remote cache, it displays a short warning message, but continues the build. (get operations pretend the item was not found in the cache; put operations pretend the operation succeeded.)
  • After ten consecutive such errors with no intervening successful cache accesses, Bazel displays a message that says, “Cache encountered multiple consecutive errors; disabling cache for 5 minutes.”

My code for this was quick and dirty, so it's not really in a shareable state, but it was easy to write.
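For illustration, here is a minimal Java sketch of that behavior. It is not the author's actual code; the `RemoteCache` interface, method names, and thresholds are hypothetical stand-ins for whatever the real cache client exposes:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical wrapper around a remote cache client that degrades gracefully:
 * reads report a miss on error, writes pretend to succeed, and after ten
 * consecutive errors the cache is bypassed for five minutes.
 */
final class TolerantRemoteCache {
  private final RemoteCache delegate;           // assumed interface, see below
  private int consecutiveErrors = 0;
  private Instant disabledUntil = Instant.MIN;

  TolerantRemoteCache(RemoteCache delegate) {
    this.delegate = delegate;
  }

  byte[] get(String key) {
    if (bypassed()) {
      return null;                              // treat as a cache miss
    }
    try {
      byte[] value = delegate.get(key);
      consecutiveErrors = 0;                    // any success resets the count
      return value;
    } catch (Exception e) {
      recordError(e);
      return null;                              // pretend the item was not found
    }
  }

  void put(String key, byte[] value) {
    if (bypassed()) {
      return;                                   // pretend the write succeeded
    }
    try {
      delegate.put(key, value);
      consecutiveErrors = 0;
    } catch (Exception e) {
      recordError(e);                           // swallow the error, keep building
    }
  }

  private boolean bypassed() {
    return Instant.now().isBefore(disabledUntil);
  }

  private void recordError(Exception e) {
    System.err.println("WARNING: remote cache error: " + e.getMessage());
    if (++consecutiveErrors >= 10) {
      System.err.println(
          "Cache encountered multiple consecutive errors; disabling cache for 5 minutes.");
      disabledUntil = Instant.now().plus(Duration.ofMinutes(5));
      consecutiveErrors = 0;
    }
  }

  /** Hypothetical minimal cache interface. */
  interface RemoteCache {
    byte[] get(String key) throws Exception;
    void put(String key, byte[] value) throws Exception;
  }
}
```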

It would make sense for this to work regardless of which remote caching protocol is used (REST API, gRPC, etc.).

Environment info

  • Bazel version (output of bazel info release): 0.4.5

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 9
  • Comments: 24 (16 by maintainers)

Most upvoted comments

@benjaminp has prepared a change [1] that, once merged, will not attempt to upload to the remote cache if the lookup failed. So that solves half the problem. Retries will still happen, though, making the build really slow.

However, you can set --experimental_remote_retry=false and then the build should be quick even if the remote cache is down once in a while.

I already have an idea of how to make this happen with retries, but it will take a while until I have time to implement it.

[1] https://bazel-review.googlesource.com/c/15070

We’ve improved the error handling for 0.5.3 (upcoming). I think it’s correctly falling back now.

I have a prototype implementation of a circuit breaker in our retry logic [1]. Will share soon.

[1] https://martinfowler.com/bliki/CircuitBreaker.html
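As a rough illustration of the pattern from [1] (this is not the actual prototype; the state machine shape follows Fowler's article, and the thresholds are arbitrary, not Bazel's real values), a retrier-integrated circuit breaker could look like this:

```java
import java.util.concurrent.Callable;

/**
 * Illustrative circuit breaker wrapped around retry logic, after Fowler's
 * pattern: CLOSED passes calls through with retries, OPEN skips the network
 * entirely for a cooldown period, HALF_OPEN allows one trial call.
 */
final class CircuitBreakerRetrier {
  private enum State { CLOSED, OPEN, HALF_OPEN }

  private static final int FAILURE_THRESHOLD = 10;        // arbitrary
  private static final long OPEN_MILLIS = 5 * 60 * 1000;  // arbitrary
  private static final int MAX_RETRIES = 5;               // arbitrary

  private State state = State.CLOSED;
  private int failures = 0;
  private long openedAt = 0;

  <T> T execute(Callable<T> call, T fallback) {
    if (state == State.OPEN) {
      if (System.currentTimeMillis() - openedAt < OPEN_MILLIS) {
        return fallback;                 // circuit open: skip the network entirely
      }
      state = State.HALF_OPEN;           // cooldown elapsed: allow one trial call
    }
    int attempts = (state == State.HALF_OPEN) ? 1 : MAX_RETRIES;
    for (int i = 0; i < attempts; i++) {
      try {
        T result = call.call();
        state = State.CLOSED;            // success closes the circuit
        failures = 0;
        return result;
      } catch (Exception e) {
        failures++;
      }
    }
    if (failures >= FAILURE_THRESHOLD || state == State.HALF_OPEN) {
      state = State.OPEN;                // trip: stop hitting the cache for a while
      openedAt = System.currentTimeMillis();
    }
    return fallback;
  }
}
```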

@mmorearty Your analysis is spot on.

@ulfjack @mmorearty When --remote_upload_local_results is enabled, I propose printing a warning (one in total) if one or more uploads fail, instead of failing the build (see also #3368).
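A minimal sketch of such a warn-once guard, assuming a hypothetical onUploadFailure hook (this is not the proposed patch):

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative "warn once in total" guard for failed cache uploads. */
final class UploadWarner {
  private static final AtomicBoolean warned = new AtomicBoolean(false);

  static void onUploadFailure(Exception e) {
    // Print the warning only for the first failed upload; later failures
    // are silently ignored so the build is neither failed nor spammed.
    if (warned.compareAndSet(false, true)) {
      System.err.println(
          "WARNING: writing to the remote cache failed; the build will continue"
              + " without uploading results (" + e.getMessage() + ")");
    }
  }
}
```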

There's another problem, however. Currently, if the remote cache lookup fails, we retry with exponential backoff. If the remote cache is down, this adds a six-second retry period per action before we even attempt to build locally, which will slow down any build significantly. It's possible to disable the retry mechanism via a command-line flag, but at some point combining all these flags gets too complicated, and I'm also not sure we want to eventually make this flag stable (it's experimental right now).

I think we should either make this error non-retryable or introduce a mechanism that stops retrying after N attempts have failed with such an error.
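As a sketch of the first option, a retrier could classify "cache unreachable" errors as non-retryable and fail fast to local execution. The exception types, retry count, and backoff values below are assumptions, not Bazel's actual ones; note that with five retries and a 200 ms initial delay that doubles each time, the sleeps alone sum to 200 + 400 + 800 + 1600 + 3200 ms ≈ 6.2 s, consistent with the six seconds mentioned above:

```java
import java.net.ConnectException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/**
 * Illustrative retrier that treats "cache unreachable" errors as
 * non-retryable, so an offline cache costs one failed attempt per
 * action instead of a full exponential-backoff cycle.
 */
final class ClassifyingRetrier {
  private static final int MAX_RETRIES = 5;               // assumed
  private static final long INITIAL_BACKOFF_MILLIS = 200; // assumed

  static boolean isRetryable(Exception e) {
    // Connection refused / unknown host means the cache is down or we are
    // offline: retrying will not help, so fail fast to local execution.
    return !(e instanceof ConnectException || e instanceof UnknownHostException);
  }

  static <T> T execute(Callable<T> call) throws Exception {
    long backoff = INITIAL_BACKOFF_MILLIS;
    for (int attempt = 0; ; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        if (!isRetryable(e) || attempt >= MAX_RETRIES) {
          throw e;                       // caller falls back to local execution
        }
        Thread.sleep(backoff);           // 200, 400, 800, 1600, 3200 ms ≈ 6.2 s total
        backoff *= 2;
      }
    }
  }
}
```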