bazel: remote/performance: If remote cache is inaccessible, fall back to building without the cache, rather than failing the build
Description of the problem / feature request / question:
Feature request:
I was able to get caching to work with `--spawn_strategy=remote --rest_cache_url=...`. It works well, but if the cache is inaccessible for any reason (e.g. I have gone offline and am working while commuting, or the server has gone down), then my builds fail.
Of course, I can change the options I’m using to launch Bazel; but that isn’t always a good option. For one thing, my company has quite a lot of developers, and I would prefer that they not all have to learn this workaround. Secondly, in our automated Jenkins builds, launching with different command-line arguments isn’t an option.
What I have hacked together for our own use is some changes to Bazel so that:
- Each time an error occurs trying to read or write the remote cache, it displays a short warning message, but continues the build. (`get` operations pretend the item was not found in the cache; `put` operations pretend the operation succeeded.)
- After ten consecutive such errors with no intervening successful cache accesses, Bazel displays a message that says, “Cache encountered multiple consecutive errors; disabling cache for 5 minutes.”
My code for this was pretty quick-and-dirty, so it’s not really in a shareable state, but it was pretty easy to write; a rough sketch of the idea follows.
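A minimal sketch of what such a fail-open wrapper could look like, assuming a hypothetical `RemoteCacheClient` interface (all names here are illustrative, not Bazel’s actual internals):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical interface standing in for whatever cache client is in use.
interface RemoteCacheClient {
  Optional<byte[]> get(String key) throws Exception;
  void put(String key, byte[] value) throws Exception;
}

/** Wraps a cache so that errors degrade to cache misses instead of build failures. */
final class FailOpenCache implements RemoteCacheClient {
  private static final int MAX_CONSECUTIVE_ERRORS = 10;
  private static final Duration DISABLE_PERIOD = Duration.ofMinutes(5);

  private final RemoteCacheClient delegate;
  private int consecutiveErrors = 0;
  private Instant disabledUntil = Instant.MIN;

  FailOpenCache(RemoteCacheClient delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized Optional<byte[]> get(String key) {
    if (isDisabled()) {
      return Optional.empty(); // Pretend the item was not found.
    }
    try {
      Optional<byte[]> result = delegate.get(key);
      consecutiveErrors = 0;
      return result;
    } catch (Exception e) {
      recordError(e);
      return Optional.empty(); // Errors also look like cache misses.
    }
  }

  @Override
  public synchronized void put(String key, byte[] value) {
    if (isDisabled()) {
      return; // Pretend the upload succeeded.
    }
    try {
      delegate.put(key, value);
      consecutiveErrors = 0;
    } catch (Exception e) {
      recordError(e); // Swallow the error; the build continues.
    }
  }

  private boolean isDisabled() {
    return Instant.now().isBefore(disabledUntil);
  }

  private void recordError(Exception e) {
    System.err.println("WARNING: remote cache error: " + e.getMessage());
    if (++consecutiveErrors >= MAX_CONSECUTIVE_ERRORS) {
      System.err.println(
          "Cache encountered multiple consecutive errors; disabling cache for 5 minutes.");
      disabledUntil = Instant.now().plus(DISABLE_PERIOD);
      consecutiveErrors = 0;
    }
  }
}
```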
It would make sense for this to work regardless of what remote caching protocol is used – REST API, gRPC, etc.
Environment info
- Bazel version (output of `bazel info release`): 0.4.5
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 9
- Comments: 24 (16 by maintainers)
Commits related to this issue
- don't upload to the remote cache if downloading failed https://github.com/bazelbuild/bazel/issues/2964 — committed to dropbox/bazel by benjaminp 7 years ago
- don't upload to the remote cache if downloading failed https://github.com/bazelbuild/bazel/issues/2964 — committed to dropbox/bazel by benjaminp 7 years ago
- don't upload to the remote cache if downloading failed https://github.com/bazelbuild/bazel/issues/2964 — committed to dropbox/bazel by benjaminp 7 years ago
- don't upload to the remote cache if downloading failed https://github.com/bazelbuild/bazel/issues/2964 — committed to dropbox/bazel by benjaminp 7 years ago
- remote: don't fail build if upload fails If the upload of local build artifacts fails, the build no longer fails but instead a warning is printed once. If --verbose_failures is specified, a detailed ... — committed to bazelbuild/bazel by benjaminp 7 years ago
- Fall back to building without remote if remote server is inaccessible before build. In the case where gRPC remote mode is enabled (gRPC remote cache or remote execution), Bazel will check the server ... — committed to bazelbuild/bazel by coeuvre 4 years ago
@benjaminp has prepared a change [1] that, once merged, will no longer attempt to upload to the remote cache if the lookup failed. That solves half the problem, but failed lookups will still be retried, making the build really slow.
However, you can set `--experimental_remote_retry=false`, and then the build should be quick even if the remote cache is down once in a while. I already have an idea of how to make this work with retries enabled, but it will take a while until I have time to implement it.
[1] https://bazel-review.googlesource.com/c/15070
We’ve improved the error handling for 0.5.3 (upcoming). I think it’s correctly falling back now.
I have a prototype implementation of a circuit breaker in our retry logic [1]. Will share soon.
[1] https://martinfowler.com/bliki/CircuitBreaker.html
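For reference, the core of the pattern in [1] is a small state machine; a hedged sketch in Java (names and thresholds are illustrative, not the actual prototype):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

/**
 * Minimal circuit breaker after the pattern in [1]: CLOSED passes calls
 * through, OPEN fails fast, and after a cool-down a single trial call
 * (HALF_OPEN) decides whether to close the circuit again.
 */
final class CircuitBreaker {
  private enum State { CLOSED, OPEN, HALF_OPEN }

  private final int failureThreshold;
  private final Duration coolDown;
  private State state = State.CLOSED;
  private int failures = 0;
  private Instant openedAt = Instant.MIN;

  CircuitBreaker(int failureThreshold, Duration coolDown) {
    this.failureThreshold = failureThreshold;
    this.coolDown = coolDown;
  }

  synchronized <T> T call(Callable<T> action) throws Exception {
    if (state == State.OPEN) {
      if (Instant.now().isAfter(openedAt.plus(coolDown))) {
        state = State.HALF_OPEN; // Allow one trial call through.
      } else {
        throw new IllegalStateException("circuit open: failing fast");
      }
    }
    try {
      T result = action.call();
      state = State.CLOSED; // Success closes the circuit.
      failures = 0;
      return result;
    } catch (Exception e) {
      if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
        state = State.OPEN; // Trip the breaker; later calls fail fast.
        openedAt = Instant.now();
        failures = 0;
      }
      throw e;
    }
  }
}
```

Wrapping each remote cache access in such a breaker means that once the cache is known to be down, actions fail fast to local execution instead of each paying the full retry cost.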
@mmorearty Your analysis is spot on.
@ulfjack @mmorearty When `--remote_upload_local_results` is enabled, I propose to print a warning (one in total) if one or more uploads fail, instead of failing the build (see also #3368).

There’s another problem, however. Currently, if the remote cache lookup fails, we retry with exponential backoff. If the remote cache is down, this adds a 6-second retry period per action before we even attempt to build locally, which will slow down any build significantly. It’s possible to disable the retry mechanism via a command-line flag, but at some point combining all these flags gets too complicated, and I am also not sure we want to make this flag stable eventually (it’s experimental right now).
I think we should either make this error non-retryable or introduce a mechanism that stops retrying after N attempts have failed with such an error.
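To make that trade-off concrete, here is a hedged sketch of a retrier that treats some errors as non-retryable and gives up after N attempts; the error classification and constants are illustrative, not Bazel’s actual retry code:

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Retries an action with exponential backoff, but only for retryable errors. */
final class Retrier {
  private final int maxRetries;
  private final long initialBackoffMillis;
  private final Predicate<Exception> isRetryable;

  Retrier(int maxRetries, long initialBackoffMillis, Predicate<Exception> isRetryable) {
    this.maxRetries = maxRetries;
    this.initialBackoffMillis = initialBackoffMillis;
    this.isRetryable = isRetryable;
  }

  <T> T execute(Callable<T> action) throws Exception {
    long backoff = initialBackoffMillis;
    for (int attempt = 0; ; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        // Give up immediately on non-retryable errors (e.g. an unreachable
        // cache), and stop after maxRetries either way.
        if (!isRetryable.test(e) || attempt >= maxRetries) {
          throw e;
        }
        Thread.sleep(backoff);
        backoff *= 2; // Exponential backoff between attempts.
      }
    }
  }
}
```

If a cache-lookup failure such as a refused connection were classified as non-retryable here, the action would fall back to local execution immediately instead of waiting out the backoff schedule.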