bazel: Builds without the Bytes fails on missing AC result

[Not to be confused with available AC result referencing missing blobs in CAS]

The original proposal is clear about this limitation and calls it “initially acceptable”:

It’s possible that the remote system and the cached state on the local system become out of sync and the design needs to be robust enough to handle it. The interesting case is when Bazel has metadata about a remote output file being cached but the file was evicted from the remote system. Whether or not a file exists remotely is relevant if an executing action declares it as an input file. The ideal user experience would arguably be for Bazel to re-execute the generating action of the evicted output and transitively all its dependants, however that will be a large change that’s out of scope for this proposal and we argue that an acceptable initial behaviour is to fail with a meaningful error message, ask the user to run bazel clean and to re-run the command. We argue that this behaviour is initially acceptable because we expect that output files will only infrequently be evicted from a remote execution system.

However, most discussions (e.g. in #8250) focus only on using a remote cache that is not returning AC result if blobs are missing in CAS. But that is NOT enough to address the following scenario:

genrule(
    name = "a",
    srcs = ["a.in"],
    outs = ["a.out"],
    cmd = "cat $(SRCS) > $@",
)

genrule(
    name = "b",
    srcs = ["a.out", "b.in"],
    outs = ["b.out"],
    cmd = "cat $(SRCS) > $@",
)

# Prepare source files
echo a1 > a.in
echo b1 > b.in

# Populate remote cache (emulated with --disk_cache)
bazel clean;
bazel build :b --disk_cache=./diskcache

# Builds without the bytes is downloading b.out but not a.out.
bazel clean;
bazel build :b --remote_download_outputs=toplevel --experimental_inmemory_jdeps_files --experimental_inmemory_dotd_files --disk_cache=./diskcache

# Remove both CAS and AC result from remote cache
rm -rf diskcache

# Trigger re-build that needs a.out (without bazel clean or bazel shutdown)
echo b2 > b.in

bazel build :b --remote_download_outputs=toplevel --experimental_inmemory_jdeps_files --experimental_inmemory_dotd_files --disk_cache=./diskcache

Results in the following error:

ERROR: /home/ulrik/tmp/bazel_without_all_the_bytes_test/BUILD:10:1: Executing genrule //:b failed due to unexpected I/O exception: Failed to fetch file with hash '0111f7554519f7126c570c154b894f1fbcddf4faa126f6d644b974dab6c77411' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
java.io.IOException: Failed to fetch file with hash '0111f7554519f7126c570c154b894f1fbcddf4faa126f6d644b974dab6c77411' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:128)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:206)
at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:209)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:89)
at com.google.devtools.build.lib.actions.SpawnActionContext.beginExecution(SpawnActionContext.java:41)
at com.google.devtools.build.lib.exec.ProxySpawnActionContext.beginExecution(ProxySpawnActionContext.java:60)
at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)
at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:931)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1070)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1041)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:116)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:77)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:608)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:903)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:297)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:399)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
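
To make the failure mode concrete, here is a toy model of the scenario above. It is only a sketch and not how Bazel is implemented: `ToyCache`, `ToyClient`, `MissingBlobError` and `scenario` are hypothetical names, the action cache and CAS are plain dictionaries, and the "client" remembers only the digest of a.out, as Builds without the Bytes does for intermediate outputs. It shows that a cache which hides AC entries with missing blobs does not help here, because the client never asks for the AC entry of `a` again, and it contrasts that with a simplified form of rewinding.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class MissingBlobError(Exception):
    """Raised when a remembered digest is no longer present in the CAS."""


class ToyCache:
    """Hypothetical remote cache: action cache (AC) plus blob store (CAS)."""

    def __init__(self):
        self.ac = {}   # action key -> digest of the output
        self.cas = {}  # digest -> output bytes

    def wipe(self):
        # Models "rm -rf diskcache": both AC and CAS are gone.
        self.ac.clear()
        self.cas.clear()


class ToyClient:
    """Hypothetical client that keeps only the digest of a.out in memory."""

    def __init__(self, cache: ToyCache, rewind: bool):
        self.cache = cache
        self.rewind = rewind
        self.known = {}     # output name -> digest remembered across builds
        self.executed = []  # actions that actually ran

    def _build_a(self, a_in: bytes) -> None:
        key = digest(b"a\0" + a_in)
        hit = self.cache.ac.get(key)
        # Treat an AC entry whose blob is missing as a miss (the #8250 case).
        if hit is None or hit not in self.cache.cas:
            out = a_in  # stands in for: cat a.in > a.out
            self.cache.cas[digest(out)] = out
            self.cache.ac[key] = digest(out)
            self.executed.append("a")
        self.known["a.out"] = self.cache.ac[key]

    def build_b(self, a_in: bytes, b_in: bytes) -> str:
        # a.in is unchanged, so "a" is considered up to date and is not
        # looked up again once its output digest is known.
        if "a.out" not in self.known:
            self._build_a(a_in)
        key = digest(b"b\0" + self.known["a.out"].encode() + b"\0" + b_in)
        if key not in self.cache.ac:
            # AC miss for "b": it must run, so the bytes of a.out are needed.
            if self.known["a.out"] not in self.cache.cas:
                if not self.rewind:
                    raise MissingBlobError("a.out")
                self._build_a(a_in)  # simplified rewinding: regenerate a.out
            a_out = self.cache.cas[self.known["a.out"]]
            out = a_out + b_in  # stands in for: cat a.out b.in > b.out
            self.cache.cas[digest(out)] = out
            self.cache.ac[key] = digest(out)
            self.executed.append("b")
        return self.cache.ac[key]


def scenario(rewind: bool):
    cache = ToyCache()
    client = ToyClient(cache, rewind)
    client.build_b(b"a1\n", b"b1\n")  # first build populates the cache
    cache.wipe()                      # cache loses both AC and CAS
    try:
        client.build_b(b"a1\n", b"b2\n")  # b.in changed, a.in did not
    except MissingBlobError:
        return "failed", client.executed
    return "ok", client.executed
```

In this model, `scenario(rewind=False)` returns `("failed", ["a", "b"])`, mirroring the error above, while `scenario(rewind=True)` returns `("ok", ["a", "b", "a", "b"])` because the generating action of a.out is re-executed before `b` runs.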

Examples of scenarios where AC results could go missing:

  • Cache eviction.
  • Local --disk_cache storage cleared.
  • Remote cache down and replaced with another instance that is not fully synchronized.
  • Load balancing between alternative cache instances, e.g. with asynchronous replication.
  • Reconfigured sharding.
  • Remote RAM cache restarted.

I’m not happy with workarounds involving manual ‘bazel clean’, since I want users to trust the build system and not fall back to old habits of ‘make clean’.

There have been debates about supporting action rewinding to resolve this. Has any decision been made, @buchgr and @ulfjack?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 11
  • Comments: 20 (19 by maintainers)

Most upvoted comments

@tjgq Per this:

However most discussions (e.g. in https://github.com/bazelbuild/bazel/issues/8250) focus only on using a remote cache that is not returning AC result if blobs are missing in CAS. But that is NOT enough to address the following scenario:

I don’t agree that it’s a duplicate. It’s useful to keep this open, since I’ve been linking to it in various discussions, and it’s only really fixed by the lease service.

Still exporting some internal code to BwoB and will look into this after that (probably next week).

I will work on the fix for 2 in a few days.

@illicitonion, I notice you are working on https://github.com/illicitonion/bazel/commits/rewinding-bulk-upload-exceptions-5.0. What is the current state? Are you planning to send a pull request?

I have a PR open at https://github.com/bazelbuild/bazel/pull/14126 that’s been pending review for a few months - I will rebase it soon.