bazel: Important race condition when using remote cache without remote build

Description of the problem / feature request / question:

Our org has about 50 engineers using Bazel, with a remote cache but without remote build. About once every 3-4 weeks, I would hear of someone’s build failing in strange ways when the remote cache was turned on, but succeeding when the cache was off.

I finally got a repro case, and was able to figure out what was happening. Bazel’s remote spawn strategy does its work in this order:

  1. Compute the hash of the action (which includes the hash of all the input files)
  2. Check if the remote cache contains the action result matching that hash
  3. If it does, then download it
  4. If it doesn’t, then do a local build and upload the action result

But if the user makes any changes to the sources (e.g. saves a file, or uses git to check out a different branch of their code) after step 1, but before the local build in step 4 finishes, you will end up with an incorrect (corrupt) entry in the cache: its key matches the hash of the input files as they were when step 1 executed, but its contents do not match what would have been built from those input files; they match what was built from the modified input files.
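
To make the race concrete, here is a rough shell sketch of the ordering (this is not Bazel’s actual code; the hash command, cache URL, and file names are purely illustrative):

# Step 1: the action key is computed from the inputs as they are right now.
key=$(sha1sum foo.in | cut -d' ' -f1)

# Steps 2 and 4: the cache misses, so the action runs locally. This takes
# time, and nothing prevents foo.in from being edited while it runs.
sleep 10; cat foo.in > foo.out

# Upload: the result is stored under the key from step 1, even though
# foo.out may have been built from a newer foo.in.
curl -X PUT --data-binary @foo.out "http://localhost:8000/bazel-cache/race/$key"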

From that point on, all other developers on the team who try to build that same code will download the incorrect build results.

Interestingly, I think this cannot cause cache problems if you are using the remote cache WITH remote build. In that scenario, Bazel’s SimpleBlobStoreActionCache.uploadFileContents() will first read the file into memory, then compute the hash, and upload that already-in-memory blob. So the contents in the cache are guaranteed to be accurate (the hash under which a blob is stored always matches that blob’s contents). I haven’t tried it, but I imagine the worst that could happen — and I’m not even sure this can happen — is that the remote build might fail because it can’t find the input files it wants. No serious harm done.
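
A rough sketch of why that ordering is safe (again, not Bazel’s actual code, and shell variables are not binary-safe; this only illustrates that the key and the uploaded bytes come from the same snapshot):

# Read the file once; hash and upload the same in-memory bytes, so the
# key can never disagree with the uploaded contents.
blob=$(cat foo.in)
key=$(printf '%s' "$blob" | sha1sum | cut -d' ' -f1)
printf '%s' "$blob" | curl -X PUT --data-binary @- "http://localhost:8000/bazel-cache/race/$key"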

If possible, provide a minimal example to reproduce the problem:

This BUILD file has a rule called :race which, when built, first sleeps for ten seconds and then simply cats foo.in to foo.out:

genrule(
	name = "race",
	srcs = ["foo.in"],
	outs = ["foo.out"],
	cmd = "sleep 10; cat $(SRCS) > $@"
)

Make a file foo.in: echo 1 > foo.in

Start a build with a command line similar to this one (for this test, I have nginx running on localhost acting as a remote cache; you will need to set up something similar, and a rough nginx sketch follows the command):

bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build --spawn_strategy=remote --remote_rest_cache=http://localhost:8000/bazel-cache/race :race
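
For reference, a minimal nginx setup along those lines might look like the following; this is an assumption on my part (nginx built with the standard ngx_http_dav_module, blobs stored under /var/cache), so adjust the port and paths to your environment:

server {
    listen 8000;
    location /bazel-cache/ {
        root /var/cache;          # entries land under /var/cache/bazel-cache/
        dav_methods PUT;          # let Bazel upload cache entries
        create_full_put_path on;  # create intermediate directories on PUT
        client_max_body_size 0;   # don't reject large action outputs
    }
}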

While the build is happening — during the sleep 10 — in another shell, change the contents of foo.in: echo 2 > foo.in

The build will complete, and bazel-genfiles/foo.out will, not surprisingly, contain 2. That’s not a bug. But then:

  • Do a clean: bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean
  • Reset foo.in to its old contents: echo 1 > foo.in
  • Do another bazel build, with the same command line as above.

At this point, bazel-genfiles/foo.out will contain 2, because that’s what Bazel found in the remote cache. This is definitely wrong.
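
For convenience, the whole sequence can be driven from one script (assuming the BUILD file and foo.in above and the cache on localhost:8000; the concurrent edit is simply backgrounded instead of typed in a second shell):

BAZEL="bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1"
CACHE=http://localhost:8000/bazel-cache/race

echo 1 > foo.in
( sleep 3; echo 2 > foo.in ) &   # flip foo.in partway through the 10-second build
$BAZEL build --spawn_strategy=remote --remote_rest_cache=$CACHE :race
wait

$BAZEL clean
echo 1 > foo.in
$BAZEL build --spawn_strategy=remote --remote_rest_cache=$CACHE :race
cat bazel-genfiles/foo.out       # prints 2 from the poisoned cache; 1 is expected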

Environment info

  • Operating System:

OSX; probably Linux too, though I haven’t tested it

  • Bazel version (output of bazel info release):

0.5.2

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 15
  • Comments: 61 (54 by maintainers)

Most upvoted comments

We can force which Bazel version is used, but we can’t force or verify that every developer adds the extra flag when needed. We are all human, so it is normal that mistakes will sometimes be made and the flag will be forgotten… and when that happens, the consequences are serious and a lot of human time is lost before it is corrected.

At this point, I think we need someone to demonstrate a big speed regression to justify tolerating a cache corruption problem: something like measured build times with and without the flag. Without that, I think we should just change the default. Right now, the speed regression looks more like an anecdote than a real issue (it seems to have been reported by only one person, without any numbers).

The cache poisoning is a real issue that has affected many people.

That’s awesome, thank you for reporting it!

I agree, the default should be safe. For people who can guarantee that source files don’t change during a build, a speed-up knob can be useful. But it should be safe by default.

In FileContentsProxy via FileArtifactValue.

The check is already implemented, though. Just enable

--experimental_guard_against_concurrent_changes
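
For example, per invocation on the command line, or checked into the project’s .bazelrc so nobody has to remember it:

# .bazelrc
build --experimental_guard_against_concurrent_changes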

It feels like there’s a “{Fast, Correct} - Choose two” joke in here somewhere 😃

This page (https://docs.bazel.build/versions/master/remote-caching.html#known-issues) says:

“Input file modification during a build

When an input file is modified during a build, Bazel might upload invalid results to the remote cache. You can enable a change detection with the --experimental_guard_against_concurrent_changes flag. There are no known issues and it will be enabled by default in a future release. See issue #3360 for updates. Generally, avoid modifying source files during a build.”

Can it be enabled by default now, as the quoted documentation says it will be? People who do not want it can disable it. We need the cache to be safe by default.

If this context helps, I can tell you that I don’t know how we would use Bazel if this flag wasn’t available. (We are a Mac shop.) Before this flag existed, corruption of our remote cache was somewhat rare, but when it did happen — perhaps once a month — we were pretty mystified and would have to spend a long time trying to figure out what happened and what to do about it.

What’s the status of this?

No, I get it, and I have read it. But you’re seeing it manifest as a serious performance issue for you, @uri-canva, worse than the cache poisoning? Any more details you can provide (here or in the other issue)?

What I was seeing on my repo/machines was that it trades a minor performance issue for a serious correctness/cache-poisoning issue, and human time is much more valuable than CPU time. It seems like some others feel that way too… but it’s entirely possible we’re not hitting the issue the same way you are and are missing the severity of the trade-off.

@buchgr @ulfjack Does anyone remember why this flag isn’t enabled by default? 😅 Is it because of the “There may be cases where the Linux kernel delays writing of files, which could cause false positives.” part in the documentation of the flag?

The 0.5.3 release will contain a bug fix by @ola-rozenfeld that should lower the chances of hitting the remote cache race by checking the input files’ ctime before and after execution. However, as @ulfjack explained to me, this will not fix the bug: the ctime would need to be recorded at the time the action’s hash is computed, which is usually long before the action is executed. So this fix doesn’t eliminate the problem, it just shrinks the window in which it can occur, and we don’t know how significant that is. @ulfjack mentioned that some internal refactoring is needed to fix it for real, and that will NOT be done in time for the 0.5.3 release.
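
In shell terms, the interim check looks roughly like this (a sketch, not the actual change; note that the ctime is sampled just before execution rather than at hashing time, which is why a window remains):

# stat -f %c is the macOS form, stat -c %Z the GNU/Linux form of "ctime".
ctime_before=$(stat -f %c foo.in 2>/dev/null || stat -c %Z foo.in)
sleep 10; cat foo.in > foo.out   # the action itself
ctime_after=$(stat -f %c foo.in 2>/dev/null || stat -c %Z foo.in)
if [ "$ctime_before" != "$ctime_after" ]; then
  echo "inputs changed during execution; skipping the cache upload" >&2
else
  : # upload as usual; edits made between hashing and the first stat still slip through
fi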