bazel: Important race condition when using remote cache without remote build

Description of the problem / feature request / question:

Our org has about 50 engineers using Bazel, with a remote cache but without remote build. About once every 3-4 weeks, I would hear of someone’s build failing in strange ways when the remote cache was turned on, but succeeding when the cache was off.

I finally got a repro case, and was able to figure out what was happening. Bazel’s remote spawn strategy does its work in this order:

  1. Compute the hash of the action (which includes the hash of all the input files)
  2. Check if the remote cache contains the action result matching that hash
  3. If it does, then download it
  4. If it doesn’t, then do a local build and upload the action result

But if the user makes any changes to the sources (e.g. saves a file, or uses git to check out a different branch of their code) after step 1, but before the local build in step 4 finishes, you will end up with an incorrect (corrupt) entry in the cache: its key matches the hash of the input files as they were when step 1 executed, but its contents do not match what would have been built from those input files; they match what was built from the modified input files.
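
To make the race concrete, here is a rough shell sketch of the ordering (this is not Bazel’s actual code; the hash command, cache URL, and file names are purely illustrative):

# Step 1: the action key is computed from the inputs as they are right now.
key=$(sha1sum foo.in | cut -d' ' -f1)

# Steps 2 and 4: the cache misses, so the action runs locally. This takes
# time, and nothing prevents foo.in from being edited while it runs.
sleep 10; cat foo.in > foo.out

# Upload: the result is stored under the key from step 1, even though
# foo.out may have been built from a newer foo.in.
curl -X PUT --data-binary @foo.out "http://localhost:8000/bazel-cache/race/$key"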

From that point on, all other developers on the team who try to build that same code will download the incorrect build results.

Interestingly, I think this cannot cause cache problems if you are using the remote cache WITH remote build. In that scenario, Bazel’s SimpleBlobStoreActionCache.uploadFileContents() will first read the file into memory, then compute the hash, and upload that already-in-memory blob. So the contents in the cache are guaranteed to be accurate (the hash under which a blob is stored always matches that blob’s contents). I haven’t tried it, but I imagine the worst that could happen — and I’m not even sure this can happen — is that the remote build might fail because it can’t find the input files it wants. No serious harm done.
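
A rough sketch of why that ordering is safe (again, not Bazel’s actual code, and shell variables are not binary-safe; this only illustrates that the key and the uploaded bytes come from the same snapshot):

# Read the file once; hash and upload the same in-memory bytes, so the
# key can never disagree with the uploaded contents.
blob=$(cat foo.in)
key=$(printf '%s' "$blob" | sha1sum | cut -d' ' -f1)
printf '%s' "$blob" | curl -X PUT --data-binary @- "http://localhost:8000/bazel-cache/race/$key"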

If possible, provide a minimal example to reproduce the problem:

This BUILD file has a rule called :race which, when built, first sleeps for ten seconds and then simply cats foo.in to foo.out:

genrule(
	name = "race",
	srcs = ["foo.in"],
	outs = ["foo.out"],
	cmd = "sleep 10; cat $(SRCS) > $@"
)

Make a file foo.in: echo 1 > foo.in

Start a build with a command line similar to this one (for this test, I have nginx running on localhost acting as a remote cache; you will need to set up something similar, and a rough nginx sketch follows the command):

bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build --spawn_strategy=remote --remote_rest_cache=http://localhost:8000/bazel-cache/race :race
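
For reference, a minimal nginx setup along those lines might look like the following; this is an assumption on my part (nginx built with the standard ngx_http_dav_module, blobs stored under /var/cache), so adjust the port and paths to your environment:

server {
    listen 8000;
    location /bazel-cache/ {
        root /var/cache;          # entries land under /var/cache/bazel-cache/
        dav_methods PUT;          # let Bazel upload cache entries
        create_full_put_path on;  # create intermediate directories on PUT
        client_max_body_size 0;   # don't reject large action outputs
    }
}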

While the build is happening — during the sleep 10 — in another shell, change the contents of foo.in: echo 2 > foo.in

The build will complete, and bazel-genfiles/foo.out will, not surprisingly, contain 2. That’s not a bug. But then:

  • Do a clean: bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean
  • Reset foo.in to its old contents: echo 1 > foo.in
  • Do another bazel build, with the same command line as above.

At this point, bazel-genfiles/foo.out will contain 2, because that’s what Bazel found in the remote cache. This is definitely wrong.
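
For convenience, the whole sequence can be driven from one script (assuming the BUILD file and foo.in above and the cache on localhost:8000; the concurrent edit is simply backgrounded instead of typed in a second shell):

BAZEL="bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1"
CACHE=http://localhost:8000/bazel-cache/race

echo 1 > foo.in
( sleep 3; echo 2 > foo.in ) &   # flip foo.in partway through the 10-second build
$BAZEL build --spawn_strategy=remote --remote_rest_cache=$CACHE :race
wait

$BAZEL clean
echo 1 > foo.in
$BAZEL build --spawn_strategy=remote --remote_rest_cache=$CACHE :race
cat bazel-genfiles/foo.out       # prints 2 from the poisoned cache; 1 is expected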

Environment info

  • Operating System:

OSX; probably Linux too, though I haven’t tested it

  • Bazel version (output of bazel info release):

0.5.2

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 15
  • Comments: 61 (54 by maintainers)

Most upvoted comments

We can force which Bazel version is used, but we can’t force or verify that every developer adds the extra flag when needed. We are all human, so it is normal that mistakes will sometimes be made and the flag will be forgotten… and when that happens, the consequences are serious and a lot of human time is lost before it is corrected.

At this point, I think we need someone to demonstrate a big speed regression to justify tolerating a cache corruption problem: something like measured build times with and without the flag. Without that, I think we should just change the default. Right now, the speed regression looks more like an anecdote than a real issue (it seems to have been reported by only one person, without any numbers).

The cache poisoning is a real issue that has affected many people.

That’s awesome, thank you for reporting it!

I agree, the default should be safe. For people who can guarantee that source files don’t change during a build, a speed-up knob can be useful. But it should be safe by default.

In FileContentsProxy via FileArtifactValue.

The check is already implemented, though. Just enable

--experimental_guard_against_concurrent_changes
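
For example, per invocation on the command line, or checked into the project’s .bazelrc so nobody has to remember it:

# .bazelrc
build --experimental_guard_against_concurrent_changes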

It feels like there’s a “{Fast, Correct} - Choose two” joke in here somewhere 😃

This page (https://docs.bazel.build/versions/master/remote-caching.html#known-issues) says:

“Input file modification during a build

When an input file is modified during a build, Bazel might upload invalid results to the remote cache. You can enable a change detection with the --experimental_guard_against_concurrent_changes flag. There are no known issues and it will be enabled by default in a future release. See issue #3360 for updates. Generally, avoid modifying source files during a build.”

Can it be enabled by default now, as the quoted documentation says it will be? People who do not want it can disable it. We need the cache to be safe by default.

If this context helps, I can tell you that I don’t know how we would use Bazel if this flag wasn’t available. (We are a Mac shop.) Before this flag existed, corruption of our remote cache was somewhat rare, but when it did happen — perhaps once a month — we were pretty mystified and would have to spend a long time trying to figure out what happened and what to do about it.

What’s the status of this?

No, I get it, and I have read it. But you’re seeing it manifest as a serious performance issue for you, @uri-canva, worse than the cache poisoning? Any more details you can provide (here or in the other issue)?

What I was seeing on my repo/machines was that it trades a minor performance issue for a serious correctness/cache-poisoning issue, and human time is much more valuable than CPU time. It seems like some others feel that way too… but it’s entirely possible we’re not hitting the issue the same way you are and are missing the severity of the trade-off.

@buchgr @ulfjack Does anyone remember why this flag isn’t enabled by default? 😅 Is it because of the “There may be cases where the Linux kernel delays writing of files, which could cause false positives.” part in the documentation of the flag?

The 0.5.3 release will contain a bug fix by @ola-rozenfeld that should lower the chances of hitting the remote cache race by checking the input files’ ctime before and after execution. However, as @ulfjack explained to me, this will not fix the bug: the ctime would need to be recorded at the time the action’s hash is computed, which is usually long before the action is executed. So this fix doesn’t eliminate the problem, it just shrinks the window in which it can occur, and we don’t know how significant that is. @ulfjack mentioned that some internal refactoring is needed to fix it for real, and that will NOT be done in time for the 0.5.3 release.
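
In shell terms, the interim check looks roughly like this (a sketch, not the actual change; note that the ctime is sampled just before execution rather than at hashing time, which is why a window remains):

# stat -f %c is the macOS form, stat -c %Z the GNU/Linux form of "ctime".
ctime_before=$(stat -f %c foo.in 2>/dev/null || stat -c %Z foo.in)
sleep 10; cat foo.in > foo.out   # the action itself
ctime_after=$(stat -f %c foo.in 2>/dev/null || stat -c %Z foo.in)
if [ "$ctime_before" != "$ctime_after" ]; then
  echo "inputs changed during execution; skipping the cache upload" >&2
else
  : # upload as usual; edits made between hashing and the first stat still slip through
fi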