bazel: Important race condition when using remote cache without remote build
Description of the problem / feature request / question:
Our org has about 50 engineers using Bazel, with a remote cache but without remote build. About once every 3-4 weeks, I would hear of someone's build failing in strange ways when the remote cache was turned on, but succeeding when the cache was off.
I finally got a repro case, and was able to figure out what was happening. Bazel’s remote spawn strategy does its work in this order:
1. Compute the hash of the action (which includes the hashes of all the input files)
2. Check whether the remote cache contains an action result matching that hash
3. If it does, download it; if it doesn't, run the action locally and upload the action result
But if the user makes any changes to the sources (e.g. saves a file, or uses git to check out a different branch) after step 1 but before the local build in step 3, you will end up with an incorrect (corrupt) entry in the cache: its hash matches the input files as they were when step 1 executed, but its contents do not match what would have been built from those input files; they match what was built from the modified input files.
From that point on, all other developers on the team who try to build that same code will download the incorrect build results.
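To make the window concrete, here is a minimal sketch of that check-then-execute flow. All types and method names are made up for illustration and do not match Bazel's internal API; the point is only where the race window sits.

```java
// Illustrative sketch of the remote spawn strategy flow described above.
// All types and methods here are hypothetical, not Bazel's real API.
interface Action { String computeKeyFromInputsOnDisk(); }
interface ActionResult {}

interface RemoteCache {
  ActionResult lookup(String actionKey);              // null on cache miss
  void download(ActionResult cached);                 // materialize cached outputs locally
  void upload(String actionKey, ActionResult result);
}

interface LocalRunner {
  ActionResult runLocally(Action action);
}

class RemoteSpawnSketch {
  void execute(Action action, RemoteCache cache, LocalRunner runner) {
    // Step 1: compute the action key, which hashes every input file on disk.
    String actionKey = action.computeKeyFromInputsOnDisk();

    // Step 2: look that key up in the remote cache.
    ActionResult cached = cache.lookup(actionKey);
    if (cached != null) {
      // Step 3 (hit): download the cached outputs.
      cache.download(cached);
      return;
    }

    // RACE WINDOW: if an input file is edited after the hash above but before
    // the local execution below, the outputs no longer correspond to actionKey,
    // yet they are uploaded under actionKey anyway.
    ActionResult result = runner.runLocally(action);

    // Step 3 (miss): upload the outputs under the key computed in step 1.
    cache.upload(actionKey, result);
  }
}
```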
Interestingly, I think this cannot cause cache problems if you are using remote cache WITH remote build. In that scenario, Bazel's SimpleBlobStoreActionCache.uploadFileContents() will first read the file into memory, then compute its hash, and upload the already in-memory blob. So the file contents in the cache are guaranteed to be accurate (the source file's hash matches the source file's contents). I haven't tried it, but I imagine the worst that could happen (and I'm not even sure it could) is that the remote build might fail because it can't find the input files it wants. No serious harm done.
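As a purely illustrative sketch (not the actual SimpleBlobStoreActionCache code), the reason that upload path stays self-consistent is that the digest is computed from the exact bytes that get uploaded, so a later change to the file on disk cannot make the stored blob disagree with its key:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: shows why hashing an in-memory copy keeps the blob store
// self-consistent. This is not the real SimpleBlobStoreActionCache code.
class BlobUploadSketch {
  interface BlobStore {
    void put(String key, byte[] contents);
  }

  /** Reads the file once, hashes those exact bytes, and uploads them together. */
  static String uploadFileContents(Path file, BlobStore store)
      throws IOException, NoSuchAlgorithmException {
    byte[] contents = Files.readAllBytes(file);                 // snapshot the file
    byte[] digest = MessageDigest.getInstance("SHA-1").digest(contents);
    String key = toHex(digest);
    store.put(key, contents);  // key and contents always match, by construction
    return key;
  }

  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}
```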
If possible, provide a minimal example to reproduce the problem:
This BUILD file has a rule called :race which, when built, will first sleep for a few seconds and then simply cat foo.in to foo.out:
genrule(
    name = "race",
    srcs = ["foo.in"],
    outs = ["foo.out"],
    cmd = "sleep 10; cat $(SRCS) > $@",
)
Make a file foo.in: echo 1 > foo.in
Start a build with a command line similar to this one (for this test, I have nginx running on localhost acting as a remote cache; you will need to set up some sort of similar remote cache):
bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build --spawn_strategy=remote --remote_rest_cache=http://localhost:8000/bazel-cache/race :race
While the build is happening (during the sleep 10), in another shell change the contents of foo.in: echo 2 > foo.in
The build will complete, and bazel-genfiles/foo.out will, not surprisingly, contain 2. That's not a bug. But then:
- Do a clean: bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean
- Reset foo.in to its old contents: echo 1 > foo.in
- Do another bazel build, with the same command line as above.
At this point, bazel-genfiles/foo.out will contain 2, because that's what Bazel found in the remote cache. This is definitely wrong.
Environment info
- Operating System: OSX (probably Linux too; I haven't tested it)
- Bazel version (output of bazel info release): 0.5.2
About this issue
- Original URL
- State: open
- Created 7 years ago
- Reactions: 15
- Comments: 61 (54 by maintainers)
Commits related to this issue
- Don't upload corrupt artifacts to remote cache Workaround for https://github.com/bazelbuild/bazel/issues/3360. After an action finishes executing, we then re-check the timestamps of all of the input ... — committed to Asana/bazel by mmorearty 7 years ago
- remote: Lower the chances of a race condition in the remote upload. Lower the chances of triggering #3360. Immediately before uploading output artifacts, check if any of the inputs ctime changed, and... — committed to bazelbuild/bazel by deleted user 7 years ago
- Avoid a race condition in the remote upload Avoid triggering #3360. Immediately before a file's digest is computed for the first time, cache its last-modified time (mtime). Later, before uploading ou... — committed to Asana/bazel by mmorearty 7 years ago
- Fix race condition Fix for https://github.com/bazelbuild/bazel/issues/3360 — committed to Asana/bazel by mmorearty 7 years ago
- Move FileStateValue.Type to Metadata; add Metadata.getType() This is in preparation for merging FileArtifactValue and FileStateValue. Progress on #3360. PiperOrigin-RevId: 179832948 — committed to bazelbuild/bazel by ulfjack 7 years ago
- Disable concurrent change detection Regress on #3360. We have reports of Bazel outputting warnings for generated files, which I have been able to reproduce. Apparently, Bazel gets stuck with an old ... — committed to bazelbuild/bazel by ulfjack 6 years ago
- Fix RegularFileArtifactValue.equals If two SkyValue objects are equal, then Skyframe caches and returns the old one, instead of the new one. That's a problem for the detection of file changes, which ... — committed to bazelbuild/bazel by ulfjack 6 years ago
- Prevent broken cache entries on concurrent file changes Local execution has an inherent race condition: if a user modifies a file while an action is executed, then it is impossible for Bazel to tell ... — committed to Asana/bazel by ulfjack 6 years ago
- Fix RegularFileArtifactValue.equals If two SkyValue objects are equal, then Skyframe caches and returns the old one, instead of the new one. That's a problem for the detection of file changes, which ... — committed to Asana/bazel by ulfjack 6 years ago
- Move FileStateValue.Type to Metadata; add Metadata.getType() This is in preparation for merging FileArtifactValue and FileStateValue. Progress on #3360. PiperOrigin-RevId: 179832948 — committed to werkt/bazel by ulfjack 7 years ago
We can force which Bazel version is used, but we can't force or verify that each developer adds the extra flag when needed. We are all human, so it is normal that mistakes will sometimes be made and the flag will be forgotten. When that happens, the consequences are serious and a lot of human time is lost before the problem is corrected.
At this point, I think someone needs to demonstrate a big speed regression to justify tolerating a cache corruption problem, e.g. by measuring the speed with and without the flag. Without that, I think we should just change the default. Right now, the speed regression looks more like an anecdote than a big real issue (it seems to have been reported by only one person, without any numbers).
Cache poisoning, on the other hand, is a big, real issue that has affected many people.
That’s awesome, thank you for reporting it!
I agree, the default should be safe. For people who can guarantee that source files don't change, a speed-up knob can be useful. But it should be safe by default.
In FileContentsProxy via FileArtifactValue.
The check is already implemented, though. Just enable --experimental_guard_against_concurrent_changes.
It feels like there’s a “{Fast, Correct} - Choose two” joke in here somewhere 😃
This page (https://docs.bazel.build/versions/master/remote-caching.html#known-issues) says: "Input file modification during a build: When an input file is modified during a build, Bazel might upload invalid results to the remote cache. You can enable a change detection with the --experimental_guard_against_concurrent_changes flag. There are no known issues and it will be enabled by default in a future release. See issue #3360 for updates. Generally, avoid modifying source files during a build."
Can it be enabled by default now, as that text says it will be? People who do not want it can disable it. We need the cache to be safe by default.
If this context helps, I can tell you that I don’t know how we would use Bazel if this flag wasn’t available. (We are a Mac shop.) Before this flag existed, corruption of our remote cache was somewhat rare, but when it did happen — perhaps once a month — we were pretty mystified and would have to spend a long time trying to figure out what happened and what to do about it.
What’s the status of this?
No, I get it, and have read it. But you're seeing this manifest as a serious performance issue for you, @uri-canva, worse than the cache poisoning? Any more details you can provide (here or in the other issue)?
What I was seeing on my repo/machines was that it's trading off a minor performance issue for a serious correctness/cache-poisoning issue, and human time is much more valuable than CPU time. It seems like some others feel that way too, but it's totally possible we're not hitting the issue the same way you are and are missing the severity of the trade-off.
@buchgr @ulfjack Does anyone remember why this flag isn’t enabled by default? 😅 Is it because of the “There may be cases where the Linux kernel delays writing of files, which could cause false positives.” part in the documentation of the flag?
The 0.5.3 release will contain a bug fix by @ola-rozenfeld that should lower the chances of hitting the remote cache race by checking the input files' ctime before and after execution. However, as @ulfjack explained to me, this will not fix the bug. The problem is that we need to record each input's ctime at the time the action's hash is computed, which is usually well before the action is executed. So this fix still doesn't fully address the problem; it just reduces the time window in which it can happen. We don't know whether that is significant, though. @ulfjack mentioned we need to do some internal refactoring to be able to fix it for real. This will NOT be done in time for the 0.5.3 release.
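For reference, here is a rough sketch of the kind of ctime comparison being discussed (purely illustrative, not the actual implementation behind --experimental_guard_against_concurrent_changes): remember each input's ctime, ideally at the moment its digest is taken, and skip the upload if any ctime has changed by the time the outputs are about to be uploaded.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a ctime-based guard against concurrent input changes.
// Unix-only (uses the "unix" attribute view); this is not Bazel's actual code.
class ConcurrentChangeGuardSketch {
  private final Map<Path, FileTime> observedCtimes = new HashMap<>();

  /** Call when an input's digest is computed, to remember its ctime. */
  void recordInput(Path input) throws IOException {
    FileTime ctime = (FileTime) Files.getAttribute(input, "unix:ctime");
    observedCtimes.put(input, ctime);
  }

  /** Call just before uploading outputs; true if any input changed since. */
  boolean anyInputChanged() throws IOException {
    for (Map.Entry<Path, FileTime> entry : observedCtimes.entrySet()) {
      FileTime now = (FileTime) Files.getAttribute(entry.getKey(), "unix:ctime");
      if (!now.equals(entry.getValue())) {
        return true;  // skip the upload rather than poison the cache
      }
    }
    return false;
  }
}
```

The refactoring mentioned above would be needed so that the "record" step can happen at digest time rather than just before execution; the later the ctime is first recorded, the larger the window that remains undetected.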