bazel: builds with remote caching sometimes fail on incomplete downloads
Description of the problem / feature request:
Sometimes builds with HTTP remote caching enabled fail with "Failed to delete output files after incomplete download. Cannot continue with local execution."
See e.g. the logs here: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/
Bugs: what’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Unfortunately I’ve had some difficulty making this bug reproducible: you need to have remote caching enabled and a cache-entry download needs to fail, and even then the build often still succeeds by falling back to building locally.
What operating system are you running Bazel on?
A debian jessie based docker container.
What’s the output of bazel info release?
release 0.11.0
Any other information, logs, or outputs that you want to share?
About this issue
- State: closed
- Created 6 years ago
- Comments: 36 (21 by maintainers)
Commits related to this issue
- remote: recursively delete incomplete downloaded output directory. Fixes #5047 — committed to buchgr/bazel by buchgr 6 years ago
- remote: recursively delete incomplete downloaded output directory. Fixes #5047 Closes #5209. PiperOrigin-RevId: 196832678 — committed to bazelbuild/bazel by buchgr 6 years ago
- remote: limit number of open tcp connections by default. Fixes #5491 This change limits the number of open tcp connections by default to 100 for remote caching. We have had error reports where some u... — committed to bazelbuild/bazel by buchgr 6 years ago
- remote: fix race on download error. Fixes #5047 For downloading output files / directories we trigger all downloads concurrently and asynchronously in the background and after that wait for all downl... — committed to buchgr/bazel by buchgr 6 years ago
- remote: fix race on download error. Fixes #5047 For downloading output files / directories we trigger all downloads concurrently and asynchronously in the background and after that wait for all downl... — committed to bazelbuild/bazel by buchgr 6 years ago
- Release 0.16.0 (2018-07-31) Baseline: 4f64b77a3dd8e4ccdc8077051927985f9578a3a5 Cherry picks: + 4c9a0c82d308d5df5c524e2a26644022ff525f3e: reduce the size of bazel's embedded jdk + d3228b61... — committed to bazelbuild/bazel by a-googler 6 years ago
- Release 0.16.1 (2018-08-13) Baseline: 4f64b77a3dd8e4ccdc8077051927985f9578a3a5 Cherry picks: + 4c9a0c82d308d5df5c524e2a26644022ff525f3e: reduce the size of bazel's embedded jdk + d3228b61... — committed to bazelbuild/bazel by a-googler 6 years ago
OK, that took a while, but I believe I understand the error now.
Effectively, we asynchronously trigger concurrent downloads for all output files, which gives us a list of futures, and at the end we wait for all downloads to finish. If one download fails, we immediately trigger a routine that deletes all output files so that we can continue with local execution instead; that deletion routine is what fails.
However, when triggering this routine we don’t wait for all downloads to have finished; they continue downloading in the background. The deletion routine recursively deletes all files in a directory and then tries to delete the directory itself. Now that’s racing with the async downloads that are still running in the background T_T.
Async programming is hard. I’ll send out a fix and make sure it gets cherry-picked into 0.16.0, as this is a regression.
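To make the race concrete, here is a minimal sketch in the spirit of the description above. The class and method names (DownloadRace, downloadOutputsBuggy, deleteRecursively, and the signatures) are illustrative, not Bazel's actual API; Guava's ListenableFuture is assumed since Bazel uses it internally:

```java
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

class DownloadRace {
  // Buggy pattern: on the first failed download we start deleting outputs
  // while sibling downloads may still be writing into the same tree.
  void downloadOutputsBuggy(List<ListenableFuture<Void>> downloads, Path outDir)
      throws Exception {
    try {
      Futures.allAsList(downloads).get(); // fails fast on the first error
    } catch (Exception e) {
      // RACE: the other downloads keep running in the background. A file can
      // appear after the recursive delete has scanned its directory, so
      // deleting the directory itself fails ("Failed to delete output files").
      deleteRecursively(outDir);
      throw e;
    }
  }

  // Fixed pattern: before cleaning up, wait until *every* download has
  // terminated (succeeded, failed, or been cancelled), then delete.
  void downloadOutputsFixed(List<ListenableFuture<Void>> downloads, Path outDir)
      throws Exception {
    try {
      Futures.allAsList(downloads).get();
    } catch (Exception e) {
      Futures.whenAllComplete(downloads)
          .call(() -> null, MoreExecutors.directExecutor())
          .get(); // blocks until no download is writing anymore
      deleteRecursively(outDir);
      throw e;
    }
  }

  private void deleteRecursively(Path dir) throws IOException {
    // stand-in for a recursive delete of dir and its contents
  }
}
```

The essential point is that the cleanup must only start once every future has terminated, not merely once the first failure is observed.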
Ben,
can you point me to instructions for reproducing this? That is, which project to check out and which targets to run. I’ll try to understand what’s happening tomorrow.
@buchgr GoStdLib produces two directory outputs. Those directories can be pretty large and may contain a mix of files and subdirectories. That should be okay, right? I haven’t heard of any limitations on the contents of directory outputs.
@BenTheElder GoStdLib is a particularly heavy directory output. It wouldn’t surprise me if it triggered a bug in the interaction between directory outputs and remote caching. If you can think of a possible root cause in rules_go, please let me know though.
@BenTheElder We have a rule that builds the Go standard library for the target platform and mode. It’s used as an input for most other actions. It’s a directory output, but I guess that’s expressed as individual files over the wire.
I’m planning to change the implementation of that rule to split the standard library into pieces that can be used as separate inputs (tools, archives, sources, headers). We’d also stop including packages that can’t be imported from other Go code (e.g., vendored and internal packages in the standard library). That should improve performance quite a bit, both locally and with remote execution.
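For intuition about why a single directory output like GoStdLib turns into so much network traffic, here is a hypothetical sketch (the types RemoteCache, Dir, and friends are made up for illustration and are not Bazel's or rules_go's real classes): the cache stores a directory as a tree of individual file blobs, and the client schedules one download per file.

```java
import com.google.common.util.concurrent.ListenableFuture;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class DirectoryFanOut {
  interface RemoteCache {
    ListenableFuture<Void> downloadFile(String digest, Path target);
  }

  // Hypothetical tree shape: a directory lists file digests and subdirectories.
  record Dir(List<FileEntry> files, List<DirEntry> dirs) {}
  record FileEntry(String name, String digest) {}
  record DirEntry(String name, Dir dir) {}

  // Every file in the tree becomes its own download future. For a tree the
  // size of the Go standard library that is thousands of concurrent requests
  // unless something caps the number of open connections.
  List<ListenableFuture<Void>> scheduleDownloads(RemoteCache cache, Dir dir, Path root) {
    List<ListenableFuture<Void>> futures = new ArrayList<>();
    for (FileEntry f : dir.files()) {
      futures.add(cache.downloadFile(f.digest(), root.resolve(f.name())));
    }
    for (DirEntry d : dir.dirs()) {
      futures.addAll(scheduleDownloads(cache, d.dir(), root.resolve(d.name())));
    }
    return futures;
  }
}
```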
Hey Ben,
so in 0.15.0 we made a change so that all action outputs are downloaded in parallel. No concurrency limit is applied by default, but I figured the number of action outputs would probably be somewhat reasonable and thus there would be a natural limit. After the 0.15.0 release (after my vacation) I found that rules_go has actions with a very high number of outputs, and that we then run into open file descriptor limits and that kind of thing. I plan to add a ~200 connections limit in the 0.16.0 release, or 0.15.1 if there will be one.
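A minimal sketch of that kind of cap (illustrative only; the class below is hypothetical and Bazel's actual limiter may work differently): bounding the number of in-flight downloads bounds the number of open sockets and file descriptors.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThrottledDownloader {
  private final ExecutorService pool;

  // maxConnections plays the role of the proposed ~200 connection limit.
  ThrottledDownloader(int maxConnections) {
    this.pool = Executors.newFixedThreadPool(maxConnections);
  }

  // Each download still runs asynchronously, but at most maxConnections of
  // them execute (and hold a connection) at any one time; the rest queue up.
  List<Future<Void>> downloadAll(List<Callable<Void>> downloads) {
    return downloads.stream().map(pool::submit).toList();
  }
}
```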
I think with the --remote_max_connections=200 flag you will not encounter the error, but the error itself is still unexplained. It almost seems as if Bazel is interpreting a directory as a file.
Correct, the flag was introduced in 0.15.0.
@BenTheElder sorry about that 😦. I don’t understand what’s happening there yet. Can you try running with --remote_max_connections=200? I think this will avoid the failures in the first place.
I will try to reproduce the error.
Thanks for reporting. I’ll take a look!