bazel: builds with remote caching sometimes fail on incomplete downloads

Description of the problem / feature request:

Sometimes builds with HTTP remote caching enabled fail with "Failed to delete output files after incomplete download. Cannot continue with local execution."

See e.g. the logs here: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/

Bugs: what’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unfortunately, I’ve had some difficulty making this bug reproducible: you need remote caching enabled and a cache entry download to fail, and even then the build often still succeeds by falling back to building locally.

What operating system are you running Bazel on?

A Debian Jessie-based Docker container.

What’s the output of bazel info release?

release 0.11.0

Any other information, logs, or outputs that you want to share?

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62723/pull-kubernetes-bazel-build/38747/

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 36 (21 by maintainers)

Most upvoted comments

OK, that took a while, but I believe I understand the error now.

Effectively, we asynchronously trigger concurrent downloads for all output files, which gives us a list of futures, and at the end we wait for all of the downloads to finish. If one download fails, we immediately trigger a routine to delete all output files (this is the step that fails), so that we can continue with local execution instead.

However, when triggering this routine to delete all output files, we don’t wait for the remaining downloads to finish; they keep downloading in the background. The deletion routine recursively deletes all files in a directory and then tries to delete the directory itself, and that races with the async downloads still running in the background T_T.
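To make the race concrete, here is a minimal, standalone sketch of the pattern described above. This is not Bazel’s actual code; all class, file, and variable names are made up for illustration. Several downloads run concurrently, one fails, and the cleanup routine starts deleting the output tree while the remaining downloads are still writing into it:

    import java.nio.file.*;
    import java.util.*;
    import java.util.concurrent.*;

    // Sketch of the race: one download fails, we start deleting the output
    // tree, but the other downloads are still writing into it.
    public class DownloadRace {
      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Path outputDir = Files.createTempDirectory("outputs");

        List<Future<?>> downloads = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
          final int n = i;
          downloads.add(pool.submit(() -> {
            if (n == 0) {
              // The failing download.
              throw new RuntimeException("cache download failed");
            }
            Thread.sleep(100); // still "downloading" in the background
            Files.write(outputDir.resolve("out" + n), new byte[]{1});
            return null;
          }));
        }

        for (Future<?> f : downloads) {
          try {
            f.get();
          } catch (ExecutionException e) {
            // BUG: cleanup starts without waiting for (or cancelling) the
            // remaining downloads, so new files appear while we delete.
            deleteTree(outputDir);
            break;
          }
        }
        pool.shutdown();
      }

      static void deleteTree(Path dir) throws Exception {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
          for (Path p : stream) {
            if (Files.isDirectory(p)) deleteTree(p); else Files.delete(p);
          }
        }
        // May fail if a background download wrote a file after the listing.
        Files.delete(dir);
      }
    }

Depending on timing, either a background write lands in a directory that is being removed or the final directory delete finds it non-empty, which matches the "Failed to delete output files" symptom.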

Async programming is hard. I’ll send out a fix and make sure it gets cherry-picked into 0.16.0, as this is a regression.

Ben,

can you point me to instructions for reproducing this? That is, which project to check out and which targets to run. I’ll try to understand what’s happening tomorrow.

@buchgr GoStdLib produces two directory outputs. Those directories can be pretty large and may contain a mix of files and subdirectories. That should be okay, right? I haven’t heard of any limitations on the contents of directory outputs.

@BenTheElder GoStdLib is a particularly heavy directory output. It wouldn’t surprise me if it triggered a bug in the interaction between directory outputs and remote caching. If you can think of a possible root cause in rules_go, please let me know though.

@BenTheElder We have a rule that builds the Go standard library for the target platform and mode. It’s used as an input for most other actions. It’s a directory output, but I guess that’s expressed as individual files over the wire.

I’m planning to change the implementation of that rule to split the standard library into pieces that can be used as separate inputs (tools, archives, sources, headers). We’d also stop including packages that can’t be imported from other Go code (e.g., vendored and internal packages in the standard library). That should improve performance quite a bit, both locally and with remote execution.

Hey Ben,

so in 0.15.0 we made a change for all action outputs to be downloaded in parallel. There is no concurrency limit applied by default, but I figured the number of action outputs would probably be somewhat reasonable and thus there would be a natural limit. I found after the 0.15.0 release (after my vacation) that rules_go has actions with a very high number of outputs, and that we then run into open file descriptor limits and that kind of thing. I plan to add a default limit of ~200 connections in the 0.16.0 release, or in 0.15.1 if there is one.

I think with --remote_max_connections=200 you will not encounter the error, but the error itself is still unexplained. It almost seems as if Bazel is interpreting a directory as a file.
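For reference, the workaround flag can be passed either on the command line or via .bazelrc; the target pattern below is only a placeholder:

    # One-off on the command line:
    bazel build //... --remote_max_connections=200

    # Or persistently in .bazelrc:
    build --remote_max_connections=200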

Correct, the flag was introduced in 0.15.0.

@BenTheElder sorry about that 😦. I don’t understand what’s happening there yet. Can you try running with --remote_max_connections=200? I think this will avoid the failures in the first place.

I will try to reproduce the error.

Thanks for reporting. I’ll take a look!