bazel: Sandbox slowness on OSX

Description of the problem / feature request:

Building has been extremely slow with the default darwin-sandbox strategy.

Bugs: what’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

A minimal repro can be found at https://github.com/alexeagle/rules_sass_repro. The repro contains 40 empty Sass files. Running the Sass compiler on them should be fast.

bazel build :all takes ~60s on my mac

bazel build --strategy=SassCompiler=local :all takes ~4s

What operating system are you running Bazel on?

Mac OS 10.14.4

What’s the output of bazel info release?

release 0.25.0

Have you found anything relevant by searching the web?

I found these issues: https://github.com/bazelbuild/bazel/issues/902 and https://github.com/bazelbuild/bazel/issues/1836, but both seem obsolete.

JSON profile

Following https://docs.bazel.build/versions/master/skylark/performance.html#json-profile, I grabbed JSON profiles for the different strategies:

profiles.zip

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 98 (64 by maintainers)

Most upvoted comments

@burdiyan, I’ve fallen for something similar before where clang was super-slow in sandboxed mode (20x slower) and it was because of the module cache being recreated all the time, not due to any particular slowness of sandbox-exec. Like @jmmv said, sandbox-exec is kind of an easy target, being unsupported and all, but actual benchmarks have never shown it to be particularly slow.

Just for kicks I tried building a small part of the LLVM compiler and benchmarked the following scenarios:

  1. local strategy
  2. darwin-sandbox strategy
  3. darwin-fake-sandbox strategy that uses a no-op sandbox-exec that just calls execvp(3)
  4. darwin-copy-sandbox strategy that uses the CopyingSandboxedSpawn mentioned by @larsrc-google in a comment above.

The code is pushed to the darwin-fake-sandbox branch here. I didn’t bother plugging it into the build, so if you want to try it you need to manually compile the fake-sandbox-exec program and change the path in the code.

These are the results:

❯ hyperfine -p "~/o/bazel/bazel-bin/src/bazel --batch clean" "~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=local --config=generic_clang @llvm-project//clang:clang-tblgen" "~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen" "~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-fake-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen" "~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-copy-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen"
Benchmark 1: ~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=local --config=generic_clang @llvm-project//clang:clang-tblgen
  Time (mean ± σ):     38.560 s ±  1.027 s    [User: 380.953 s, System: 31.037 s]
  Range (min … max):   36.691 s … 39.787 s    10 runs

Benchmark 2: ~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen
  Time (mean ± σ):     41.092 s ±  0.826 s    [User: 398.602 s, System: 54.100 s]
  Range (min … max):   39.875 s … 42.639 s    10 runs

Benchmark 3: ~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-fake-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen
  Time (mean ± σ):     40.776 s ±  0.741 s    [User: 391.842 s, System: 53.463 s]
  Range (min … max):   39.502 s … 42.046 s    10 runs

Benchmark 4: ~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-copy-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen
  Time (mean ± σ):     53.806 s ±  1.171 s    [User: 445.051 s, System: 115.300 s]
  Range (min … max):   52.555 s … 56.255 s    10 runs

Summary
  '~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=local --config=generic_clang @llvm-project//clang:clang-tblgen' ran
    1.06 ± 0.03 times faster than '~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-fake-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen'
    1.07 ± 0.04 times faster than '~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen'
    1.40 ± 0.05 times faster than '~/o/bazel/bazel-bin/src/bazel --batch build --spawn_strategy=darwin-copy-sandbox --config=generic_clang @llvm-project//clang:clang-tblgen'

As you can see, removing the sandbox-exec command is a wash and copying is slower than symlinking (news at 11).
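
For readers who want to check the summary, the ratios can be re-derived from the reported means. The following is a small Python sketch and not part of the original benchmark; the dictionary and function names are made up for illustration, and only the four mean times are taken from the hyperfine output above.

```python
# Mean wall-clock times in seconds, copied from the hyperfine output above.
means = {
    "local": 38.560,
    "darwin-sandbox": 41.092,
    "darwin-fake-sandbox": 40.776,
    "darwin-copy-sandbox": 53.806,
}


def slowdown_vs(baseline, means):
    """Return {strategy: mean / baseline mean} for every other strategy."""
    base = means[baseline]
    return {name: t / base for name, t in means.items() if name != baseline}


ratios = slowdown_vs("local", means)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x the local mean")

# Difference attributable to the real sandbox-exec versus the no-op one.
sandbox_exec_cost = means["darwin-sandbox"] - means["darwin-fake-sandbox"]
print(f"sandbox-exec vs. no-op wrapper: {sandbox_exec_cost:.3f} s")
```

The ratios come out at 1.06, 1.07 and 1.40, matching the summary. The gap between darwin-sandbox and darwin-fake-sandbox is 0.316 s, which is smaller than the standard deviation of either run (0.826 s and 0.741 s).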

I don’t see any great advantage to such a strategy, certainly not one strong enough to merit the maintenance and extra complexity. We have too many sandboxing strategies already.

In this example, since you’re running golang outside of a Bazel rule, it’s likely generating its own cache, which it is blocked from reading or writing when using the sandbox.

I think there is indeed some low-hanging fruit for improving the Bazel situation on macOS.

First of all, I’m going to assume (and I might be ridiculously wrong about it) that when a lot of people talk about sandboxed builds, they mostly care about input file isolation rather than network isolation and the other things a true sandbox gives you. This is true for me as well: I don’t care much about rules being able to access the network, because I know what the cost of it is and I’m not going to do it, but I do want input file isolation, so that I know I’m not forgetting to specify any inputs when I’m building the target.

If that is true for many people (at least on macOS), then Bazel could have several options to improve their lives (I know nothing about how complex implementing any of them could actually be):

  1. Make darwin-sandbox strategy only care about input files. So stop using sandbox-exec, and simply assemble execroots with input files only for each target. Need to measure whether copy or symlink works better here, and maybe this even could be a flag to choose. Don’t know if this could be really called a “sandbox” in this case though.
  2. Change the understanding of what local strategy is. Right now local targets have access to all the workspace files, and it is probably not something that most people want. Maybe Bazel could do the same as Please for its local strategy, i.e. no other isolation except for input files. Again copy vs. symlink could be benchmarked, or even be configurable.
  3. Do the same as the previous option, but make it a new type of strategy, e.g. local-isolated or whatever.
  4. Indeed develop a proper low-level sandboxing facility for macOS, which is probably a non-option because of how complicated or even impossible this may be.

Basically options 1-3 are all the same, but named differently, and might have different levels of impact in terms of implementation.
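
To make the idea behind options 1-3 concrete, here is a minimal Python sketch of assembling an execroot that contains only the declared inputs, with a switch between symlinking and copying. It is an illustration only, not how Bazel or Please implement this; `assemble_execroot`, its parameters and the file names in the demo are all hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path


def assemble_execroot(workspace, inputs, execroot, mode="symlink"):
    """Populate execroot with only the declared inputs (workspace-relative paths)."""
    workspace = Path(workspace).resolve()
    execroot = Path(execroot)
    execroot.mkdir(parents=True, exist_ok=True)
    for rel in inputs:
        src = workspace / rel
        dst = execroot / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        if mode == "symlink":
            os.symlink(src, dst)    # cheap, but the action can still follow the link
        elif mode == "copy":
            shutil.copy2(src, dst)  # fully detached from the workspace, more I/O
        else:
            raise ValueError(f"unknown mode: {mode}")
    return execroot


# Demo: only the declared input is visible inside the execroot.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp, "workspace")
    ws.mkdir()
    (ws / "declared.scss").write_text("")
    (ws / "undeclared.scss").write_text("")
    root = assemble_execroot(ws, ["declared.scss"], Path(tmp, "execroot"))
    print(sorted(p.name for p in root.iterdir()))
```

The demo prints `['declared.scss']`: an action run with the execroot as its working directory would not find `undeclared.scss` by relative path, which is the input isolation being asked for. Nothing here prevents access through absolute paths or the network.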

I also got a trace profile:

image

As you can see, the sandbox setup/teardown is where most of the time is spent. This is without --experimental_reuse_sandbox_directories.

I don’t think there’s really a way out of this in the short term. I would suggest disabling sandboxing for local dev for performance, and enabling it for CI release builds in case anything slips through. This is what most folks are doing today.

Any updates on this? My team is waiting on this before moving to Bazel.

I think this problem is especially noticeable with the nodejs rules, where the number of inputs is easily an order of magnitude higher than in most other ecosystems. This is due to the lack of archive files (every file in a package is a separate input) and the dependency problem in JS (hundreds of transitive dependencies for common tools like react-scripts). We’re working on a fix in rules_nodejs to provide each package as a directory (TreeArtifact) instead, though it’s a breaking change (you can no longer reference individual files with labels).
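
As a rough way to see the size of this effect on a given project, the following Python sketch counts how many inputs a node_modules tree would contribute when every file is an input versus when every top-level package is a single directory input. It is only an approximation under stated assumptions: `count_inputs` is a hypothetical helper, scoped packages (`@scope/name`) are counted as one directory, and it says nothing about how Bazel itself tracks inputs.

```python
import os
import tempfile
from pathlib import Path


def count_inputs(node_modules):
    """Return (files, packages): per-file inputs vs. one directory input per package."""
    node_modules = Path(node_modules)
    packages = [p for p in node_modules.iterdir() if p.is_dir()]
    files = sum(len(names) for p in packages for _, _, names in os.walk(p))
    return files, len(packages)


# Demo on a tiny synthetic tree; point it at a real node_modules to compare.
with tempfile.TemporaryDirectory() as tmp:
    nm = Path(tmp, "node_modules")
    for pkg, n in (("pkg_a", 3), ("pkg_b", 5)):
        (nm / pkg / "lib").mkdir(parents=True)
        for i in range(n):
            (nm / pkg / "lib" / f"file_{i}.js").write_text("")
    files, packages = count_inputs(nm)
    print(f"{files} file inputs vs. {packages} directory inputs")
```

On the synthetic tree this prints `8 file inputs vs. 2 directory inputs`.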

There is another performance issue: https://github.com/bazelbuild/bazel/issues/20584

I will take care of this one. After it’s checked in I will profile again and see what else we can do.

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs or one of the following labels is added: “not stale”, “awaiting-bazeler”. Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

Thanks for the details. @larsrc-google is the right person to look into this (after the holidays) and perhaps spawn a separate issue for sandbox-exec.

@jakeleventhal We would love to make it faster, but I think this would require a fundamental redesign of the implementation on macOS. So far we’ve struggled to find a better mechanism in macOS than sandbox-exec with which we could implement faster and reliable sandboxing. The system just doesn’t seem to provide any good APIs for this. If you know of any APIs or tools that implement sandboxing on macOS, please send us pointers!

The most advanced sandboxing engine I know of is part of Microsoft’s BuildXL: https://github.com/microsoft/BuildXL/blob/master/Documentation/Specs/Sandboxing.md#macos-sandboxing but considering its complexity, so far no one has dared to look into if / how we could use it for Bazel.

We have been working to speed up sandboxing for our TypeScript Bazel build, which was originally often timing out at a 90 minute limit and frequently running at ~60 minutes (currently running without caching or remote build execution). We had identified the primary culprit as sandboxing slowness, which we observed both on Mac OS (our laptops) and Linux (our CI machines).

We had previously only enabled the new rules_nodejs exports_directories_only in our yarn_install, which dropped our TypeScript build down to 33-38 minutes, with occasional spikes of 55 minutes.

image

Yesterday, I tried adding both --experimental_reuse_sandbox_directories and --experimental_sandbox_async_tree_delete_idle_threads=1 in our build and it seems to have a good impact on top of the rules_nodejs exports_directories_only option. For our TypeScript build, this brings things down from 33-38 minutes with spikes of ~55 minutes to about 22-24 minutes steady (so far).

image

Zooming out for context, here are the max and average runtime trends for these jobs over the past 13 weeks, which includes all of these changes:

image

@fenghaolw Could you try running with the --experimental_reuse_sandbox_directories flag and see if that speeds up the sandboxing sufficiently?

Some tests I ran with this reproduction:

$ bazel build :all
INFO: Elapsed time: 36.556s, Critical Path: 7.71s
$ bazel build --spawn_strategy=local :all
INFO: Elapsed time: 5.569s, Critical Path: 1.23s
$ bazel build --experimental_use_sandboxfs :all
INFO: Elapsed time: 9.479s, Critical Path: 3.16s
$ bazel build --sandbox_debug :all
INFO: Elapsed time: 18.091s, Critical Path: 3.90s
$ bazel build --experimental_sandbox_async_tree_delete_idle_threads=auto :all
INFO: Elapsed time: 23.156s, Critical Path: 4.75s

And corresponding observations:

  1. Each action in this build has 11k files.
  2. sandboxfs does seem to help (as expected based on the previous).
  3. --sandbox_debug makes quite a bit of a difference. Deleting all the symlink trees is expensive, and this flag has the side-effect of not deleting them. But creating them is also quite expensive.
  4. The new --experimental_sandbox_async_tree_delete_idle_threads=auto helps approximate the behavior of --sandbox_debug and seems like a significant improvement over the current behavior. We should enable this new feature by default, but I remember seeing a crash recently that needs investigation…
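
Observations 3 and 4 can be explored outside Bazel with a small Python sketch that builds a tree of symlinks and then hands it off for deletion on a background thread. This is an assumption-laden illustration: the thread does not say how Bazel implements asynchronous tree deletion, the rename-then-delete approach below is just one common way to take deletion off the critical path, and all function names are hypothetical. The count of 11,000 mirrors the per-action file count from observation 1.

```python
import os
import shutil
import tempfile
import threading
import time
from pathlib import Path


def create_symlink_tree(root, target, count):
    """Create `count` symlinks to `target` inside `root`."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for i in range(count):
        os.symlink(target, root / f"input_{i}")


def delete_async(tree, trash_dir):
    """Rename the tree out of the way, then delete it on a background thread."""
    trash_dir = Path(trash_dir)
    trash_dir.mkdir(parents=True, exist_ok=True)
    holder = Path(tempfile.mkdtemp(dir=trash_dir))
    os.rename(tree, holder / "tree")  # cheap when on the same filesystem
    worker = threading.Thread(target=shutil.rmtree, args=(holder,))
    worker.start()
    return worker


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "real_file")
    target.write_text("")
    tree = Path(tmp, "sandbox")
    _, create_s = timed(create_symlink_tree, tree, target, 11_000)
    worker, handoff_s = timed(delete_async, tree, Path(tmp, "trash"))
    worker.join()  # wait so the temp directory can be cleaned up safely
    print(f"create: {create_s:.3f}s, hand-off for deletion: {handoff_s:.3f}s")
```

The timings depend heavily on the filesystem, so no expected numbers are given here. The point of the sketch is that the caller only pays for the rename, while the actual deletion overlaps with whatever runs next.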