bazel: Java crashes due to hsperfdata file conflicts across sandboxes

Running bazel build on a large Java/Scala project (several thousand targets) fails on Debian Linux when user namespaces are enabled.

Issue

Running bazel build with user namespaces enabled:

$ sysctl kernel.unprivileged_userns_clone=1

The build runs fine for a while, but at some point the JVM crashes with a SIGBUS:

ERROR: <target-path>/BUILD:35:1: error executing shell command: '
  rm -rf bazel-out/local-fastbuild/bin/<package>/<target>.jar_temp_resources_dir
  set -e
  mkdir -p bazel-out/local-fastbuild/bin/<target>' failed: Process terminated by signal 6 [sandboxed].
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f094606874b, pid=5, tid=0x00007f09472e0700
#
# JRE version:  (8.0_131-b11) (build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x96874b]  PerfMemory::alloc(unsigned long)+0x7b
#
# Core dump written. Default location: /home/builduser/.cache/bazel/_bazel_builduser/bc0e462ab01ac9379d22ad058ca1cb1f/bazel-sandbox/4864102460254154064/execroot/__main__/core or core.5
#
# An error report file with more information is saved as:
# /home/builduser/.cache/bazel/_bazel_builduser/bc0e462ab01ac9379d22ad058ca1cb1f/bazel-sandbox/4864102460254154064/execroot/__main__/hs_err_pid5.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

Environment info

The machine is a Docker container based on a Debian image.

$ uname -a
Linux 167-docker99 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux
$ cat /etc/*-release
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Bazel version

Additional information

  • The issue does not happen with unprivileged_userns_clone=0 (but clearly that’s not a solution).
  • With user namespaces enabled, Bazel 0.5.1 showed this issue. It may also be related to #3064.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 67 (56 by maintainers)

Most upvoted comments

Any updates on this? I started hitting this issue very often on all Linux machines in CI.

My current workaround is to add the following to my .bazelrc:

build --enable_platform_specific_config

build:linux --sandbox_tmpfs_path=/tmp

Yeah, in fact, I have to stop myself from campaigning for the flag flip to go into Bazel 6.0 😃

Emotionally, I feel like “Bazel can’t reliably run a JVM in an action” is quite an embarrassing issue, although the fact that this bug has been open for more than five years seems to imply that it’s a less serious issue than my feelings say.

Interestingly, HotSpot recently fixed this problem, too: openjdk/jdk@84f2314

Unfortunately, the fix came with a warning message printed to STDOUT. That breaks some of our build actions that write to STDOUT (https://github.com/google/google-java-format) and fills our build logs with hundreds of those warnings for all other JVM-tool actions. I’m not sure why we were not affected by the crash but are now affected by the logging; hopefully --incompatible_sandbox_hermetic_tmp will fix that new problem for us. We’re still on Bazel 5.3, but we’ll be sure to try this flag when we can.

Update: running the JVM in the sandbox should now be stable with the --incompatible_sandbox_hermetic_tmp command line option on Bazel@HEAD (after 8e32f44).

I’d appreciate it if you gave it a try; we are planning to flip that flag eventually (right, @larsrc-google?), so the more testing, and the more diverse the environments, the better.

@philwo thanks, now I get it; for some reason I thought that the sandbox individually bind mounts the input files instead of symlinking.

I’d take special-casing Java in the sandbox code over Java randomly crashing in actions any day; that way, at least it’s only us who get to see the ugliness.

I was wondering if we could get away with a more-complicated-than-seems-necessary solution:

  • We bind mount the workspace, bazel-out/ and $OUTPUT_BASE/external/ to well-known locations in the sandbox (/bazel-workspace, /bazel-out, /bazel-external or something, doesn’t really matter)
  • We direct the symlinks in the sandbox to those directories and not to their “real” locations
  • We bind mount /tmp to an empty directory as above

Then the sandbox would see consistent output paths, and it would work even if the output base or the workspace is under /tmp. It wouldn’t clash with anything on the “real” file system except these /bazel-* paths, if someone is mad enough to have those on their file system, or maybe in nested Bazel invocations (but even then, one could add a unique string per action to the path).

However, IIRC runfiles trees contain absolute symlinks to their contents, so they would break if they are symlinked “naively”. It’s not an unsolvable problem because the remote execution strategy solves it, but it does require some extra thinking.
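The symlink redirection proposed above could be sketched roughly as follows. This is only an illustration in Python, not Bazel's sandbox code: the function name `remap_symlink_target` and the example mount table are hypothetical, and the sketch only rewrites path strings (it does not perform any bind mounts, and it does not address the runfiles problem).

```python
from pathlib import PurePosixPath


def remap_symlink_target(target, mounts):
    """Rewrite an absolute symlink target to its in-sandbox location.

    `mounts` maps real host directories to the well-known sandbox paths
    they would be bind mounted at (hypothetical names such as
    /bazel-workspace). The longest matching host prefix wins. Targets
    outside every mount are returned unchanged.
    """
    path = PurePosixPath(target)
    best = None
    for real, inside in mounts.items():
        real_path = PurePosixPath(real)
        if path == real_path or real_path in path.parents:
            if best is None or len(real_path.parts) > len(best[0].parts):
                best = (real_path, PurePosixPath(inside))
    if best is None:
        return str(path)
    real_path, inside = best
    return str(inside / path.relative_to(real_path))


# Example mount table (assumed layout, with the workspace under /tmp).
EXAMPLE_MOUNTS = {
    "/tmp/ws": "/bazel-workspace",
    "/tmp/output_base/external": "/bazel-external",
}
```

With a table like this, a symlink that would have pointed at /tmp/ws/src/A.java points at /bazel-workspace/src/A.java instead, so mounting an empty directory over /tmp inside the sandbox no longer hides the inputs.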

Why do we have a PID namespace in the sandbox?

The reason why this bug exists is that actions have separate PID namespaces but a mostly shared file system, which strikes me as odd: we either try to isolate actions as fully as possible (but then how come they share /tmp?) or we only try to protect against mostly-accidental hermeticity violations (but then why the PID namespace?)
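To make the collision concrete, here is a small Python sketch. It assumes HotSpot's default perf data location of /tmp/hsperfdata_<user>/<pid>; the sandbox names and the helper functions are hypothetical and only model the paths, they do not start any JVM. Because each sandbox has its own PID namespace, JVMs in different sandboxes can see the same small PID (the crash log in this issue shows pid=5) and therefore pick the same file when /tmp is shared.

```python
def hsperfdata_path(user, pid, tmp_dir="/tmp"):
    """Path of the perf data file a JVM would use (assumed default layout)."""
    return f"{tmp_dir}/hsperfdata_{user}/{pid}"


def find_collisions(jvms):
    """Group JVMs by perf data path and return the paths used more than once.

    `jvms` is a list of (sandbox_name, host_tmp_dir, user, pid_in_namespace),
    where host_tmp_dir is the host directory that backs /tmp in that sandbox.
    """
    by_path = {}
    for sandbox, host_tmp_dir, user, pid in jvms:
        path = hsperfdata_path(user, pid, host_tmp_dir)
        by_path.setdefault(path, []).append(sandbox)
    return {path: names for path, names in by_path.items() if len(names) > 1}


# Shared /tmp: both sandboxes map PID 5 to the same host file.
shared = [
    ("sandbox-a", "/tmp", "builduser", 5),
    ("sandbox-b", "/tmp", "builduser", 5),
]

# Per-sandbox /tmp: the same PID no longer collides.
isolated = [
    ("sandbox-a", "/sandbox-a/tmp", "builduser", 5),
    ("sandbox-b", "/sandbox-b/tmp", "builduser", 5),
]
```

Here `find_collisions(shared)` reports one path claimed by both sandboxes, while `find_collisions(isolated)` reports none, which is the effect of giving each action its own /tmp.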