bazel: Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unexpected Exception 'null' when closing BEP transports, this is a bug.
Description of the problem / feature request:
I was updating from bazel 3.4.1 to 3.5.0, and after flipping to use 3.5.0 our builds started to fail with the following exception:
[2020-09-29T12:54:59.710Z] Executed 0 out of 1550 tests: 590 tests pass and 960 were skipped.
[2020-09-29T12:54:59.720Z] There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
[2020-09-29T12:54:59.725Z] INFO: Build completed successfully, 15248 total actions
[2020-09-29T12:54:59.737Z] Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unexpected Exception 'null' when closing BEP transports, this is a bug.
[2020-09-29T12:54:59.748Z] at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:475)
[2020-09-29T12:54:59.757Z] at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:550)
[2020-09-29T12:54:59.767Z] at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:565)
[2020-09-29T12:54:59.774Z] at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:628)
[2020-09-29T12:54:59.784Z] at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:617)
[2020-09-29T12:54:59.792Z] at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:236)
[2020-09-29T12:54:59.799Z] at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:661)
[2020-09-29T12:54:59.807Z] at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$2(GrpcServerImpl.java:726)
[2020-09-29T12:54:59.810Z] at io.grpc.Context$1.run(Context.java:605)
[2020-09-29T12:54:59.816Z] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[2020-09-29T12:54:59.823Z] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[2020-09-29T12:54:59.827Z] at java.base/java.lang.Thread.run(Unknown Source)
[2020-09-29T12:54:59.836Z] java.lang.RuntimeException: Unexpected Exception 'null' when closing BEP transports, this is a bug.
[2020-09-29T12:54:59.850Z] at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:475)
[2020-09-29T12:54:59.863Z] at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:550)
[2020-09-29T12:54:59.875Z] at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:565)
[2020-09-29T12:54:59.884Z] at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:628)
[2020-09-29T12:54:59.895Z] at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:617)
[2020-09-29T12:54:59.904Z] at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:236)
[2020-09-29T12:54:59.914Z] at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:661)
[2020-09-29T12:54:59.922Z] at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$2(GrpcServerImpl.java:726)
[2020-09-29T12:54:59.927Z] at io.grpc.Context$1.run(Context.java:605)
[2020-09-29T12:54:59.935Z] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[2020-09-29T12:54:59.943Z] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[2020-09-29T12:54:59.948Z] at java.base/java.lang.Thread.run(Unknown Source)
Unfortunately I do not have an exact reproduction case, as this appears to be an underlying timing issue, and occurs when we have a large set of build/test targets run using a BRE environment.
The error only occurs when we are writing out the build events as json, and there is a large set of targets, in my case 1550 tests, which leads to 15248 build actions. In my case I can reproduce the issue in our pipelines and locally fairly consistently with our target set of //…
bazel test --build_event_json_file=/tmp/build_events.json --config=remote-fast //...
In the above --config=remote-fast enables remote caching and remote execution to a buildbarn BRE setup.
I started to bisect to find which change between bazel 3.4.1 and 3.5.0, triggered the issue.
* BROKEN * d30b1997420b17af7864b46f2065269b42d692ba Caclulate configurable attribute values with iteration instead of recursion
* WORKS * 30bfcdf98470c85bb00d98487becb00d82cb82d6 bazel syntax: break lib.syntax -> skyframe.serialization dependency
* WORKS * 87e3017b8761c6769fbaa1f1370fed6df82fc9cc Support signing key rotation in android_binary.
932133f890608e15a5bc45b976c28b6b7df93274 bazel syntax: break lib.syntax tests -> skyframe.serialization.testutils dependency
* WORKS * 993ef784083da11f8e5e4958ca2c96d61743c4c6 The great Skylark -> Starlark class migration:
8874477c9ec5700efe9b924a3be76099ebd6ede7 Restore shared_library attribute in Starlark's cc_import
* WORKS * 003f7664c91f4fec81e7607c0b59651915836d3a Failed target/aspect completions include resolved artifact paths in BEP.
However after reverting just d30b1997420b17af7864b46f2065269b42d692ba the issue still occurs, this lead me to believe that this is more a timing issue, and it was a combination of changes, that cause an underlying bug to be tickled. Some of the changes leading up to the exception, are large and maybe cause timings that expose an existing race condition?
I modified the suspect area to remove the e.getCause().getMessage that is printing the ‘null’ in the exception catch log.
- e.getCause().getMessage()));
+ e.getMessage()));
This shows the true exception of ‘java.lang.UnsupportedOperationException’
Unexpected Exception 'java.lang.UnsupportedOperationException' when closing BEP
Following the trail from this call to block while waiting for for the BEP to closeBepTransports try (AutoProfiler p = GoogleAutoProfilerUtils.logged(“waiting for BES close”)) { Uninterruptibles.getUninterruptibly(Futures.allAsList(transportFutures.values())); } I changed things up a bit more to simply catch the exception then print the stack, subverting the RuntimeException handling. This leads to the following exception.
java.lang.UnsupportedOperationException
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:526)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:487)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:463)
at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:541)
at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:556)
at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:628)
at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:617)
at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:236)
at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:661)
at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$2(GrpcServerImpl.java:726)
at io.grpc.Context$1.run(Context.java:605)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsupportedOperationException
at com.google.devtools.build.lib.remote.RemoteActionFileSystem$1.getNodeId(RemoteActionFileSystem.java:317)
at com.google.devtools.build.lib.actions.cache.DigestUtils$CacheKey.<init>(DigestUtils.java:104)
at com.google.devtools.build.lib.actions.cache.DigestUtils.manuallyComputeDigest(DigestUtils.java:249)
at com.google.devtools.build.lib.actions.cache.DigestUtils.getDigestWithManualFallback(DigestUtils.java:230)
at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:64)
at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:60)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.readPathMetadata(ByteStreamBuildEventArtifactUploader.java:129)
at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$upload$3(ByteStreamBuildEventArtifactUploader.java:222)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
So it looks like there is a operation of getNodeID called but that is not supported.
at com.google.devtools.build.lib.remote.RemoteActionFileSystem$1.getNodeId(RemoteActionFileSystem.java:317)
And that looks like this might be the root cause of the issue the statFromRemoteMetadata throws a ‘UnsupportedOperationException’ https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java;drc=03be73edcf5d41bbef31220f9ebf86f1ee629103;l=317
Unfortunately I do not know enough of the internals of Bazel to fix the root issue. Hoping with this information someone familiar with this code, will be able to isolate further.
What operating system are you running Bazel on?
MacOS, Linux
What’s the output of bazel info release
?
release 3.5.0
Have you found anything relevant by searching the web?
No issues found that match the signature of the exception.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 43 (23 by maintainers)
Commits related to this issue
- Add some debugging for bizarre BES-RemoteActionFileSystem interaction bug. #12225 PiperOrigin-RevId: 336148320 — committed to bazelbuild/bazel by janakdr 4 years ago
- Debugging for #12225 — committed to bazelbuild/bazel by janakdr 4 years ago
- More debugging for #12225 — committed to bazelbuild/bazel by janakdr 4 years ago
- Add AggregatingArtifactValue#toString. #12225 PiperOrigin-RevId: 336368405 — committed to bazelbuild/bazel by janakdr 4 years ago
- Add ActionInputMap#toString: will be used for debugging. #12225 PiperOrigin-RevId: 337106342 — committed to bazelbuild/bazel by janakdr 4 years ago
- Expand ActionInputMap#toString a bit. #12225 PiperOrigin-RevId: 337911317 — committed to bazelbuild/bazel by janakdr 4 years ago
- Manual rollback of commit 95bad272d1075924443db83449fba08524c292de. *** Reason for rollback *** https://github.com/bazelbuild/bazel/issues/12225 *** Original change description *** Reduce number o... — committed to bazelbuild/bazel by joeleba 4 years ago
- Manual rollback of commit 95bad272d1075924443db83449fba08524c292de. *** Reason for rollback *** https://github.com/bazelbuild/bazel/issues/12225 *** Original change description *** ... — committed to luca-digrazia/DatasetCommitsDiffSearch by deleted user 2 years ago
Yeah, unfortunately the name of the file is not in the RemoteFileArtifactValue, which I knew, but was avoiding confronting 😃 . Will mail you a PR not for submission that adds more debugging.
There have been some changes in the area of computing digests. @justinhorvitz made one, https://github.com/bazelbuild/bazel/commit/2983d0454ace4fcbe84c118eff0d7be6c7db7339. I audited it and didn’t see anything obvious that would cause this failure. It looks like this codepath is unexpected: RemoteActionFileSystem is supposed to have fast digests, so we should never get to the point where we’re looking for the node id. I can add some logging that will show the name of the file that’s causing this issue, which might help.