bazel: Crash with `--experimental_remote_download_outputs=toplevel`

Description of the problem / feature request:

When using --experimental_remote_download_outputs=toplevel and the cache returns a 5xx error (a 504 in the log below), Bazel crashes. I would expect it to recover by retrying the request or, in the worst case, running the build locally. Without this flag, Bazel does seem to be resilient to this.
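To make the expected behavior concrete, here is a minimal sketch of "retry, then fall back to a local build". It is not Bazel's implementation; `CacheError`, `fetch_or_build`, and the two callables are hypothetical names used only for illustration.

```python
import time
from typing import Callable


class CacheError(Exception):
    """Hypothetical error raised when the remote cache answers with a 5xx."""


def fetch_or_build(
    download: Callable[[], bytes],
    build_locally: Callable[[], bytes],
    retries: int = 3,
    backoff_s: float = 0.0,
) -> bytes:
    """Try the remote cache a few times, then fall back to a local build."""
    for attempt in range(retries):
        try:
            return download()
        except CacheError:
            # Wait a little longer after each failed attempt.
            time.sleep(backoff_s * (2 ** attempt))
    # Every attempt failed: rebuild instead of aborting the whole build.
    return build_locally()


def always_504() -> bytes:
    """Stand-in for a cache that is permanently unhealthy."""
    raise CacheError("504 Gateway Time-out")


print(fetch_or_build(always_504, lambda: b"built locally"))
```

With a permanently failing cache the call returns the locally built bytes rather than raising, which is the outcome this report asks for.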

ERROR: /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/com_google_protobuf/BUILD:388:1: Linking of rule '@com_google_protobuf//:protoc' failed due to unexpected I/O exception: 504 Gateway Time-out
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.15.3</center>
</body>
</html>

com.google.devtools.build.lib.remote.blobstore.http.HttpException: 504 Gateway Time-out
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.15.3</center>
</body>
</html>

	at com.google.devtools.build.lib.remote.blobstore.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:116)
	at com.google.devtools.build.lib.remote.blobstore.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:41)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1476)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1225)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1272)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

Bugs: what’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run bazel run TARGET while pointing at an unhealthy HTTP cache and passing --experimental_remote_download_outputs=toplevel
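For anyone without an unhealthy cache at hand, a small stand-in server that answers every request with a 504 may help approximate one. This is only a sketch: `Always504`, `start_unhealthy_cache`, and `probe` are hypothetical helpers, and whether a cache that fails every request reaches the same code path as the stack trace above has not been verified.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Always504(BaseHTTPRequestHandler):
    """Answer every cache read or write with 504 Gateway Time-out."""

    def _fail(self) -> None:
        body = b"504 Gateway Time-out\n"
        self.send_response(504)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    do_GET = _fail
    do_PUT = _fail

    def log_message(self, fmt, *args):
        # Keep the console quiet.
        pass


def start_unhealthy_cache(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), Always504)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(url: str) -> int:
    """Return the HTTP status the cache gives for a GET, bypassing proxies."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Call `start_unhealthy_cache(8080)`, keep the process alive, and point Bazel's HTTP cache flag for your version at `http://127.0.0.1:8080`; `probe` is only a sanity check that the server really returns 504.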

What operating system are you running Bazel on?

macOS

What’s the output of bazel info release?

release 0.26.0

cc @buchgr

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 32 (28 by maintainers)

Most upvoted comments

Bazel should run the local action to rebuild the missing entry. I will work on this.

@keith disable garbage collection on GCS. The issue is that GCS does not understand the action graph of what it’s caching and evicts entries purely based on time. So it might have a cache entry for an action A1 but deleted all the outputs of this action. If an action A2 now needs the outputs of A1, then Bazel can’t download them and also can’t re-run action A1 and thus has to print this error.

I’ll have to write a blog post about this and discuss potential mitigations. For GCS specifically, the only currently viable strategy is to disable garbage collection and wipe the whole cache from time to time. I imagine it would be easy enough to write a cloud function (or something similar) that reads the action graph from GCS and properly evicts items from time to time.
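A rough sketch of what such graph-aware eviction could look like, assuming a deliberately simplified model: the action cache is a dict from action key to the digests of its outputs, and the CAS is a set of digests. None of this is a GCS or Bazel API; `prune_dangling_actions` and `evict_action` are hypothetical names.

```python
from typing import Dict, List, Set


def prune_dangling_actions(ac: Dict[str, List[str]], cas: Set[str]) -> List[str]:
    """Drop action-cache entries that reference blobs no longer in the CAS."""
    dangling = sorted(
        key for key, outputs in ac.items()
        if any(digest not in cas for digest in outputs)
    )
    for key in dangling:
        del ac[key]
    return dangling


def evict_action(ac: Dict[str, List[str]], cas: Set[str], key: str) -> None:
    """Evict one action plus any outputs no other cached action references."""
    outputs = ac.pop(key, [])
    still_referenced = {d for outs in ac.values() for d in outs}
    for digest in outputs:
        if digest not in still_referenced:
            cas.discard(digest)


# A1's output was evicted purely by age, so its cache entry now dangles.
ac = {"A1": ["lib.o"], "A2": ["app"]}
cas = {"app"}
print(prune_dangling_actions(ac, cas))
```

After pruning, a lookup for A1 is an ordinary cache miss, so the client re-runs A1 instead of trusting an entry whose outputs cannot be downloaded. `evict_action` goes the other way: it removes the action entry first and only then the blobs nothing else needs, so an entry never outlives its outputs.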

@keith It should fall back to local execution. It’s a bug either way.

Sorry to be late. It looks like there are different issues posted on this thread.

  1. AC exists, but failed to download from CAS. https://github.com/bazelbuild/bazel/issues/8508#issuecomment-509755175, https://github.com/bazelbuild/bazel/issues/8508#issuecomment-800649425
    • This requires action rewinding, and I will try to find time to work on this in Q2.
  2. load_shed issue. https://github.com/bazelbuild/bazel/issues/8508#issuecomment-689049505
    • The changes to correctly handle GOAWAY errors are included in 4.1.0rc1. (gRPC dynamic connection pool)
  3. HTTP cache issue. https://github.com/bazelbuild/bazel/issues/8508#issuecomment-691406179

@brentleyjones, can you please share more details about the error? (e.g. enable --verbose_failures)

@keith Do you mind if we close this issue and track it in #8250?

I’m collecting all issues for build-without-the-bytes in the tracking issue (https://github.com/bazelbuild/bazel/issues/6862). I’m making slow progress getting through the backlog. If you have issues that are easy to repro, I can take a look.