bazel-remote: Reject requests instead of crashing when overloaded

We are experiencing bazel-remote crashes when the local SSD is overloaded by write operations. It would be preferable for bazel-remote to reject requests instead of crashing.

The crash message contains `runtime: program exceeds 10000-thread limit`, and among the listed goroutines there are almost 10,000 stack traces containing either `diskCache.Put` or `diskCache.removeFile`, blocked in syscalls. They originate from both HTTP and gRPC requests.

What do you think about rejecting incoming requests once a configurable maximum number of concurrent `diskCache.Put` invocations is reached? Or are there better approaches?

We have not experienced overload from read requests; the network interface is probably saturated before that would happen. Therefore, it seems beneficial to still allow read requests and serve cache hits even while writes are being rejected.

I’m uncertain about the proper relation between the semaphore constant (currently 5000) for `diskCache.removeFile`, the number of allowed `diskCache.Put` invocations, and Go’s default limit of 10,000 operating system threads. Would it make sense to set one limit for the sum of `diskCache.removeFile` and `diskCache.Put`? Should bazel-remote support tuning https://pkg.go.dev/runtime/debug#SetMaxThreads?
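For reference, the thread ceiling itself is adjustable via `runtime/debug`; `SetMaxThreads` returns the previous limit, which defaults to 10,000. A trivial wrapper (the function name is made up for illustration):

```go
package main

import "runtime/debug"

// raiseThreadLimit sets the maximum number of OS threads the Go runtime
// may create before crashing the program, and returns the old limit.
// The runtime default is 10000.
func raiseThreadLimit(n int) int {
	return debug.SetMaxThreads(n)
}
```

Raising the limit only buys headroom, though; without bounding concurrent syscalls, an overload would eventually hit any ceiling.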

It is important not to reject or hinder Prometheus HTTP requests, because the metrics are even more important under overload conditions.

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (11 by maintainers)

Most upvoted comments

(I’m busier than normal at the moment, but wanted to mention that my goal is to merge some form of #680 this week, then turn my attention back to this issue.)