bazel-remote: Reject requests instead of crashing when overloaded
We experience bazel-remote crashes when the local SSD is overloaded by write operations. It would be preferable if bazel-remote rejected requests instead of crashing.
The crash message contains "runtime: program exceeds 10000-thread limit", and among the listed goroutines there are almost 10,000 stack traces containing either diskCache.Put or diskCache.removeFile, performing syscalls. They originate from both HTTP and gRPC requests.
What do you think about rejecting incoming requests once a configurable maximum number of concurrent diskCache.Put invocations has been reached? Or are there better ways?
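Roughly the kind of mechanism I have in mind, as a minimal sketch (putSem, maxConcurrentPuts, limitedPut and errOverloaded are made-up names, not anything that exists in bazel-remote today): a weighted semaphore whose TryAcquire never blocks, so a write can be rejected immediately with HTTP 503 or gRPC RESOURCE_EXHAUSTED once the limit is hit.

```go
// Hypothetical sketch: cap concurrent Put calls and reject new ones instead
// of letting goroutines pile up in blocking syscalls.
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/semaphore"
)

const maxConcurrentPuts = 2000 // made-up value; would be a config option

var (
	errOverloaded = errors.New("too many concurrent Put requests")
	putSem        = semaphore.NewWeighted(maxConcurrentPuts)
)

// limitedPut wraps a disk cache write. TryAcquire never blocks: if the limit
// is already reached, the caller can return HTTP 503 or gRPC
// RESOURCE_EXHAUSTED right away instead of occupying yet another OS thread.
func limitedPut(ctx context.Context, write func(context.Context) error) error {
	if !putSem.TryAcquire(1) {
		return errOverloaded
	}
	defer putSem.Release(1)
	return write(ctx)
}

func main() {
	err := limitedPut(context.Background(), func(ctx context.Context) error {
		return nil // a real implementation would write the blob to disk here
	})
	fmt.Println("put result:", err)
}
```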
We have not experienced overload from read requests; the network interface is probably saturated before that would happen. Therefore, it seems beneficial to keep serving read requests and returning cache hits even when writes are rejected.
I'm uncertain about the proper relation between the semaphore constant (currently 5000) for diskCache.removeFile, the number of allowed concurrent diskCache.Put invocations, and Go's default limit of 10,000 operating system threads. Would it make sense to set one limit for the sum of diskCache.removeFile and diskCache.Put? Should bazel-remote support tuning https://pkg.go.dev/runtime/debug#SetMaxThreads?
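To make the question concrete, here is a purely illustrative sketch of a single shared budget for both operations, with the thread limit made explicit via debug.SetMaxThreads. Whether these particular numbers make sense is exactly what I'm unsure about; diskOpSem, maxBlockingDiskOps and acquireDiskOp are invented names.

```go
// Purely illustrative: one weighted semaphore shared by Put and removeFile,
// sized below the OS-thread cap, with the cap itself set explicitly via
// debug.SetMaxThreads. All names and numbers are invented.
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime/debug"

	"golang.org/x/sync/semaphore"
)

const (
	maxOSThreads       = 10000 // Go's default; could become a tuning option
	maxBlockingDiskOps = 8000  // budget for concurrent Put + removeFile syscalls
)

var (
	errOverloaded = errors.New("disk cache overloaded")
	diskOpSem     = semaphore.NewWeighted(maxBlockingDiskOps)
)

// acquireDiskOp reserves one unit of the shared budget. Client writes use the
// non-blocking path and get rejected when the budget is exhausted; internal
// eviction (removeFile) can afford to wait for capacity instead.
func acquireDiskOp(ctx context.Context, canWait bool) (release func(), err error) {
	if canWait {
		if err := diskOpSem.Acquire(ctx, 1); err != nil {
			return nil, err
		}
	} else if !diskOpSem.TryAcquire(1) {
		return nil, errOverloaded
	}
	return func() { diskOpSem.Release(1) }, nil
}

func main() {
	// Make the thread limit explicit instead of relying on the runtime default.
	debug.SetMaxThreads(maxOSThreads)

	release, err := acquireDiskOp(context.Background(), false)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	defer release()
	fmt.Println("acquired one disk-op slot")
}
```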
It is important not to reject or hinder Prometheus HTTP requests, because the metrics are even more important in overload conditions.
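In other words, the metrics endpoint would be registered outside whatever limiter ends up guarding the cache endpoints, along these lines (limitWrites and cacheHandler are placeholders, not real bazel-remote handlers):

```go
// Sketch only: keep the Prometheus endpoint outside the load-shedding path so
// scrapes still succeed during overload.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()

	// Metrics bypass any request-rejection middleware.
	mux.Handle("/metrics", promhttp.Handler())

	// Cache traffic goes through the (hypothetical) limiter.
	mux.Handle("/", limitWrites(http.HandlerFunc(cacheHandler)))

	log.Fatal(http.ListenAndServe(":8080", mux))
}

// limitWrites would reject PUT/POST requests with 503 when the write budget
// is exhausted, and pass everything else (reads, HEAD) straight through.
func limitWrites(next http.Handler) http.Handler {
	return next // placeholder
}

func cacheHandler(w http.ResponseWriter, r *http.Request) {}
```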
Commits related to this issue
- Reject high-cost requests instead of creating more OS threads when overloaded: "We have been using a file removal semaphore with weight 5,000 (half of Go's default 10,000 maximum OS threads, beyond whi..." — committed to mostynb/bazel-remote by mostynb a year ago
(I’m busier than normal at the moment, but wanted to mention that my goal is to merge some form of #680 this week, then turn my attention back to this issue.)