bazel-remote: Very high memory usage on v2.3.3 - is this configurable?

We’re experiencing severe memory issues with the cache since upgrading to v2.3.3 (from v1.1.0). These were largely asymptomatic for most of January and February, but started causing frequent cache OOMs following our upgrade to Bazel 5 at the beginning of last week. The memory footprint was already significantly higher before the Bazel 5 upgrade, however.

Prior to the bazel-remote cache upgrade (which took place on 2022-01-01), memory usage was minimal. Following the upgrade, the cache process regularly uses up all the memory on the host (~92 GB), resulting in the OOM killer terminating the cache.

[graph: usable_mem]

We also noticed that the number of used file handles dropped markedly following the cache upgrade, which leads us to believe that some operations that previously relied on heavy disk usage now happen in memory.

[graph: remote_file_handles]

A very large chunk of the memory usage occurs during cache startup. For example, following a crash at 2:10pm, the cache was already holding 70 GB of memory by 2:29pm, when it finally started serving requests again. You can see the memory usage trend for that OOM/restart (and two prior ones) in this screenshot:

[graph: remote_mem_used_2]

The cache logs show:

<~21:10:00 process starts - logs are truncated, so the exact timestamp is missing, but our service wrapper simply launches this docker container>
…
… <tons of "Removing incomplete file" logs>
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Sorting cache files by atime.
2022/03/07 21:26:26 Building LRU index.
2022/03/07 21:29:41 Finished loading disk cache files.
2022/03/07 21:29:41 Loaded 54823473 existing disk cache items.
2022/03/07 21:29:41 Mangling non-empty instance names with AC keys: disabled
2022/03/07 21:29:41 gRPC AC dependency checks: enabled
2022/03/07 21:29:41 experimental gRPC remote asset API: disabled
2022/03/07 21:29:41 Starting gRPC server on address :8081
2022/03/07 21:29:41 Starting HTTP server on address :8080
2022/03/07 21:29:41 HTTP AC validation: enabled
2022/03/07 21:29:41 Starting HTTP server for profiling on address :8082
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
…

Most of the memory surge occurs during the “Removing incomplete file” steps, and a second surge occurs as the LRU index is built.

Attempted Mitigations: We tried restricting the memory allowance for the Docker container via docker’s -m flag, in the hope of at least keeping the process from being OOM-killed, but this did not suffice - the service simply became unresponsive.
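For reference, the limit was applied roughly like this (image name, mount, and size values below are illustrative placeholders, not our exact invocation; port mappings and other options are omitted):

    # Illustrative only: cap the container's memory with docker's -m flag.
    docker run -d \
      -m 80g \
      -v /bazel-remote:/bazel-remote \
      buchgr/bazel-remote-cache \
      --dir /bazel-remote --max_size 800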

Given that the memory issues became much worse following the Bazel 5 upgrade, we tweaked these Bazel flags (a rough .bazelrc sketch follows the list):

  • We unset the --experimental_remote_cache_async flag
  • We set --remote_max_connections=10 (we previously had it set to 0, which means no limit, but this didn’t affect gRPC connections prior to Bazel 5).
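For illustration, the relevant client-side configuration after these tweaks looks roughly like this (the cache endpoint is a placeholder for our real one):

    # .bazelrc sketch (illustrative)
    build --remote_cache=grpc://bazel-cache.example.com:8081
    build --remote_max_connections=10
    # --experimental_remote_cache_async is no longer set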

Even if these help (we’ll find out as tomorrow’s workday picks up), we’ll still be very close to running out of memory (as we were through February, before the Bazel 5 upgrade).

Is there some way to configure how much memory the bazel-remote process utilizes?


Most upvoted comments

We are now using GOMEMLIMIT, available in newer versions of Go. This solves the problem of a “transient spike in the live heap size”: https://tip.golang.org/doc/gc-guide

I added a similar suggestion to the systemd configuration example recently: https://github.com/buchgr/bazel-remote/commit/2bcc2f59e111f71b4de4d84013f8e93a1b981872
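As a sketch of that approach for a Docker deployment like the one described above (the limit value is an example only, and should be sized somewhat below the container’s or host’s memory limit; other options are placeholders):

    # Illustrative: give the Go runtime a soft heap limit below the hard
    # container limit, so the GC works harder instead of the process
    # being OOM-killed.
    docker run -d \
      -m 92g \
      -e GOMEMLIMIT=80GiB \
      -v /bazel-remote:/bazel-remote \
      buchgr/bazel-remote-cache \
      --dir /bazel-remote --max_size 800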


@liam-baker-sm: Thanks for the report.

Which storage mode is bazel-remote using in this scenario? In the ideal setup, with bazel-remote storing zstd-compressed blobs and Bazel requesting zstd blobs, they should be streamed directly from the filesystem without recompression.
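For reference, that ideal setup corresponds roughly to the following, if I have the flags right (the endpoint, mount, and size are placeholders):

    # Server side: store blobs zstd-compressed on disk.
    docker run -d -v /bazel-remote:/bazel-remote buchgr/bazel-remote-cache \
      --dir /bazel-remote --max_size 800 --storage_mode zstd

    # Client side (Bazel 5): request zstd-compressed blobs over gRPC.
    bazel build //... \
      --remote_cache=grpc://bazel-cache.example.com:8081 \
      --experimental_remote_cache_compression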

Hello, I can reproduce unusually high memory usage under a very specific configuration:

  • A gRPC connection between the Bazel client and the bazel-remote server.
  • Compressed transfer (--experimental_remote_cache_compression).
  • Top-level download (--remote_download_toplevel).

With this combination, memory use on the cache server reaches 10 GB; with --remote_download_toplevel removed, memory use on the server does not exceed 3 GB.

The test was performed with a large build (~40 GB of artefacts) from a single client on the same LAN. The server is for local office use and has an HTTP proxy backend defined, pointing to the main CI cache.
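In other words, the client invocation that triggers the high usage is roughly the following (the cache URL is a placeholder):

    # Hypothetical reproduction of the reported combination.
    bazel build //... \
      --remote_cache=grpc://office-cache.example.com:8081 \
      --experimental_remote_cache_compression \
      --remote_download_toplevel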

We had the same problem: once the disk cache size passed 1 TB, bazel-remote would OOM on a server with 64 GB of memory. Setting GOGC=20 solved the problem.
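As a sketch of that workaround (the value is workload-dependent; the Go default is GOGC=100, and the bazel-remote flags below are placeholders):

    # Illustrative: trigger garbage collection roughly 5x more often than the
    # default, trading CPU for a smaller heap. Pass it in the service's
    # environment, e.g. via `docker run -e GOGC=20` or a systemd Environment= line.
    export GOGC=20
    ./bazel-remote --dir /bazel-remote --max_size 800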

There are some notes on the GODEBUG environment variable here; it’s a comma-separated list of settings: https://pkg.go.dev/runtime?utm_source=godoc#hdr-Environment_Variables

One of the settings is madvdontneed=0 to use MADV_FREE (the old setting) instead of MADV_DONTNEED. You can read a little about what they mean here: https://man7.org/linux/man-pages/man2/madvise.2.html

It might also be worth setting gctrace=1 to get some GC stats in your logs.

You can also try playing with the GOGC environment variable, to trigger GC more often (also described in the pkg.go link above).
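A combined sketch of those runtime knobs (values are examples, not recommendations; the bazel-remote flags are placeholders):

    # Illustrative debugging environment for bazel-remote:
    #  - gctrace=1 prints a summary line to stderr after every GC cycle
    #  - madvdontneed=0 switches back to MADV_FREE when returning memory to the OS
    export GODEBUG=gctrace=1,madvdontneed=0
    # Optionally collect more often than the default GOGC=100:
    export GOGC=50
    ./bazel-remote --dir /bazel-remote --max_size 800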

Re the discrepancy between the memory profile’s view of memory usage and the system’s, there are so many different ways to count memory usage that I think the first step is to try to understand what each tool is measuring. Is that a screenshot from top? Is it running inside Docker, or outside?
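Since this deployment already exposes the profiling server on :8082 (see the startup logs above), it may be worth comparing a few views side by side, assuming that port serves the standard net/http/pprof handlers (container name and host below are placeholders):

    # Container-level view (cgroup accounting) vs. host-level view:
    docker stats <container-name>
    top -p $(pgrep bazel-remote)

    # Go heap view, pulled from the profiling port shown in the logs:
    go tool pprof -top http://<cache-host>:8082/debug/pprof/heap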