go: runtime/debug: GOMEMLIMIT prolonged high GC CPU utilization before container OOM

What version of Go are you using (go version)?

go1.19.5, linux/amd64

Does this issue reproduce with the latest release?

yes

What did you do?

I ran a slowly leaking memory application with GOMEMLIMIT and GOGC=-1.
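For reproduction purposes, the workload shape is roughly the following (a minimal sketch, not our actual application; sizes and rates are made up):

package main

// A minimal sketch of the workload shape: steady allocation churn plus a
// slow, unbounded growth of retained memory. Run with something like
// GOMEMLIMIT=250MiB GOGC=-1 (values are illustrative).

import "time"

var (
    retained [][]byte // the slow "leak": grows forever
    sink     []byte   // keeps transient allocations from being optimized away
)

func main() {
    for {
        // Transient garbage, standing in for normal request work.
        for i := 0; i < 1000; i++ {
            sink = make([]byte, 4<<10)
        }
        // Retain ~1 MiB per second, never freed.
        retained = append(retained, make([]byte, 1<<20))
        time.Sleep(time.Second)
    }
}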

What did you expect to see?

I expected the application to run fine for a while. After a few days, I expected the GC CPU utilization to increase gradually (slightly, by <25-50%). I then expected the application to be OOM-killed by the kernel.

What did you see instead?

After the initial, expected GC CPU utilization increase, CPU usage increased dramatically (to ~75% of the CPU time available to the container). The application remained in this degraded state for a long time (5-8+ hours).

Report

This is a continuation of a conversation started in https://github.com/golang/go/issues/48409#issuecomment-1399433014 - I thought I’d start a new issue.

tl;dr:

  • GOMEMLIMIT works great for 95% of our applications; we’d like to enable it globally
  • the remaining 5% are applications slowly leaking memory; it’s unfeasible to fix them
  • for these applications, we’re seeing significant degradation for a couple of hours before the container is OOMed
  • we’d want the applications to die faster (and be restarted as they are now) - the degradation is too “expensive”.

Investigation

Hey @mknyszek, I come bringing more data. In summary:

  • testing “slow memory leaks” is slow
  • I’m able to reproduce the degradation on both staging and production consistently
  • I should be able to get gctraces; I’m giving the staging applications more time to crash
  • the CPU utilization degradation is significant (75% of the container)

I tested two of our applications by dynamically adjusting GOMEMLIMIT.

1. A latency-sensitive proxy in production.

Observations:

  • I’m able to consistently trigger the degradation by setting the memory limit manually
  • surprising: the CPU usage increase is not gradual but sudden (1.5 cores -> 3.8 cores)
  • a 200 MiB difference in GOMEMLIMIT can trigger the degraded behavior
  • RSS increases by ~40 MiB/hour, so it would take ~10 hours for the OOM to kick in with GOMEMLIMIT at 95% of the container memory.

The container runs with GOMAXPROCS=4 (a 4-core cgroup limit) and 10 GiB of memory. At the time of the tests, the cgroup-reported RSS was 9.2 GiB.

The test was:

  • I started reducing GOMEMLIMIT from 10 GiB down in 100 MiB batches (the first spike on the graph; a sketch of the mechanism follows this list):
    – 9563676416 - normal
    – 9463676416 - increase to 150%
    – 9363676416 - increase to 300%
  • I reverted GOMEMLIMIT to 9563676416
  • after 1 hour we start seeing some CPU increase; after 2 hours CPU utilization is at ~1.3-1.4 cores
  • after 4 hours the utilization goes through the roof.
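For reference, a minimal sketch of this kind of stepping using runtime/debug.SetMemoryLimit (our actual mechanism is internal tooling; the sleep between steps is illustrative):

package main

import (
    "runtime/debug"
    "time"
)

func main() {
    debug.SetGCPercent(-1) // GOGC=-1: the memory limit is the only GC trigger

    // Step the limit down in 100 MiB decrements, as in the test above.
    for _, limit := range []int64{9563676416, 9463676416, 9363676416} {
        debug.SetMemoryLimit(limit)
        time.Sleep(30 * time.Minute) // observation window (illustrative)
    }

    // Revert to the first step, as in the test.
    debug.SetMemoryLimit(9563676416)

    select {} // the application would keep serving here
}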

Container CPU utilization before the test was fairly stable for ~2 days, so I don’t think any external factors affected the test. The previous spikes are from me modifying GOMEMLIMIT manually.

Benchmarks

I ran three benchmarks during the “normal” and “high CPU” periods: (1) unlimited throughput, (2) 1000 RPS, (3) 100 RPS. Throughput is (surprisingly) only ~10% lower; the latency changes are clearly visible, though. This is the same host, with GOMEMLIMIT set at all times and without restarts.


Each benchmark below shows the “degraded” run first, followed by the “non-degraded” run.

Unlimited throughput:

Benchmark parameters:                       
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    0
  Max duration:    1m0s
  Max RPS:         0
                                                      
                                                      
Latencies:       
  0.5000: 1.505503ms
  0.9000: 3.116004ms
  0.9500: 3.940367ms   
  0.9900: 7.068518ms
  0.9990: 29.626833ms
  0.9995: 97.738173ms
  1.0000: 901.724891ms
Elapsed time (seconds):         60.05
Total requests:                 1899024
RPS:                            31622.10

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    0
  Max duration:    1m0s
  Max RPS:         0
Latencies:
  0.5000: 1.186803ms
  0.9000: 3.468514ms
  0.9500: 4.749452ms
  0.9900: 9.265477ms
  0.9990: 27.223769ms
  0.9995: 65.71919ms
  1.0000: 538.934235ms
Elapsed time (seconds):         60.01
Total requests:                 2091010
RPS:                            34842.09

1000 RPS:

Benchmark parameters:
  CPUs:            32
  Connections:     64                                 
  Concurrency:     1 
  Max requests:    60000
  Max duration:    1m0s
  Max RPS:         1000
Latencies:              
  0.5000: 649.981µs    
  0.9000: 1.512459ms   
  0.9500: 16.059954ms  
  0.9900: 612.542139ms 
  0.9990: 840.800883ms
  0.9995: 860.559206ms
  1.0000: 930.54864ms 
Elapsed time (seconds):         60.00
Total requests:                 52173
RPS:                            869.55

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    60000
  Max duration:    1m0s
  Max RPS:         1000
Latencies:
  0.5000: 422.508µs
  0.9000: 694.131µs
  0.9500: 1.017692ms
  0.9900: 33.982008ms
  0.9990: 112.48922ms
  0.9995: 120.384829ms
  1.0000: 132.146788ms
Elapsed time (seconds):         60.00
Total requests:                 59770
RPS:                            996.16

100 RPS:

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    6000
  Max duration:    1m0s
  Max RPS:         100
Latencies:
  0.5000: 846.241µs
  0.9000: 97.532361ms
  0.9500: 348.847315ms
  0.9900: 643.727969ms
  0.9990: 773.982663ms
  0.9995: 798.205955ms
  1.0000: 827.131386ms
Elapsed time (seconds):         60.00
Total requests:                 5992
RPS:                            99.87

Benchmark parameters:
  CPUs:            32
  Connections:     64
  Concurrency:     1
  Max requests:    6000
  Max duration:    1m0s
  Max RPS:         100

Latencies:
  0.5000: 620.282µs
  0.9000: 940.553µs
  0.9500: 1.44535ms
  0.9900: 70.505245ms
  0.9990: 145.183217ms
  0.9995: 157.922399ms
  1.0000: 187.452379ms
Elapsed time (seconds):         60.00
Total requests:                 6000
RPS:                            100.00  

2. A leader-elected, not-so-important application.

The leader allocates a lot. The backup containers gather some data (memory “leaks” on all of them) but don’t allocate as much. Containers run with 3 CPUs and 16 GiB of memory.

The test: I restarted all containers with GOMEMLIMIT=4GiB and let them run.


We see:

  • gradual CPU usage increase over time (this is the GC, there’s no external work affecting the usage)
  • after 24 hours, my GOMEMLIMIT override expires (we have an enforced timeout; see the sketch after this list), and the CPU utilization briefly drops back to the “base” level
  • I reset GOMEMLIMIT to 4 GiB; CPU usage is now higher
  • the CPU usage is stable for 2 hours
  • after 2 hours, CPU usage starts climbing to 3 cores (the maximum)
  • the container crashes for an unrelated reason
  • a backup picks the work up and is immediately throttled (it should have as much leaked data as the previous leader)
  • the memory continues to leak at ~100 MiB per 5 hours; with GOMEMLIMIT at 95% of the container memory, it would be ~8 hours before the crash.
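The “enforced timeout” is our own tooling; a rough sketch of the expire-then-reset behavior, assuming the limit is applied through runtime/debug (the 24h window and the 4 GiB value come from this test, everything else is illustrative):

package main

import (
    "math"
    "runtime/debug"
    "time"
)

func main() {
    const limit = 4 << 30 // 4 GiB, as in this test

    debug.SetGCPercent(-1)
    debug.SetMemoryLimit(limit)

    // The tooling enforces a timeout on manual overrides; when it fires,
    // the limit is effectively removed (math.MaxInt64 means "no limit").
    time.AfterFunc(24*time.Hour, func() {
        debug.SetMemoryLimit(math.MaxInt64)
    })

    select {} // application work would go here
}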

This case is interesting because it’s similar to a “failover” scenario where the application might suddenly get 50% extra traffic.

We measure cgroup nr_throttled periods and compute a “healthiness” metric from them.
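A rough sketch of that measurement, assuming cgroup v2 and defining “healthiness” as the share of CPU enforcement periods that were not throttled (the path and the exact formula are assumptions, not necessarily what our tooling does):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    // cgroup v2 exposes throttling counters in cpu.stat.
    f, err := os.Open("/sys/fs/cgroup/cpu.stat")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var periods, throttled float64
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) != 2 {
            continue
        }
        v, _ := strconv.ParseFloat(fields[1], 64)
        switch fields[0] {
        case "nr_periods":
            periods = v
        case "nr_throttled":
            throttled = v
        }
    }

    // "Healthiness": fraction of enforcement periods that were not throttled.
    if periods > 0 {
        fmt.Printf("healthiness: %.2f%%\n", 100*(1-throttled/periods))
    }
}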

I’ll try to get gctraces over the next few days.

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 34 (22 by maintainers)

Most upvoted comments

You are correct that constant memory growth will result in a non-constant increase in GC CPU usage, and your reasoning sounds right to me.

I was just leaving myself a mental note to look at that GC trace in general since I was still catching up (I hadn’t looked closely at every log yet). 😅 What I wanted to double-check is if that was happening before or after the push past the limit, but you already annotated the log with that info so yup! It’s pretty clear this is WAI. Thanks.

I ran some more tests and am adding more screenshots; they mostly show the same behavior.

Here we see the GC using 2x more CPU than the actual work (66% of the container).

Here it looks even more dramatic: the GC appears to use 10x more CPU than the actual “work”.
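As a cross-check of these ratios from inside the process, the runtime exposes a cumulative GC CPU fraction; a minimal sketch (the one-minute logging interval is arbitrary):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    for range time.Tick(time.Minute) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // GCCPUFraction is the fraction of this program's available CPU time
        // used by the GC since the program started (cumulative, not a window).
        fmt.Printf("gc cpu fraction: %.2f%% (heap allocated ~%d MiB)\n",
            m.GCCPUFraction*100, m.HeapAlloc>>20)
    }
}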

@zephyrtronium - that’s “kinda” what we actually do in our pre-GOMEMLIMIT setup. We watch the runtime GC stats and update GOGC to slow things down (and eventually OOM); a rough sketch of that loop follows the list below. This gives us a bit more time before the OOM, but mostly:

  • in a perfect world, we thought we could kill our custom code
  • the behavior we observe is different from what is described in the proposal (i.e., the GC uses more CPU than expected).
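For context, a rough sketch of what that pre-GOMEMLIMIT loop looks like, assuming runtime.MemStats and debug.SetGCPercent (the threshold, interval, and the exact stats we watch are illustrative, not our real values):

package main

import (
    "runtime"
    "runtime/debug"
    "time"
)

// Illustrative values only; the real thresholds and stats differ.
const (
    softLimit   = 9 << 30 // collect more aggressively above ~9 GiB of heap
    minGOGC     = 10
    defaultGOGC = 100
)

func gogcWatchdog() {
    for range time.Tick(30 * time.Second) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        if m.HeapAlloc > softLimit {
            // Trade CPU for memory as we approach the container limit;
            // if the leak continues, the process eventually OOMs anyway.
            debug.SetGCPercent(minGOGC)
        } else {
            debug.SetGCPercent(defaultGOGC)
        }
    }
}

func main() {
    go gogcWatchdog()
    select {} // application work would go here
}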

@mknyszek: no worries about the delay; thanks for responding. (I’ll be offline soon for ~10 days too.)

For instance, you set GOGC=10 and a memory target of whatever you set your memory limit to currently. The behavior this configuration would have is to use the memory target amount of memory until you get within 10% of the memory target. From then on, it behaves like GOGC=10.

Oh, this was not my understanding after reading the docs. Even the sliders in https://go.dev/doc/gc-guide suggest that setting GOGC=10 affects GC frequency long before arriving at GOMEMLIMIT (i.e. it doesn’t matter if my GOMEMLIMIT is 1 GB or 20 GB if my “live heap” usage is low - GC will run frequently). Am I misunderstanding?

I’ll run the same tests with GOGC=10 to observe the behavior.
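Concretely, the follow-up test configuration would be something like the following (the limit value is the one from the proxy test; setting GOGC=10 and GOMEMLIMIT=9563676416 in the environment would be equivalent):

package main

import "runtime/debug"

func main() {
    // GOGC=10 combined with the memory limit (instead of GOGC=-1),
    // to compare against the runs above.
    debug.SetGCPercent(10)
    debug.SetMemoryLimit(9563676416)

    select {} // application work would go here
}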

The memory limit seems fundamentally a bad fit because what you want here is to OOM

Agreed. However, we cannot know up front which applications are leaking, and any application might start leaking at any time. That makes GOMEMLIMIT a “fundamentally bad fit” for any of our applications, which is really sad.

Yeah, the performance will be better when you’re close to the memory limit, but there’s still going to be some kind of degradation.

Yeah, the degradation is expected. As long as it’s bounded, it’s totally fine. In the cases we see above, the GC uses up to 80% of the available cores - and 10x more time than the application “itself”. This seems divergent from the proposal, right?

Understood that there are no easy or quick fixes here. My highest hope was that maybe in Go 1.21/1.22 you’d magically come up with a fix that limits the GC more aggressively (any of: <25% of GOMAXPROCS, <100% of the “real work”, <2 cores, or something else).