go: runtime: GC causes latency spikes
Hello, while running the program at https://github.com/raintank/raintank-metric/tree/master/metric_tank I'm seeing mark-and-scan times of ~15 s CPU time and 2000-2500 ms wall-clock time on an 8-core system, for a heap of about 6.5 GB (STW pauses are fine, at ~1 ms). I used https://circleci.com/gh/raintank/raintank-metric/507 to obtain the data below.
$ metric_tank --version
metrics_tank (built with go1.6, git hash 8897ef4f8f8f1a2585ee88ecadee501bfc1a4139)
$ go version
go version go1.6 linux/amd64
$ uname -a #on the host where the app runs
Linux metric-tank-3-qa 3.19.0-43-generic #49~14.04.1-Ubuntu SMP Thu Dec 31 15:44:49 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
I know the app is currently not optimized for GC workload: while I've gotten allocations down in various parts of the program, there are currently probably a million or more live pointers referencing pieces of data. I was going to work on optimizing this when Dave Cheney suggested there's a problem with the runtime and that I should file a bug (https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg). Here's the log with gctrace and schedtrace enabled: https://gist.githubusercontent.com/Dieterbe/18453451c5af0cdececa/raw/9c4f2abd85bb7a815c6cda5c1828334d3d29817d/log.txt
At http://dieter.plaetinck.be/files/go/mt3-qa-gc-vs-no-gc.zip you'll find a zip containing this log, the binary, a CPU profile taken during GC run 1482, and a CPU and heap profile taken between runs 1482 and 1483.
I also have these two dashboards that seem useful (they both end just after the spike induced by GC run 1482). https://snapshot.raintank.io/dashboard/snapshot/MtLqvc4F6015zbs4iMQSPzfizvG7OQjC shows memory usage, GC runs and STW pause times; it also shows that the incoming load (requests) of the app is constant, so any extra load is caused by GC, not by a changing workload. https://snapshot.raintank.io/dashboard/snapshot/c2zwTZCF7BmfyzEuGF6cHN9GX9aM1V99 shows the system stats; note the CPU spikes corresponding to the GC workload.
Let me know if there's anything else I can provide. Thanks, Dieter.
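Not part of the original report, but for anyone wanting to reproduce this kind of observation, here is a minimal, generic sketch of logging GC figures from inside a Go process, complementary to the gctrace/schedtrace log linked above:

```go
// Not part of the original report: a minimal sketch of logging GC pause and
// CPU figures from inside a Go process, complementary to the
// GODEBUG=gctrace=1,schedtrace=... output linked above.
package main

import (
	"log"
	"runtime"
	"time"
)

func logGCStats() {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&m)
		// PauseNs is a circular buffer; this index is the most recent pause.
		last := time.Duration(m.PauseNs[(m.NumGC+255)%256])
		log.Printf("heap=%d MiB numGC=%d lastSTW=%v gcCPU=%.2f%%",
			m.HeapAlloc>>20, m.NumGC, last, m.GCCPUFraction*100)
	}
}

func main() {
	go logGCStats()
	select {} // stand-in for the real application
}
```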
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 1
- Comments: 120 (78 by maintainers)
Commits related to this issue
- runtime: reduce GC assist extra credit Mutator goroutines that allocate memory during the concurrent mark phase are required to spend some time assisting the garbage collector. The magnitude of this ... — committed to golang/go by rhysh 8 years ago
- runtime: separate soft and hard heap limits Currently, GC pacing is based on a single hard heap limit computed based on GOGC. In order to achieve this hard limit, assist pacing makes the conservative... — committed to golang/go by aclements 7 years ago
- runtime: allow 5% mutator assist over 25% background mark Currently, both the background mark worker and the goal GC CPU are both fixed at 25%. The trigger controller's goal is to achieve the goal CP... — committed to golang/go by aclements 7 years ago
- runtime: flush assist credit on goroutine exit Currently dead goroutines retain their assist credit. This credit can be used if the goroutine gets recycled, but in general this can make assist pacing... — committed to golang/go by aclements 8 years ago
- cmd/trace: add minimum mutator utilization (MMU) plot This adds an endpoint to the trace tool that plots the minimum mutator utilization curve using information on mark assists and GC pauses from the... — committed to golang/go by aclements 7 years ago
- runtime: export total GC Assist ns in MemStats and GCStats At a high level, the runtime garbage collector can impact user goroutine latency in two ways. The first is that it pauses all goroutines dur... — committed to nvanbenschoten/go by nvanbenschoten 2 years ago
Just wanted to follow up with the resolution we eventually settled on at Laserlike (@stbaker’s company).
To sum up, the one-size-fits-all GC is absolutely not fitting us. I think the relevant feature of our use case is that we run with SetGCPercent(20), but note that the problem does not go away by setting this to 100 (and doubling the high base memory would be a very expensive "fix", anyway).

I believe the intention of how the GC would behave is that it would steal assist time in a balanced way, ramping up gradually as it loses ground, slowing down the program just enough so that it can complete its cycle before the specified memory growth % is reached. Graceful degradation like this would be fine. In practice, what is happening is that the GC triggers late, panics, and completely usurps all CPU for seconds at a time until it finishes its expensive scan. I don't know enough about how assist works to hazard a guess why, but there is never a gradual ramp-up of assist time; it goes from 0 to 95%+ of CPU almost instantly. Golang is proud of its tiny stop-the-world pauses, but there is very little functional difference between a STW pause and taking over all CPU cycles. From that perspective, our pauses are actually 5 s or more, which makes it impossible to function as a performant server.
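For readers unfamiliar with the knob mentioned above, SetGCPercent lives in runtime/debug and is the programmatic equivalent of the GOGC environment variable; a minimal sketch using the value from this comment:

```go
// Minimal sketch of the standard knob referenced above; 20 is the value from
// this comment, not a recommendation.
package main

import "runtime/debug"

func init() {
	// Equivalent to running with GOGC=20: start the next collection once the
	// heap has grown 20% beyond the live heap left by the previous cycle.
	debug.SetGCPercent(20)
}

func main() {}
```

Lowering the percentage caps heap growth at the cost of more frequent cycles, which is why raising it back to 100 would roughly double the base memory, as noted above.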
Thanks to everyone (Marko Kevac, @petermattis, @Dieterbe) who posted with the solution of editing the Go runtime. While we were at it, we actually added several knobs:
- the ability to turn GC assist off entirely
- the ability to change gcGoalUtilization of CPU from its default of 0.25
- the ability for the program to see whether a GC cycle is in progress (gcphase)

Simply turning assist off wasn't quite enough, as the GC was not able to keep up and we would OOM under load. Some amortized speed degradation was clearly necessary; we just needed it to be spread out rather than hitting us in one massive spike. So we also changed gcGoalUtilization to 0.5 and had the background workers stop whenever they noticed the GC was running. This was enough to completely, dramatically solve our latency issues.

It's an admirable goal to have a GC that "just works" for everyone and doesn't need tuning, but this is an ideal, and possibly not achievable. Until then, I think it's vital to add some more "knobs" to the GC (and discussions of it need to be more open about the downsides of its design choices - in the top 10 Google results for "golang garbage collection", assist is mentioned exactly once, without definition or elaboration). It hurts the adoption of Go if a scenario like this (one that should be solvable) requires editing the Go runtime itself.
@wendigo: take a look at go tool trace. Unfortunately, there's not a lot of documentation on it, but if you run go tool trace -h it will give you a high-level overview.
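For anyone who wants to collect such a trace from their own program, here is a minimal, generic sketch (the file name trace.out is arbitrary):

```go
// Minimal sketch: capture an execution trace to a file, then inspect it with
// `go tool trace trace.out`.
package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the workload under investigation here ...
}
```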
@dr2chase This is not quite the kind of "benchmark" you probably have in mind, but it demonstrates the latency problem: https://github.com/larytet-go/map-benchmark This is a simple HTTP server. When allocated memory reaches ~3 GB, the server stops responding for ~5 s. I keep the system at around 0% idle time. Arguably the demo - an in-memory database - is not a typical Go microservice. The service is CPU and memory (DDR) bound.
The issue is in no way a new one. I assume that the behavior is by design: if the GC sees that the system remains busy, it stays put (the GC "marker" lags allocations) until the amount of allocated memory crosses a threshold (GCPercent?). The threshold forces the GC's hand: it consumes all available cores, releases the allocated memory, and returns to hibernation in the background. I think I understand some of the trade-offs. The GC's main goal is to let the system serve occasional short spikes with minimal latency impact (?), and devops is expected to add nodes until the latency distribution hits the target.
I do have a strong opinion about this. The GC is excellent when the machine has CPUs to spare and/or the service waits for I/O a lot. I personally prefer predictable latency with a well-defined and reasonable worst case. I wonder why there is no separate list for long-lived objects, no const pointers, no access to runtime/unsafe malloc/free, no buddy allocator in the standard library, no further GC tuning options, no GC feedback, and so on. Is it possible to avoid scanning 10M pointers in a map each and every time the GC wakes up? Should I use a rate limiter in all services? I agree that my opinion is not a very educated one.
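Not an answer from the Go team, but one mitigation that sometimes applies to the "10M pointers in a map" case: if a map's key and value types contain no pointers, the collector does not have to scan the map's contents at all. A rough sketch that keeps records in one pointer-free slice and indexes them by integer:

```go
// Rough sketch (not a fix from this thread): a large index whose keys and
// values are pointer-free integers, referencing records stored in one slice.
// The GC does not need to scan the contents of a map whose key and value
// types contain no pointers, so growing this to millions of entries adds
// little mark work.
package main

type record struct {
	id    uint64
	value [16]byte // fixed-size payload, no pointers
}

type store struct {
	records []record          // one large, pointer-free backing slice
	index   map[uint64]uint32 // id -> position in records; no pointers either
}

func newStore() *store {
	return &store{index: make(map[uint64]uint32)}
}

func (s *store) put(r record) {
	s.index[r.id] = uint32(len(s.records))
	s.records = append(s.records, r)
}

func (s *store) get(id uint64) (record, bool) {
	i, ok := s.index[id]
	if !ok {
		return record{}, false
	}
	return s.records[i], true
}

func main() {
	s := newStore()
	s.put(record{id: 42})
	_, _ = s.get(42)
}
```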
In the trace on a 4 core system I see
Links:
- https://github.com/golang/go/issues/9477
- Probably relevant: https://github.com/golang/go/issues/27410
- GC performance: http://big-elephants.com/2018-09/unexpected-gc-pauses/
Go 1.18 introduced a new GC pacer (which reduces the amount of assists) and 1.19 introduced GOMEMLIMIT (I saw GOMAXHEAP mentioned somewhere earlier). We've also bounded sweeping on the allocation path back in Go 1.15, I believe. Skimming over the issue, I get the impression that there's a chance some or all of the issues that have remained here have since been addressed, beyond what was already fixed earlier. It's possible that may not be the case, but many of the sub-threads are fairly old.

~I'm inclined to put this into WaitingForInfo unless anyone here wants to chime in with an update. We can always file new issues if it turns out something remains (and that will probably be clearer and easier to manage than continuing a conversation that started halfway through this issue 😃).~ EDIT: It's already in WaitingForInfo. In that case, this is just an update.
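For readers arriving now, the memory limit mentioned above can be set either with the GOMEMLIMIT environment variable or from code via runtime/debug; a minimal sketch with an arbitrary 4 GiB value:

```go
// Minimal sketch of the Go 1.19+ soft memory limit mentioned above; the
// 4 GiB figure is arbitrary and equivalent to running with GOMEMLIMIT=4GiB.
package main

import "runtime/debug"

func main() {
	// Applies to the whole Go heap plus runtime overhead; the previous limit
	// (math.MaxInt64 by default, i.e. effectively none) is returned.
	prev := debug.SetMemoryLimit(4 << 30)
	_ = prev
}
```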
“Turning mark assist off” refers to the theft-of-time toggles for the graph; if you don’t count mark-assist “theft” of mutator time, the worst-case delay is below 10ms, and the longest window in which only 50% of the cpu is available is about 25-30ms.
Turning mark assist “on” (in the accounting), the 50% point is pushed out into seconds.
The chart shows "minimum mutator utilization" (MMU) and is available from "go tool trace". The time on the bottom (x-axis) is the size of a window that is slid across the trace, and the fraction (y-axis) is the minimum, over all windows of that size, of the fraction of time in which the mutator was able to run. The toggle buttons ("Include") let you specify things that count as not-mutation – if it's one of those, then the mutator is "not running". "View" lets you choose between the entire system and the worst case per goroutine ("System" slides the window across all threads together and takes the fraction for the whole system; "Per-goroutine" is instead goroutine-by-goroutine).
By default the graph shows the worst case, and toggling "Percentile" is supposed to give you more nuanced numbers, but I think there is some sort of bug, because I'm not seeing that no matter how hard I fiddle.
One thing to be careful of is that this is a distillation of your app’s memory allocation, and it is quite possible that it sees mark assist problems that your app will not. Spending a lot of time in mark assist happens when the mutator (your app) is allocating quickly during GC – when that happens a tax is applied to allocations to prevent the GC from falling too far behind. If your app does more work that is not allocation – or even if it does a lot more allocations but they are often smaller than in this benchmark – then you won’t see the mark assist problem.
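Not a patch from this thread, but the usual application-side lever for the assist tax described above is reducing the allocation rate, for example by recycling buffers with sync.Pool; a rough sketch:

```go
// Illustrative only (not a patch from this thread): recycling buffers lowers
// the allocation rate, which is what the assist tax described above scales with.
package main

import (
	"io"
	"sync"
)

var bufPool = sync.Pool{
	New: func() interface{} { b := make([]byte, 0, 4096); return &b },
}

func handle(w io.Writer, payload []byte) {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0]

	// Build the response into the reused buffer instead of a fresh allocation.
	buf = append(buf, payload...)
	_, _ = w.Write(buf)

	*bp = buf
	bufPool.Put(bp)
}

func main() {
	handle(io.Discard, []byte("example"))
}
```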
Spending a lot of time in sweep (your original problem) is a property of a heap that is very well-filled; it’s large, and it’s 100% live objects. Right after GC completes, a “background sweeper” thread starts to figure out where free space is, but if the mutator allocates before that thread has processed very much memory, then the mutator has to go looking (sweeping) instead. When a lot of the heap is 100% filled, it can look for a while. The patch for that simply says “after you’ve looked at 100 spans with no luck, give up and allocate a new span”.
So, if you’re up for patching a Go release, you might try https://go-review.googlesource.com/c/go/+/187817/ and see if your latency problem goes away (meaning your real app does other stuff), or if it only gets a little better (your app allocates a lot anyway and we need to improve mark assist).
Thanks very much for this benchmark. This gives us a nice reproducible case (taken from a real-world app) of sweep being a problem, which will probably get that fix into 1.15, and it provides a nice motivating case for improving mark assist.
In my experience, keeping a large map (10M entries) seriously impacts the system latency. Under heavy load the typical response time is under 100 microseconds. When GC kicks in, the system stops responding altogether for 3-4 seconds; this happens every ~5 minutes. The system has 4 cores. In the gctrace I see that the GC frees ~8 GB of memory, and that takes the better part of the 4 s. In this scenario the GC consumes all cores.
The flow is: default HTTP server gets a short HTTP GET (under 200 bytes), handler looks in the map (syncmap), generates a short response (under 200 bytes). The map gets a new entry or two every few hours.
One approach would be for the GC to avoid scanning objects which appear to be long-lived. A low-level malloc/free API in the unsafe package could help me as well.
I switched to BigCache which solved the problem.
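For context on why that switch helps: BigCache keeps entries serialized in large byte slices addressed by integer offsets, so the cached data itself contributes almost no pointers for the GC to trace. A much-simplified sketch of the idea (not BigCache's actual API):

```go
// Much-simplified sketch of the idea behind BigCache (not its real API):
// entries live serialized in one big byte slice and are addressed by integer
// offsets, so the cached data adds almost no pointers for the GC to trace.
package main

import "hash/fnv"

type arenaCache struct {
	data  []byte            // all values appended here, back to back
	index map[uint64]uint64 // hash(key) -> offset<<32 | length
}

func newArenaCache() *arenaCache {
	return &arenaCache{index: make(map[uint64]uint64)}
}

func hashKey(key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	return h.Sum64()
}

func (c *arenaCache) set(key string, value []byte) {
	off := uint64(len(c.data))
	c.data = append(c.data, value...)
	c.index[hashKey(key)] = off<<32 | uint64(len(value))
}

func (c *arenaCache) get(key string) ([]byte, bool) {
	e, ok := c.index[hashKey(key)]
	if !ok {
		return nil, false
	}
	off, n := e>>32, e&0xffffffff
	return c.data[off : off+n], true
}

func main() {
	c := newArenaCache()
	c.set("answer", []byte("42"))
	_, _ = c.get("answer")
}
```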
Please check the repro code here: https://github.com/genez/collectiontest Please note this is extremely simplified; the production code does more computation (JSON and string handling, logging, etc.) that leads to >50 ms latency.
In production code we are not using stdlib maps because they do not shrink and cause OOM panics; we use a custom data structure (an rb+ tree where we serialize the items as a byte slice).
On my test bench (Intel i9, 16 cores, 32 GB of RAM) I can repro >10 ms latency with go version go1.13 windows/amd64.
It looks like the platform (Windows, Linux) does not matter.
I don’t have precisely the same machine setup as previously, so the numbers below differ from ones I’ve reported earlier. Before running the tests below, I let my load generator run as fast as possible to see the capacity of the system. Then I limited the rate of the load generator to ~50% of the max. Here are the numbers for a 3-node cockroach cluster compiled with go1.9 vs go-tip (devel +33c246f Tue Nov 7 14:18:00 2017 +0000):

Big thumbs up on the reduction in 95%-tile and 99%-tile latencies. For reference, if I don’t rate limit the load generator I see:
From the rate limited runs, here are a few lines of gctrace output from go1.9 compiled binary:
And from go-tip:
@dkisman-laserlike can you please share your patch?
I spent much of this week working on this. I still have ideas on my list, but wanted to post an update.
First, I was able to bring up the whole system with a custom runtime and observe the behavior through Grafana and vegeta report. I don’t think I was able to quite reproduce the problem locally, though. My laptop simply didn’t have the resources, so it was swapping and over-saturating the CPU and causing all sorts of horrible latency. My workstation had the resources (though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!), but I’m not sure it exhibited the problem. According to vegeta, max latency was ~350ms, but P99 was just 2ms in several runs. I did see small jumps in max “request handle duration” correlated with GC in Grafana, though they were only around 15ms.

However, I was still able to dig into the execution traces and glean some insight. I fixed some problems with high goroutine churn potentially losing track of assist credit, and with the way we build up spare assist credit causing long assists when the assist ratio is high (which it is in Metric Tank). I also did several more experimental changes, which did get rid of the bad checkerboard pattern in the execution trace and break it up into a much more finely interleaved workload. This is certainly what we want to see in the trace, but it didn’t seem to move the overall stats. I have a few more ideas on why that may be, but I’ll be out next week, so it may be a little bit before I can try them.
I’m going to mail a bunch of my experimental commits, which should get linked here shortly. @Dieterbe, if you have time, it would be good try running with these, though I’m not sure you’ll see much difference.
I did notice one thing you may be able to change on the application side to mitigate some of the problem. I think Metric Tank is allocating very small objects at a very high rate. This is causing the GC scan rate to be much lower than typical, since the memory access pattern is even more random than typical. This is stretching out the GC assists and interacting poorly with some of the tuning that assumes a more typical scan rate. It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.
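In case it helps anyone following along, one common way to gather that profile (an assumption, not necessarily how Metric Tank is instrumented) is to expose net/http/pprof and pull the heap profile with -alloc_objects:

```go
// One common way (an assumption, not necessarily what Metric Tank does) to
// expose the heap profile; fetch it with:
//   go tool pprof -alloc_objects http://localhost:6060/debug/pprof/heap
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real application
}
```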
I dug into the trace you posted (thanks!) and took another look at your gctrace. This is pretty clearly an over-assist problem. When the GC is running, the expected 25% of the CPU goes to dedicated GC workers, but nearly all of the rest of the CPU time goes to GC assists triggered by allocation. Hence, when GC is running, even though it’s technically concurrent, nearly all of the CPU is being soaked up by GC. You can see this really clearly in the trace:
The GC kicks in in the middle. When the GC isn’t running, all of the user goroutine executions are really short and the system keeps up. As soon as GC kicks in, two Ps go to dedicated gcBgMarkWorkers (expected) and everything else turns to a checkerboard of 10–20 ms long user goroutine time slices (not expected). Because of this, the run queue lengths climb rapidly (the second row in the group of three at the top), which is almost certainly what’s causing the latency spikes. Right after GC finishes there’s a period where the user goroutines are running quickly again but they’re packed much more densely as it drains the long run queues and returns to normal.
I don’t yet understand why this is happening. The heap is in steady state, so I would expect it to compute an assist ratio that causes it to finish roughly half way between the trigger and the goal, but it seems to be computing a much higher assist ratio and finishing much earlier.
@Dieterbe, could you re-run your application with GODEBUG=gctrace=1,gcpacertrace=1 and grab stderr? It would be awesome if you can run with master, since there’s more gcpacertrace output than there was in 1.6, but even the 1.6 output would be helpful.
@Dieterbe, it’s possible. Could you try the fix I posted for #10345? (https://golang.org/cl/23540)
Note that it’s not that the GC thread blocks the map. Mutators are free to read and write the map while GC is scanning it; there’s no synchronization on the map itself. The issue is that whatever thread gets picked to scan the buckets array of the map is stuck not being able to do anything else until it’s scanned the whole bucket array. If there’s other mutator work queued up on that thread, it’s blocked during this time.
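A workaround some applications reach for (a hedged sketch, not something suggested in this thread): shard one huge map into many smaller maps, so that each bucket array scanned in one go is correspondingly smaller, at the cost of an extra hash and lock per access:

```go
// Hedged workaround sketch (not something proposed in this thread): splitting
// one huge map into shards means each contiguous bucket array the GC has to
// scan in one go is correspondingly smaller.
package main

import (
	"hash/fnv"
	"sync"
)

const shardCount = 256

type shard struct {
	mu sync.RWMutex
	m  map[string][]byte
}

type shardedMap struct {
	shards [shardCount]shard
}

func newShardedMap() *shardedMap {
	s := &shardedMap{}
	for i := range s.shards {
		s.shards[i].m = make(map[string][]byte)
	}
	return s
}

func (s *shardedMap) shardFor(key string) *shard {
	h := fnv.New64a()
	h.Write([]byte(key))
	return &s.shards[h.Sum64()%shardCount]
}

func (s *shardedMap) put(key string, val []byte) {
	sh := s.shardFor(key)
	sh.mu.Lock()
	sh.m[key] = val
	sh.mu.Unlock()
}

func (s *shardedMap) get(key string) ([]byte, bool) {
	sh := s.shardFor(key)
	sh.mu.RLock()
	v, ok := sh.m[key]
	sh.mu.RUnlock()
	return v, ok
}

func main() {
	m := newShardedMap()
	m.put("k", []byte("v"))
	_, _ = m.get("k")
}
```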
(Sorry I haven’t had a chance to dig into the trace you sent.)