go: runtime: long pauses STW (sweep termination) on massive block allocation
What version of Go are you using (go version)?
~ ᐅ go version go version go1.12.1 darwin/amd64
What operating system and processor architecture are you using (go env)?
go env Output
~ ᐅ go env GOARCH="amd64" GOBIN="" GOCACHE="/Users/un0/Library/Caches/go-build" GOEXE="" GOFLAGS="" GOHOSTARCH="amd64" GOHOSTOS="darwin" GOOS="darwin" GOPATH="/Users/un0/go" GOPROXY="" GORACE="" GOROOT="/usr/local/Cellar/go/1.12.1/libexec" GOTMPDIR="" GOTOOLDIR="/usr/local/Cellar/go/1.12.1/libexec/pkg/tool/darwin_amd64" GCCGO="gccgo" CC="clang" CXX="clang++" CGO_ENABLED="1" GOMOD="/dev/null" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/bf/sm9v_xtn4tj6xz_rmmwp5tzm0000gn/T/go-build942797956=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
We have a big entity cache inside an application. So it’s loading in background goroutine. And when the application allocates a slice in the background the latency inside all response handlers increases.
Got a reproducer
package main
import (
	"fmt"
	"math/rand"
	"os"
	"runtime/trace"
	"time"
)
type Rates struct {
	a, b, c, d float64
}
var maxWait time.Duration
func main() {
	const size = 100e6
	out, err := os.OpenFile("prof", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer func() {
		_ = out.Sync()
		_ = out.Close()
	}()
	err = trace.Start(out)
	if err != nil {
		panic(err)
	}
	// simulate some work
	go loop()
	time.Sleep(3 * time.Second)
	// make a huge allocation in parallel
	go func() {
		_ = make([]Rates, size)		// << huge alloc
	}()
	time.Sleep(3 * time.Second)
	trace.Stop()
	fmt.Println("maxWait =", maxWait.String())
}
func loop() {
	for {
		now := time.Now()
		r := make([]Rates, 0)
		for i := 0; i < 100; i++ {
			r = append(r, Rates{
				rand.Float64(), rand.Float64(), rand.Float64(), rand.Float64(),
			})
		}
		d := time.Since(now)
		if maxWait < d {
			maxWait = d
		}
	}
}

What did you expect to see?
Worker gorotine(loop in example) should not pause more than 1-5ms.
What did you see instead?
43ms pauses
About this issue
- Original URL
- State: open
- Created 5 years ago
- Reactions: 8
- Comments: 26 (24 by maintainers)
For 1.15, one of my planned follow-up tasks on non-cooperative preemption is to make memclrNoHeapPointers preemptible. For large pointer-free allocations, this should make the long zeroing operation preemptible.
This is going to be a good bit easier to do with the new allocation headers landing in Go 1.22. Punting to Go 1.23, but this time for sure. 😃
This issue is currently labeled as early-in-cycle for Go 1.22. That time is now, so a friendly reminder to look at it again.
The GC doesn’t work in partial objects. That is, it doesn’t track the “earliest” reference into a slice, just that there was some reference to the slice. And once the entire slice is retained by GC, it also retains everything it points to.
One could certainly imagine GC tracking the earliest pointer into a slice and only scanning the slice past that point (in fact, you could do this for all objects). This gets really tricky when there are multiple references to an object. For example, the GC could have already followed one reference and scanned the object past that point, only to later find a reference that points earlier in the object and either have to scan it again or track the point in each object it last scanned from. It’s all theoretically doable, but it wouldn’t pay for its own costs in most cases.
If you want GC to collect the unreferenced elements, set
q[0]to nil before slicing it. But that’s not going to help with the large allocations.To be clear, the reproducer is making a 3.2 GiB allocation which is going to take a while for the OS to fulfill.
However, I dug into this a bit and I’m seeing ~1s pauses, sometimes with only ~50ms pauses.
The 50ms pauses represent a lower bound for how fast the OS can give us back pages. The ~1s pauses actually come from the fact that we’re zeroing the entire 3.2 GiB allocation, even though it just came from the OS.
We have an optimization which avoids re-zeroing fresh pages, but unfortunately this optimization is conservative. For example, if a needzero span at the end of the heap is contiguous with the new 3.2 GiB allocation, then those two are going to coalesce, and the 3.2 GiB free space will be marked needzero, when really only the first N KiB of it actually needs to be zeroed.
One way we could fix this is by making needzero a number, and if we can guarantee that only the first N KiB need zeroing, then we only zero that. If it’s some weird swiss cheese of non-zeroed memory then we just assume the whole thing needs zeroing, rather than trying to keep track of which parts need zeroing.
I’m not sure if this fix is worth it though, since future allocations would still have to go back and zero this memory anyway, causing the 1s delay to return.
Also, I think it’s too late in this cycle to do anything here for Go 1.21 unfortunately (clearing memory that will soon contain pointers is subtle) but I will do something here for 1.22 early-in-cycle.
We’ve brainstormed a few ideas and I feel pretty OK about the idea of faking a span as noscan until its’ allocation is ready to be returned. We already do this kind of faking elsewhere, so we’re at least not adding too much new complexity.
I did some thinking about breaking up the zeroing of large pointerful objects. I think it’s doable, but it’s pretty subtle, and I might be missing something (there may be some GC interaction I’m not seeing, specifically).
It would look something like this:
There’s an additional detail here that we don’t update freeIndex until the second non-preemptible section, I think.
Setting the span as scannable late and non-atomically works because if it happens during a GC cycle (or if we transition into a GC cycle while zeroing) we’ll mark the object before returning, so it’ll never need to get scanned (and thus its class doesn’t matter). STWs also effectively global memory barriers so GC transitions are safe.
I think the only downside is how subtle this all is.
Just commenting since we’re also seeing this issue on some services internally. The worst outliers can be up to 15s:
Here is some detail on a 2s pause we were able to catch while running an execution trace.
This isn’t critical for us and we think we know how we could mitigate the problem by changing our code. I’m just commenting since it came up in a discussion with @prattmic today. My colleague @pmbauer did the investigation on this.
@mknyszek
I am using a tip version
go version devel go1.18-527609d47b Wed Aug 25 17:07:58 2021 +0200 darwin/arm64. The profile is collected from an app running on iOS.The GC pause reported by
runtime/metricspackage is as follows:Examples:
GC idle took 0.02 second:
GC fractional took 1.2 second and 1.8 second:
Sweep termination took 1.8 second:
I hope I found good examples, if not: there are more examples, see the trace profile:
https://drive.google.com/file/d/1OaMruMiu8RWBXclQDRuf7MGDkzjkR4PP/view?usp=sharing
I had a few running cases where the max observation was 4 seconds (didn’t collect trace profile for this case, but the reproducer should be able to observe that again):
The reproducer (depend on develop branch f52dd5f2461b4b6396df83a6a723037f74a75a72):