go: runtime: long pauses STW (sweep termination) on massive block allocation

What version of Go are you using (go version)?

~ ᐅ go version
go version go1.12.1 darwin/amd64

What operating system and processor architecture are you using (go env)?

go env Output
~ ᐅ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/un0/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/un0/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12.1/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12.1/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/bf/sm9v_xtn4tj6xz_rmmwp5tzm0000gn/T/go-build942797956=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

We have a big entity cache inside the application, and it is loaded in a background goroutine. While that goroutine allocates the large slice, latency inside all response handlers increases.

Here is a reproducer:

package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime/trace"
	"time"
)

type Rates struct {
	a, b, c, d float64
}

var maxWait time.Duration

func main() {
	const size = 100e6

	out, err := os.OpenFile("prof", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer func() {
		_ = out.Sync()
		_ = out.Close()
	}()
	err = trace.Start(out)
	if err != nil {
		panic(err)
	}

	// simulate some work
	go loop()

	time.Sleep(3 * time.Second)
	// make a huge allocation in parallel
	go func() {
		_ = make([]Rates, size)		// << huge alloc
	}()
	time.Sleep(3 * time.Second)

	trace.Stop()

	fmt.Println("maxWait =", maxWait.String())
}

func loop() {
	for {
		now := time.Now()
		r := make([]Rates, 0)
		for i := 0; i < 100; i++ {
			r = append(r, Rates{
				rand.Float64(), rand.Float64(), rand.Float64(), rand.Float64(),
			})
		}
		d := time.Since(now)
		if maxWait < d {
			maxWait = d
		}
	}
}


What did you expect to see?

The worker goroutine (loop in the example) should not pause for more than 1-5ms.

What did you see instead?

43ms pauses

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Reactions: 8
  • Comments: 26 (24 by maintainers)

Most upvoted comments

For 1.15, one of my planned follow-up tasks on non-cooperative preemption is to make memclrNoHeapPointers preemptible. For large pointer-free allocations, this should make the long zeroing operation preemptible.
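
For illustration, a rough sketch of what preemptible zeroing could look like in ordinary Go (chunkedZero and the chunk size are made up for this sketch; the real memclrNoHeapPointers is runtime assembly):

package main

import "fmt"

// chunkedZero sketches preemptible clearing of a large pointer-free block:
// clear in bounded chunks so there is an opportunity to preempt between
// chunks, instead of one long, non-preemptible clear of the whole block.
func chunkedZero(b []byte) {
	const chunk = 256 << 10 // arbitrary illustrative chunk size (256 KiB)
	for len(b) > 0 {
		n := chunk
		if n > len(b) {
			n = len(b)
		}
		for i := range b[:n] {
			b[i] = 0
		}
		b = b[n:]
		// In the runtime, an explicit preemption check would sit here.
	}
}

func main() {
	buf := make([]byte, 3<<20)
	chunkedZero(buf)
	fmt.Println("cleared", len(buf), "bytes")
}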

This is going to be a good bit easier to do with the new allocation headers landing in Go 1.22. Punting to Go 1.23, but this time for sure. 😃

This issue is currently labeled as early-in-cycle for Go 1.22. That time is now, so a friendly reminder to look at it again.

Is there a particular reason that the GC does not collect those unreferenced elements in the slice?

The GC doesn’t work on partial objects. That is, it doesn’t track the “earliest” reference into a slice, just that there was some reference to the slice. And once the entire slice is retained by the GC, it also retains everything it points to.

One could certainly imagine GC tracking the earliest pointer into a slice and only scanning the slice past that point (in fact, you could do this for all objects). This gets really tricky when there are multiple references to an object. For example, the GC could have already followed one reference and scanned the object past that point, only to later find a reference that points earlier in the object and either have to scan it again or track the point in each object it last scanned from. It’s all theoretically doable, but it wouldn’t pay for its own costs in most cases.

If you want GC to collect the unreferenced elements, set q[0] to nil before slicing it. But that’s not going to help with the large allocations.
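
For illustration, a minimal sketch of that suggestion; the entity type and dropFirst helper are made up here, standing in for the q from the discussion above:

package main

import "fmt"

// entity stands in for whatever the cached slice elements point to.
type entity struct{ payload [1 << 20]byte }

// dropFirst re-slices q past its first element. The GC doesn't track partial
// objects, so the whole backing array stays reachable through q[1:], and with
// it *q[0]. Clearing the element first removes that reference and lets the GC
// collect the pointed-to entity (the backing array itself is still retained).
func dropFirst(q []*entity) []*entity {
	q[0] = nil // break the reference held by the backing array
	return q[1:]
}

func main() {
	q := []*entity{new(entity), new(entity)}
	q = dropFirst(q)
	fmt.Println("remaining elements:", len(q))
}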

To be clear, the reproducer is making a 3.2 GiB allocation which is going to take a while for the OS to fulfill.
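
(For context, that figure follows from the reproducer: 100e6 elements × 32 bytes per Rates value ≈ 3.2 GB.)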

However, I dug into this a bit, and I’m seeing ~1s pauses, and sometimes only ~50ms pauses.

The 50ms pauses represent a lower bound for how fast the OS can give us back pages. The ~1s pauses actually come from the fact that we’re zeroing the entire 3.2 GiB allocation, even though it just came from the OS.

We have an optimization which avoids re-zeroing fresh pages, but unfortunately this optimization is conservative. For example, if a needzero span at the end of the heap is contiguous with the new 3.2 GiB allocation, then those two are going to coalesce, and the 3.2 GiB free space will be marked needzero, when really only the first N KiB of it actually needs to be zeroed.

One way we could fix this is by making needzero a number, and if we can guarantee that only the first N KiB need zeroing, then we only zero that. If it’s some weird swiss cheese of non-zeroed memory then we just assume the whole thing needs zeroing, rather than trying to keep track of which parts need zeroing.
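
A toy sketch of that bookkeeping, with made-up types rather than the real runtime structures:

package main

import "fmt"

// span models a run of free heap pages. Instead of a boolean needzero flag,
// it records how many leading bytes may be dirty; the fresh-from-the-OS tail
// is assumed to already be zero.
type span struct {
	mem        []byte
	dirtyBytes int // only the first dirtyBytes may be non-zero
}

// prepare clears just the dirty prefix before the memory is handed out,
// instead of re-zeroing the entire (mostly already zero) region.
func (s *span) prepare() {
	for i := 0; i < s.dirtyBytes; i++ {
		s.mem[i] = 0
	}
	s.dirtyBytes = 0
}

func main() {
	s := &span{mem: make([]byte, 1<<20), dirtyBytes: 4 << 10}
	s.prepare()
	fmt.Println("cleared prefix only; total bytes:", len(s.mem))
}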

I’m not sure if this fix is worth it though, since future allocations would still have to go back and zero this memory anyway, causing the 1s delay to return.

Also, I think it’s too late in this cycle to do anything here for Go 1.21, unfortunately (clearing memory that will soon contain pointers is subtle), but I will do something here for 1.22 early-in-cycle.

We’ve brainstormed a few ideas and I feel pretty OK about the idea of faking a span as noscan until its allocation is ready to be returned. We already do this kind of faking elsewhere, so we’re at least not adding too much new complexity.

I did some thinking about breaking up the zeroing of large pointerful objects. I think it’s doable, but it’s pretty subtle, and I might be missing something (there may be some GC interaction I’m not seeing, specifically).

It would look something like this:

  1. Have spanAlloc set the initial class for a large scan span to noscan.
  2. spanAlloc does an atomic span state write, then publishes the span to the GC, as usual.
  3. After returning from spanAlloc (via largeAlloc), mallocgc becomes preemptible.
  4. mallocgc zeroes the span.
  5. mallocgc becomes non-preemptible again.
  6. mallocgc writes out the pointer bitmap.
  7. mallocgc sets the span as scannable.

There’s an additional detail here that we don’t update freeIndex until the second non-preemptible section, I think.
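
As a runnable toy sketch of the sequence above (every identifier here is a stand-in, not the actual runtime API):

package main

import "fmt"

// Everything below is a stand-in used only to make the control flow concrete;
// none of these names are the real runtime API.

type span struct {
	mem       []byte
	scannable bool
}

func spanAllocNoscan(size int) *span { return &span{mem: make([]byte, size)} }
func becomePreemptible()             {}
func becomeNonPreemptible()          {}
func writeHeapBits(s *span)          {}

// largeScanAlloc sketches the proposed mallocgc flow for a large pointerful
// allocation: publish the span as noscan, zero it while preemptible, then
// flip it to scannable only once its pointer bitmap is in place.
func largeScanAlloc(size int) *span {
	s := spanAllocNoscan(size) // steps 1-2: span published to the GC as noscan

	becomePreemptible() // step 3
	for i := range s.mem {
		s.mem[i] = 0 // step 4: the long zeroing work, now preemptible
	}
	becomeNonPreemptible() // step 5

	writeHeapBits(s)   // step 6: pointer bitmap (and the freeIndex update) go here
	s.scannable = true // step 7: the GC may now scan the fully initialized span
	return s
}

func main() {
	s := largeScanAlloc(1 << 20)
	fmt.Println("allocated", len(s.mem), "bytes; scannable:", s.scannable)
}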

Setting the span as scannable late and non-atomically works because if it happens during a GC cycle (or if we transition into a GC cycle while zeroing), we’ll mark the object before returning, so it’ll never need to get scanned (and thus its class doesn’t matter). STWs are also effectively global memory barriers, so GC transitions are safe.

I think the only downside is how subtle this all is.

Just commenting since we’re also seeing this issue on some services internally. The worst outliers can be up to 15s:

[screenshot]

Here is some detail on a 2s pause we were able to catch while running an execution trace.

[execution trace screenshot]

This isn’t critical for us and we think we know how we could mitigate the problem by changing our code. I’m just commenting since it came up in a discussion with @prattmic today. My colleague @pmbauer did the investigation on this.

@mknyszek

I am using a tip version: go version devel go1.18-527609d47b Wed Aug 25 17:07:58 2021 +0200 darwin/arm64. The profile was collected from an app running on iOS.

The GC pauses reported by the runtime/metrics package are as follows:

1.2032e-05:	*
2.2528e-05:	*
3.2768e-05:	**
3.3792e-05:	***
3.4816e-05:	***
3.584e-05:	**
3.6864e-05:	**
3.7888e-05:	**
3.8912e-05:	*
3.9936e-05:	*
4.1984e-05:	*
4.608e-05:	*
4.7104e-05:	*
8.3968e-05:	*
0.000139264:	*
0.010747904:	*
0.012320768:	*
0.017301504:	*
0.01835008:	**
0.019398656:	**
0.019922944:	*
0.020447232:	*
0.02097152:	*
0.021495808:	**
0.022020096:	**
0.02359296:	*
0.037748736:	*
0.040894464:	*
0.042991616:	*
0.044040192:	*
0.04718592:	*
0.049283072:	*
0.071303168:	**
0.075497472:	***
0.081788928:	*
0.08388608:	*
0.090177536:	*
0.096468992:	**
0.142606336:	*
0.14680064:	**
1.476395008:	*
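
For reference, a minimal sketch of how a pause histogram like this can be read with the runtime/metrics package; /gc/pauses:seconds is the package’s histogram of GC-related stop-the-world pause latencies:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// /gc/pauses:seconds is a histogram of GC-related stop-the-world
	// pause latencies, the same data summarized in the buckets above.
	samples := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(samples)

	h := samples[0].Value.Float64Histogram()
	for i, count := range h.Counts {
		if count == 0 {
			continue
		}
		// Buckets[i] is the lower bound of bucket i, in seconds.
		fmt.Printf("%v:\t%d\n", h.Buckets[i], count)
	}
}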

Examples:

GC idle took 0.02 seconds:

[trace screenshot]

GC fractional took 1.2 seconds and 1.8 seconds:

[trace screenshots]

Sweep termination took 1.8 seconds:

[trace screenshot]

I hope I found good examples; if not, there are more examples in the trace profile:

https://drive.google.com/file/d/1OaMruMiu8RWBXclQDRuf7MGDkzjkR4PP/view?usp=sharing


I had a few runs where the maximum observation was 4 seconds (I didn’t collect a trace profile for this case, but the reproducer should be able to reproduce it):

[screenshot]


The reproducer (depends on the develop branch at commit f52dd5f2461b4b6396df83a6a723037f74a75a72):

package main

import (
	"image/color"
	"time"

	"fyne.io/fyne/v2"
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/canvas"
	"fyne.io/fyne/v2/container"
	"fyne.io/fyne/v2/layout"
	"fyne.io/fyne/v2/test"
	"fyne.io/fyne/v2/widget"
)

func main() {
	app := app.NewWithID("iosrend")
	win := app.NewWindow("Some Window")

	cells := make([]fyne.CanvasObject, 0, 80*25)
	for i := 0; i < 80; i++ {
		for j := 0; j < 25; j++ {
			cells = append(cells, canvas.NewRectangle(color.RGBA{R: 255, G: 255, B: 255, A: 128}))
		}
	}
	container := container.New(layout.NewGridLayout(25), cells...)

	c := 0
	b := widget.NewButton("START", func() {
		for i := 0; i < 80; i++ {
			for j := 0; j < 25; j++ {
				index := i*25 + j
				obj, _ := container.Objects[index].(*canvas.Rectangle)
				if c == 0 {
					obj.FillColor = color.RGBA{R: 20, G: 30, B: 100, A: 128}
				} else if c == 1 {
					obj.FillColor = color.RGBA{R: 100, G: 20, B: 30, A: 128}
				} else if c == 2 {
					obj.FillColor = color.RGBA{R: 30, G: 100, B: 20, A: 128}
				}
				obj.Refresh()
			}
		}
		c = (c + 1) % 3
	})
	go func() {
		tk := time.NewTicker(50 * time.Millisecond)
		for range tk.C {
			test.Tap(b)
		}
	}()

	win.SetContent(fyne.NewContainerWithLayout(layout.NewBorderLayout(nil, b, nil, nil), container, b))
	win.ShowAndRun()
}