go: runtime: mark assist blocks GC microbenchmark for 7ms
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (`go version`)?
1.11
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (`go env`)?
OSX (also observed on Linux), both AMD64.
What did you do?
This singlethreaded microbenchmark measures GC latency to allocate a byte slice and store it in a big circular buffer, repeating the operation 5 times the size of the big circular buffer (i.e. one initialization and four reuses). Live memory is about 210MB, in the form of the circular buffer (200,000 slice elements) and 200,000 pointer-free buffers of size 1k.
This version of the benchmark is instrumented to collect a trace, because that’s how we figured out that it was mark assist.
The original benchmark source is https://github.com/WillSewell/gc-latency-experiment, adapted here to be more instrumented and more plain-spoken about what it is doing.
```go
package main

import (
	"fmt"
	"os"
	"runtime/trace"
	"time"
	"unsafe"
)

const (
	bufferLen = 200000
	msgCount  = 1000000
)

type kbyte []byte
type circularBuffer [bufferLen]kbyte

var worst time.Duration

// For making sense of the bad outcome.
var total time.Duration
var worstIndex int
var allStart time.Time
var worstElapsed time.Duration

// newSlice allocates a 1k slice of bytes and initializes them all to byte(n)
func newSlice(n int) kbyte {
	m := make(kbyte, 1024)
	for i := range m {
		m[i] = byte(n)
	}
	return m
}

// storeSlice stores a newly allocated 1k slice of bytes at circularBuffer[count%bufferLen]
// (bufferLen is the length of the array circularBuffer)
// It also checks the time needed to do this and records the worst case.
func storeSlice(c *circularBuffer, highID int) {
	start := time.Now()
	c[highID%bufferLen] = newSlice(highID)
	elapsed := time.Since(start)
	candElapsed := time.Since(allStart) // Record location of worst in trace
	if elapsed > worst {
		worst = elapsed
		worstIndex = highID
		worstElapsed = candElapsed
	}
	total = time.Duration(total.Nanoseconds() + elapsed.Nanoseconds())
}

func main() {
	trace.Start(os.Stderr)
	allStart = time.Now()
	defer trace.Stop()
	var c circularBuffer
	for i := 0; i < msgCount; i++ {
		storeSlice(&c, i)
	}
	fmt.Println("Worst push latency: ", worst)
	fmt.Println("Worst push index: ", worstIndex)
	fmt.Println("Worst push occurs at run elapsed time: ", worstElapsed)
	fmt.Println("Average push latency: ", time.Duration(total.Nanoseconds()/msgCount))
	fmt.Println("Sizeof(circularBuffer) = ", unsafe.Sizeof(c))
	fmt.Println("Approximate live memory = ", unsafe.Sizeof(c)+bufferLen*1024)
}
```
What did you expect to see?
I expected to see a sub-millisecond worst-case latency; the average time without GC to initialize a new slice is about a microsecond (on a 2017 Mac laptop).
What did you see instead?
Worst-case latencies on the order of 4-10ms.
I’ve attached the trace file for the following run:

```
go run hello.go 2> trace2.out
Worst push latency:  5.193319ms
Worst push index:  780507
Worst push occurs at run elapsed time:  995.334915ms
Average push latency:  1.018µs
Sizeof(circularBuffer) =  4800000
Approximate live memory =  209600000
```
The worst case latency ends at 995ms, corresponding to a single 5ms mark assist. A zoom of the trace displaying this is also attached.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Comments: 18 (12 by maintainers)
Commits related to this issue
- runtime: look for idle p to run current goroutine when switching to GC or traceReader This repairs one of the several causes of pauses uncovered by a GC microbenchmark. A pause can occur when a goro... — committed to golang/go by dr2chase 6 years ago
- fix (#1) * cmd/compile: add unsigned divisibility rules "Division by invariant integers using multiplication" paper by Granlund and Montgomery contains a method for directly computing divisibili... — committed to kiku-jw/go by kiku-jw 5 years ago
- cmd/compile: add -smallframes gc flag for GC latency diagnosis Shrinks the size of things that can be stack allocated from 10M to 128k for declared variables and from 64k to 16k for implicit allocati... — committed to golang/go by dr2chase 5 years ago
- cmd/compile: correct capitalization in recordFlags parameter Tool refactoring smallStacks into smallFrames helpfully "corrected" the capitalization in a string, this undoes the help. This is necessa... — committed to golang/go by dr2chase 5 years ago
- runtime: export total GC Assist ns in MemStats and GCStats At a high level, the runtime garbage collector can impact user goroutine latency in two ways. The first is that it pauses all goroutines dur... — committed to nvanbenschoten/go by nvanbenschoten 2 years ago
- benchmarks/gc_latency: add new microbenchmark for tricky mutator latency issues Gc_latency is a modified version of a program that tickled multiple latency glitches in the Go GC/runtime. This versio... — committed to golang/benchmarks by dr2chase 3 years ago
A bit more info: it turns out that in the original, the 4800000 byte circular buffer was allocated on the stack, and large stack frames are not handled incrementally in the same way that large objects are.
Modifying the benchmark to allocate the circular buffer and store the pointer in a global, the latency falls to 2ms, which is better, though still far worse than expected. In the snapshot from the trace, you can see that the GC work is now shared among several threads, but the worker thread is 100% running mark assist during that interval.
A different modification, to move the declaration of var c circularBuffer to a global, also shortens the worst case latency in the same way, also with the same 100% mark assist for about 2ms.
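The two modifications described above amount to the same one-line change; a minimal sketch (only the buffer declaration differs from the posted benchmark):

```go
package main

import "fmt"

const bufferLen = 200000

type kbyte []byte
type circularBuffer [bufferLen]kbyte

// Allocating the ~4.8MB array with new and keeping the pointer in a
// global (or simply declaring "var c circularBuffer" at package scope)
// moves it off main's stack frame. Large heap objects are marked
// incrementally; large stack frames are not, which is what made the
// original worst case so long.
var c = new(circularBuffer)

func main() {
	// Indexing through the pointer works exactly as before, so
	// storeSlice and the rest of the benchmark need no changes.
	c[0] = make(kbyte, 1024)
	fmt.Println(len(c), len(c[0]))
}
```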
Still to investigate:
Once mark assist is dealt with, this microbenchmark is likely to have problems with long sweeps, a different, known bug ( #18155 ) that I’ve also seen in some of these traces. That looks like: [trace screenshot attached in the original issue]
Change https://golang.org/cl/372256 mentions this issue:
benchmarks/gc_latency: add new microbenchmark for tricky mutator latency issues
@LK4D4 I just submitted a change to 1.13 tip that might help you diagnose this (depending on the complexity of your build process). Crudely,

```
go build -gcflags 'all=-smallframes' hello.go
```

will force larger things to be allocated on the heap instead of the stack, which works around two flaws in allocating and accounting for mark assist work (flaw 1: stack scanning is a large, indivisible chunk of work; flaw 2: the mark assistant gets zero credit for doing stack scanning). If this makes your latency get better, the next problem is to figure out which stack frames cause the problem. Once you know that, you can disable stack allocation by leaking the address of the giant object (store it to a global and later nil the global, or call a no-op function through a global pointer, passing it the address; that is friendlier to the race detector). You can look for differences by comparing the output of `-gcflags 'all=-m -smallframes'` with `-gcflags 'all=-m'`. That will be horribly verbose, though the differences should not be. Anything appearing in the differences is a somewhat-large allocation. Austin’s unproven-but-plausible theory for this is that between 1.10 and 1.12 one of the incremental improvements in escape analysis let something large be allocated on the stack.
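The address-leaking trick can be sketched as follows (the names `sink`, `big`, and `work` are illustrative, not from the issue):

```go
package main

import "fmt"

// sink is a no-op function reached through a global pointer. Passing a
// variable's address to it "leaks" the address, so escape analysis must
// heap-allocate the variable. This is the race-detector-friendly
// alternative to storing the pointer in a global and nilling it later.
var sink = func(p interface{}) {}

type big [1 << 20]byte // a frame-sized object we want off the stack

func work() byte {
	var b big
	sink(&b) // b's address escapes, so b is heap-allocated
	b[0] = 42
	return b[0]
}

func main() {
	fmt.Println(work())
}
```

Running the build with `-gcflags 'all=-m'` before and after adding the `sink` call should show the allocation move from the stack to the heap ("moved to heap" in the escape-analysis output).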