go: runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use
What version of Go are you using (go version)?
$ go1.13 version go version go1.13.3 darwin/amd64
Does this issue reproduce with the latest release?
Yes, this is present in Go 1.13 and in current tip with Linux 4.14. It seems to exist even back in Go 1.4, and with Linux 3.2 and 2.6.32.
Am I holding it wrong?
Around Go 1.6, a lot of the tests in runtime/pprof were called out as being flaky. It looks like around that same time, the builders got an overhaul. Maybe they moved to machines with more CPU cores than before, and the increase in flakiness was due to some SIGPROF deliveries being skipped?
The tests in runtime/pprof both now and around Go 1.6 seem to compare parts of the profile to itself, but not to the CPU usage reported by the operating system. If this is a real bug, those tests would not have discovered it.
What operating system and processor architecture are you using (go env)?
I’m compiling on darwin/amd64 and running on linux/amd64.
go env Output
$ go env GO111MODULE="" GOARCH="amd64" GOBIN="" GOCACHE="/Users/rhys/Library/Caches/go-build" GOENV="/Users/rhys/Library/Application Support/go/env" GOEXE="" GOFLAGS="" GOHOSTARCH="amd64" GOHOSTOS="darwin" GONOPROXY="*" GONOSUMDB="*" GOOS="darwin" GOPATH="/Users/rhys/go" GOPRIVATE="*" GOPROXY="direct" GOROOT="/usr/local/go" GOSUMDB="off" GOTMPDIR="" GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64" GCCGO="gccgo" AR="ar" CC="clang" CXX="clang++" CGO_ENABLED="1" GOMOD="" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/49/zmds5zsn75z1283vtzxyfr5hj7yjq4/T/go-build343144681=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
I looked at a CPU profile for a Go program running on a Linux machine where top reported 20 cores of CPU usage (2000%), but the profile showed about 240% / 2.4 cores of usage.
I ran a test with the -test.cpuprofile flag on a Linux machine and compared the results of the time shell built-in with go tool pprof’s view of the usage. I varied the rate that the program asked the kernel to deliver SIGPROF and found that the two measurements agreed on the number of CPU cycles spent as long as there were fewer than 250 SIGPROF deliveries per second.
I ran the test under perf stat -e 'signal:*' and found that its count of signal:signal_generate events lined up with the number of SIGPROF deliveries I’d expect, that its count of signal:signal_deliver events lined up with the number of samples in the CPU profile, and that the two matched well only when the “generate” rate was less than 250 samples per second.
Here, the test uses 96 vCPUs of a machine with 96 hyperthreads for 10 seconds, using the Go runtime’s default profile rate of 100 Hz. The Linux kernel generates slightly less than 96,000 signals (which are probably all SIGPROF). The time built-in reports slightly less than 16 minutes (960 seconds) of “user” CPU. That’s good.
The resulting profile shows 10.20 seconds of wall-clock time and 1.61 minutes (about 96.6 seconds) of CPU time, or about 9660 samples at 100 Hz. That’s close to the number of signals that the kernel reports it delivered to the program, but that doesn’t match the number generated by the kernel or the actual CPU time spent.
$ time sudo perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
PASS
 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':
            95,690      signal:signal_generate                                      
             9,688      signal:signal_deliver                                       
      10.211210687 seconds time elapsed
real	0m10.316s
user	15m57.840s
sys	0m0.045s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:16pm (PDT)
Duration: 10.20s, Total samples = 1.61mins (949.83%)
Showing nodes accounting for 1.61mins, 100% of 1.61mins total
Dropped 13 nodes (cum <= 0.01mins)
      flat  flat%   sum%        cum   cum%
  1.61mins   100%   100%   1.61mins   100%  command-line-arguments.cpuHog1
         0     0%   100%   1.61mins   100%  command-line-arguments.TestCPUProfile.func1
         0     0%   100%   1.61mins   100%  command-line-arguments.cpuHogger
Calling runtime.SetCPUProfileRate with “2 Hz” right before the testing package’s CPU profile starts lets me dial the profile rate down to less than 250 Hz process-wide. (The warning message seems harmless in this case.) This leads to the kernel’s measurements of “signal:signal_generate” and “signal:signal_deliver” matching each other, and for go tool pprof’s measurement of “15.94mins” to come very close to what the time built-in sees at “user	15m57.048s”.
$ time sudo env PROFILE_HZ=2 perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
runtime: cannot set cpu profile rate until previous profile has finished.
PASS
 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':
             1,913      signal:signal_generate                                      
             1,913      signal:signal_deliver                                       
      10.120272912 seconds time elapsed
real	0m10.221s
user	15m57.048s
sys	0m0.378s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:17pm (PDT)
Duration: 10.11s, Total samples = 15.94mins (9464.52%)
Showing nodes accounting for 15.93mins, 99.90% of 15.94mins total
Dropped 1 node (cum <= 0.08mins)
      flat  flat%   sum%        cum   cum%
 15.93mins 99.90% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHog1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.TestCPUProfile.func1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHogger
I confirmed that the kernel was configured with high-resolution timers as recommended in #13841.
$ uname -a
Linux ip-172-31-18-196.us-west-2.compute.internal 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /boot/config-$(uname -r) | grep CONFIG_HIGH_RES_TIMERS
CONFIG_HIGH_RES_TIMERS=y
$ cat /boot/config-$(uname -r) | grep ^CONFIG_HZ
CONFIG_HZ_250=y
CONFIG_HZ=250
I’ve seen this effect both on virtual machines and on physical hardware. (Most of my follow-up testing has taken place on virtual machines.)
What did you expect to see?
I expected the number of seconds of CPU time reported by go tool pprof to align with the number of seconds of CPU time observed by the kernel.
When I run go tool pprof, I expect the time reported in the “Duration” line (like “Duration: 5.11s, Total samples = 8.50s (166.40%)”) to match what I’d see from looking at a tool like top at the same time.
What did you see instead?
The Linux kernel seems to drop SIGPROF events when they come more than 250 times per second. I don’t know if it drops them fairly—the profiles might be skewed.
Open questions
Is there a simple setting that my coworkers and I are missing? I’ve reproduced this with vanilla machine images for Ubuntu and Amazon Linux 2.
Is the right move for runtime.SetCPUProfileRate to limit its input to 250 / GOMAXPROCS?
Does the number “250” come from Linux’s CONFIG_HZ_250=y / CONFIG_HZ=250, and is it right for that configuration to end up compiled in to Go?
Thanks!
Here’s the test program:
package repro
import (
	"os"
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)
var sink int64
func init() {
	hz, _ := strconv.Atoi(os.Getenv("PROFILE_HZ"))
	if hz > 0 {
		runtime.SetCPUProfileRate(hz)
	}
}
func TestCPUProfile(t *testing.T) {
	workers := runtime.GOMAXPROCS(0)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var v int
			cpuHogger(cpuHog1, &v, 10*time.Second)
			atomic.StoreInt64(&sink, int64(v))
		}()
	}
	wg.Wait()
}
// cpuHogger and cpuHog1 from src/runtime/pprof/pprof_test.go
func cpuHogger(f func(x int) int, y *int, dur time.Duration) {
	// We only need to get one 100 Hz clock tick, so we've got
	// a large safety buffer.
	// But do at least 500 iterations (which should take about 100ms),
	// otherwise TestCPUProfileMultithreaded can fail if only one
	// thread is scheduled during the testing period.
	t0 := time.Now()
	accum := *y
	for i := 0; i < 500 || time.Since(t0) < dur; i++ {
		accum = f(accum)
	}
	*y = accum
}
// The actual CPU hogging function.
// Must not call other functions nor access heap/globals in the loop,
// otherwise under race detector the samples will be in the race runtime.
func cpuHog1(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 1
		}
	}
	return foo
}
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 4
- Comments: 43 (36 by maintainers)
Commits related to this issue
- Add code from rhys see https://github.com/golang/go/issues/35057 — committed to felixge/dump by felixge 3 years ago
- runtime: add timer_create syscalls for Linux Updates #35057 Change-Id: Id702b502fa4e4005ba1e450a945bc4420a8a8b8c Reviewed-on: https://go-review.googlesource.com/c/go/+/342052 Run-TryBot: Rhys Hiltne... — committed to golang/go by rhysh 3 years ago
- runtime: allow per-OS changes to unix profiler Updates #35057 Change-Id: I56ea8f4750022847f0866c85e237a2cea40e0ff7 Reviewed-on: https://go-review.googlesource.com/c/go/+/342053 Run-TryBot: Rhys Hilt... — committed to golang/go by rhysh 3 years ago
- runtime: profile with per-thread timers on Linux Using setitimer on Linux to request SIGPROF signal deliveries in proportion to the process's on-CPU time results in under-reporting when the program u... — committed to golang/go by rhysh 3 years ago
- runtime: move sigprofNonGo The sigprofNonGo and sigprofNonGoPC functions are only used on unix-like platforms. In preparation for unix-specific changes to sigprofNonGo, move it (plus its close relati... — committed to golang/go by rhysh 3 years ago
- runtime: use per-thread profiler for SetCgoTraceback platforms Updates #35057 Change-Id: I61d772a2cbfb27540fb70c14676c68593076ca94 Reviewed-on: https://go-review.googlesource.com/c/go/+/342054 Run-T... — committed to golang/go by rhysh 3 years ago
Actually, the new timers aren’t in use on amd64 until http://golang.org/cl/342054, so it makes sense that the test was failing (i.e., it was added a bit too early). Since the CL is in, we don’t expect any more failures.
While we’re at it, I believe this issue is now complete?
I did the work to split Linux profiling out from the generic Unix profiling code, and added the syscalls to all the architectures that Go+Linux supports, so CL324129 PS3 now (1) compiles and (2) works well enough to run my reproducer test from the top of this issue. Setting
GODEBUG=pproftimercreate=1enables thetimer_create/CLOCK_THREAD_CPUTIME_ID-based profiler. It seems to work for GOARCH of 386, amd64, arm, and arm64.Here are some results I observed on a 64-core arm64 / Aarch64 machine (with 64 real cores and no hyperthreading / SMT). Recall that “cpuHog1” would be better named “work done serially on one thread”, that “cpuHog2” is “work done across all available threads in parallel”, and that those functions each do the same amount of total work. The CL needs some refinement (what to do when a new M starts during profiling, etc), but I think this shows that (1) the skew in
setitimercan be severe on readily-available hardware (here, a c6g.metal on EC2), and (2)timer_createseems to fix it.CC @felixge
First with PS3 but GODEBUG=pproftimercreate=0 to get old setitimer profiling:
Then with the same executable with the new timer_create profiler via GODEBUG=pproftimercreate=1:
I don’t think we need to bother. It will be pretty obvious in tests if the flag is disabled.
@rhysh thanks for all the detailed analysis on this. I think there is quite a compelling argument, and that we probably want this. I plan to review your CLs this week (sorry for the delay! Just got back from vacation.).
A few notes/thoughts I have: