go: runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use

What version of Go are you using (go version)?

$ go1.13 version
go version go1.13.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes, this is present in Go 1.13 and in current tip with Linux 4.14. It seems to exist even back in Go 1.4, and with Linux 3.2 and 2.6.32.

Am I holding it wrong?

Around Go 1.6, a lot of the tests in runtime/pprof were called out as being flaky. It looks like around that same time, the builders got an overhaul. Maybe they moved to machines with more CPU cores than before, and the increase in flakiness was due to some SIGPROF deliveries being skipped?

The tests in runtime/pprof, both now and around Go 1.6, seem to compare parts of the profile to each other, but not to the CPU usage reported by the operating system. If this is a real bug, those tests would not have discovered it.
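For what it’s worth, the operating system’s side of that comparison is cheap to collect. Here’s a minimal sketch of how a test could read the kernel’s view of user CPU time before and after the profiled region (it uses the portable syscall package; comparing the delta against the profile’s sample count is left to go tool pprof):

package repro

import (
	"syscall"
	"testing"
	"time"
)

// osUserCPUTime returns the process's user-mode CPU time as reported by
// the kernel via getrusage(2). Recording it before and after the profiled
// region gives a number to check the profile's "Total samples" against.
func osUserCPUTime(t *testing.T) time.Duration {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		t.Fatal(err)
	}
	return time.Duration(ru.Utime.Sec)*time.Second +
		time.Duration(ru.Utime.Usec)*time.Microsecond
}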

What operating system and processor architecture are you using (go env)?

I’m compiling on darwin/amd64 and running on linux/amd64.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/usr/local/go"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/49/zmds5zsn75z1283vtzxyfr5hj7yjq4/T/go-build343144681=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I looked at a CPU profile for a Go program running on a Linux machine where top reported 20 cores of CPU usage (2000%), but the profile showed about 240% / 2.4 cores of usage.

I ran a test with the -test.cpuprofile flag on a Linux machine and compared the results of the time shell built-in with go tool pprof’s view of the usage. I varied the rate at which the program asked the kernel to deliver SIGPROF (via runtime.SetCPUProfileRate; see the test program at the end of this report) and found that the two measurements agreed on the amount of CPU time spent as long as there were fewer than 250 SIGPROF deliveries per second.

I ran the test under perf stat -e 'signal:*' and found that its count of signal:signal_generate events lined up with the number of SIGPROF deliveries I’d expect, that its count of signal:signal_deliver events lined up with the number of samples in the CPU profile, and that the two matched well only when the “generate” rate was less than 250 samples per second.


Here, the test uses 96 vCPUs of a machine with 96 hyperthreads for 10 seconds, using the Go runtime’s default profile rate of 100 Hz. The Linux kernel generates slightly less than 96,000 signals (which are probably all SIGPROF). The time built-in reports slightly less than 16 minutes (960 seconds) of “user” CPU. That’s good.

The resulting profile shows 10.20 seconds of wall-clock time and 1.61 minutes (about 96.6 seconds) of CPU time, or about 9,660 samples at 100 Hz. That’s close to the 9,688 signals the kernel reports it delivered to the program, but it matches neither the number the kernel generated nor the actual CPU time spent.

$ time sudo perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
PASS

 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':

            95,690      signal:signal_generate                                      
             9,688      signal:signal_deliver                                       

      10.211210687 seconds time elapsed


real	0m10.316s
user	15m57.840s
sys	0m0.045s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:16pm (PDT)
Duration: 10.20s, Total samples = 1.61mins (949.83%)
Showing nodes accounting for 1.61mins, 100% of 1.61mins total
Dropped 13 nodes (cum <= 0.01mins)
      flat  flat%   sum%        cum   cum%
  1.61mins   100%   100%   1.61mins   100%  command-line-arguments.cpuHog1
         0     0%   100%   1.61mins   100%  command-line-arguments.TestCPUProfile.func1
         0     0%   100%   1.61mins   100%  command-line-arguments.cpuHogger
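To put numbers on the mismatch, here’s the arithmetic as a tiny program (counts copied from the perf and pprof output above):

package main

import "fmt"

func main() {
	const (
		cores   = 96  // busy vCPUs
		rateHz  = 100 // default SIGPROF rate per core
		seconds = 10  // wall-clock duration of the test
	)
	expected := cores * rateHz * seconds // 96,000 signals expected
	generated := 95690                   // signal:signal_generate from perf
	delivered := 9688                    // signal:signal_deliver from perf
	fmt.Printf("generated %.1f%% of expected, delivered %.1f%% of generated\n",
		100*float64(generated)/float64(expected),
		100*float64(delivered)/float64(generated))
	// Prints: generated 99.7% of expected, delivered 10.1% of generated
}

In other words, the kernel generates nearly all of the expected signals but delivers only about a tenth of them; the rest never reach the process.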

Calling runtime.SetCPUProfileRate with “2 Hz” right before the testing package’s CPU profile starts lets me dial the process-wide SIGPROF rate down below 250 per second. (The warning message this prints seems harmless in this case.) With that change, the kernel’s counts of “signal:signal_generate” and “signal:signal_deliver” match each other, and go tool pprof’s measurement of “15.94mins” comes very close to what the time built-in sees at “user 15m57.048s”.
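My understanding of the sequencing behind that warning, as a sketch (this is my reading of the runtime’s behavior, so treat it as an assumption): whichever call enables profiling first sets the rate, and later attempts to change it are rejected until profiling stops.

package repro

import "runtime"

func init() {
	// Runs before the testing package starts its CPU profile, so the
	// per-thread SIGPROF rate becomes 2 Hz instead of the default 100 Hz.
	runtime.SetCPUProfileRate(2)
	// When -test.cpuprofile later triggers pprof.StartCPUProfile, its
	// internal attempt to set the rate to 100 Hz prints
	//   runtime: cannot set cpu profile rate until previous profile has finished.
	// and the 2 Hz setting stays in effect for the whole run.
}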

$ time sudo env PROFILE_HZ=2 perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
runtime: cannot set cpu profile rate until previous profile has finished.
PASS

 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':

             1,913      signal:signal_generate                                      
             1,913      signal:signal_deliver                                       

      10.120272912 seconds time elapsed


real	0m10.221s
user	15m57.048s
sys	0m0.378s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:17pm (PDT)
Duration: 10.11s, Total samples = 15.94mins (9464.52%)
Showing nodes accounting for 15.93mins, 99.90% of 15.94mins total
Dropped 1 node (cum <= 0.08mins)
      flat  flat%   sum%        cum   cum%
 15.93mins 99.90% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHog1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.TestCPUProfile.func1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHogger

I confirmed that the kernel was configured with high-resolution timers as recommended in #13841.

$ uname -a
Linux ip-172-31-18-196.us-west-2.compute.internal 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /boot/config-$(uname -r) | grep CONFIG_HIGH_RES_TIMERS
CONFIG_HIGH_RES_TIMERS=y
$ cat /boot/config-$(uname -r) | grep ^CONFIG_HZ
CONFIG_HZ_250=y
CONFIG_HZ=250

I’ve seen this effect both on virtual machines and on physical hardware. (Most of my follow-up testing has taken place on virtual machines.)

What did you expect to see?

I expected the number of seconds of CPU time reported by go tool pprof to align with the number of seconds of CPU time observed by the kernel.

When I run go tool pprof, I expect the time reported in the “Duration” line (like “Duration: 5.11s, Total samples = 8.50s (166.40%)”) to match what I’d see from looking at a tool like top at the same time.

What did you see instead?

The Linux kernel seems to drop SIGPROF events when they are generated more than 250 times per second. I don’t know whether it drops them fairly; the profiles might be skewed.


Open questions

Is there a simple setting that my coworkers and I are missing? I’ve reproduced this with vanilla machine images for Ubuntu and Amazon Linux 2.

Is the right move for runtime.SetCPUProfileRate to limit its input to 250 / GOMAXPROCS?

Does the number “250” come from Linux’s CONFIG_HZ_250=y / CONFIG_HZ=250, and is it right for that configuration to end up compiled into Go?
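For the middle question, the clamp I have in mind would look something like this (purely hypothetical, written out only to make the question concrete; it is not what the runtime does):

package repro

import "runtime"

// clampedProfileRate is a hypothetical sketch. With setitimer's
// per-process ITIMER_PROF timer on a CONFIG_HZ=250 kernel, keeping the
// whole process under ~250 generated SIGPROFs per second means dividing
// that budget by the number of CPUs the process can keep busy.
func clampedProfileRate(requested int) int {
	limit := 250 / runtime.GOMAXPROCS(0)
	if limit < 1 {
		limit = 1 // more threads than the budget covers; best effort
	}
	if requested > limit {
		return limit
	}
	return requested
}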

Thanks!


Here’s the test program:

package repro

import (
	"os"
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

var sink int64

func init() {
	hz, _ := strconv.Atoi(os.Getenv("PROFILE_HZ"))
	if hz > 0 {
		runtime.SetCPUProfileRate(hz)
	}
}

func TestCPUProfile(t *testing.T) {
	workers := runtime.GOMAXPROCS(0)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			var v int
			cpuHogger(cpuHog1, &v, 10*time.Second)
			atomic.StoreInt64(&sink, int64(v))
		}()
	}

	wg.Wait()
}

// cpuHogger and cpuHog1 from src/runtime/pprof/pprof_test.go

func cpuHogger(f func(x int) int, y *int, dur time.Duration) {
	// We only need to get one 100 Hz clock tick, so we've got
	// a large safety buffer.
	// But do at least 500 iterations (which should take about 100ms),
	// otherwise TestCPUProfileMultithreaded can fail if only one
	// thread is scheduled during the testing period.
	t0 := time.Now()
	accum := *y
	for i := 0; i < 500 || time.Since(t0) < dur; i++ {
		accum = f(accum)
	}
	*y = accum
}

// The actual CPU hogging function.
// Must not call other functions nor access heap/globals in the loop,
// otherwise under race detector the samples will be in the race runtime.
func cpuHog1(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 1
		}
	}
	return foo
}

Most upvoted comments

Actually, the new timers aren’t in use on amd64 until http://golang.org/cl/342054, so it makes sense that the test was failing (i.e., it was added a bit too early). Since the CL is in, we don’t expect any more failures.

While we’re at it, I believe this issue is now complete?

I did the work to split Linux profiling out from the generic Unix profiling code, and added the syscalls to all the architectures that Go+Linux supports, so CL324129 PS3 now (1) compiles and (2) works well enough to run my reproducer test from the top of this issue. Setting GODEBUG=pproftimercreate=1 enables the timer_create/CLOCK_THREAD_CPUTIME_ID-based profiler. It seems to work for GOARCH of 386, amd64, arm, and arm64.
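For anyone following along, the clock those timers are armed against is observable from ordinary Go code. A minimal sketch, assuming golang.org/x/sys/unix is available (this only illustrates CLOCK_THREAD_CPUTIME_ID; it is not the profiler’s implementation):

package main

import (
	"fmt"
	"runtime"
	"time"

	"golang.org/x/sys/unix"
)

var sink int

// threadCPUTime reads CLOCK_THREAD_CPUTIME_ID, the per-thread CPU clock
// that the timer_create-based profiler arms its timers against.
func threadCPUTime() time.Duration {
	var ts unix.Timespec
	if err := unix.ClockGettime(unix.CLOCK_THREAD_CPUTIME_ID, &ts); err != nil {
		panic(err)
	}
	return time.Duration(ts.Nano())
}

func main() {
	// The clock is per-thread, so keep this goroutine pinned to one thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	start := threadCPUTime()
	n := 0
	for i := 0; i < 1e8; i++ {
		n += i
	}
	sink = n // keep the loop from being optimized away
	fmt.Println("thread CPU time burned:", threadCPUTime()-start)
}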

Here are some results I observed on a 64-core arm64 / Aarch64 machine (with 64 real cores and no hyperthreading / SMT). Recall that “cpuHog1” would be better named “work done serially on one thread”, that “cpuHog2” is “work done across all available threads in parallel”, and that those functions each do the same amount of total work. The CL needs some refinement (what to do when a new M starts during profiling, etc), but I think this shows that (1) the skew in setitimer can be severe on readily-available hardware (here, a c6g.metal on EC2), and (2) timer_create seems to fix it.
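The updated reproducer isn’t pasted here, but the shape of the two loads is roughly the following (a hypothetical sketch reusing cpuHog1 from the program at the top of this issue; the real cpuHog2 may differ):

package repro

import (
	"runtime"
	"sync"
)

// Both loads burn the same total CPU. hogSerial runs it on a single
// goroutine; hogParallel splits it across GOMAXPROCS goroutines so it
// finishes in roughly 1/Nth of the wall-clock time.
func hogSerial(units int) {
	for i := 0; i < units; i++ {
		cpuHog1(i)
	}
}

func hogParallel(units int) {
	n := runtime.GOMAXPROCS(0)
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < units/n; i++ {
				cpuHog1(i)
			}
		}()
	}
	wg.Wait()
}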

CC @felixge

First with PS3 but GODEBUG=pproftimercreate=0 to get old setitimer profiling:

$ time GODEBUG=pproftimercreate=0 ./35057_arm64.test -test.cpuprofile=./35057_arm64_0.profile
PASS

real	0m10.421s
user	0m20.148s
sys	0m0.000s

$ go tool pprof -top ./35057_arm64_0.profile
File: 35057_arm64.test
Type: cpu
Time: Jun 4, 2021 at 3:41pm (PDT)
Duration: 10.42s, Total samples = 10330ms (99.15%)
Showing nodes accounting for 10330ms, 100% of 10330ms total
      flat  flat%   sum%        cum   cum%
   10050ms 97.29% 97.29%    10050ms 97.29%  issue/golang/go/35057.cpuHog1
     280ms  2.71%   100%      280ms  2.71%  issue/golang/go/35057.cpuHog2
         0     0%   100%    10330ms   100%  issue/golang/go/35057.TestCPUProfile.func1
         0     0%   100%    10330ms   100%  issue/golang/go/35057.TestCPUProfile.func2.1
         0     0%   100%    10330ms   100%  issue/golang/go/35057.cpuHogger

Then with the same executable with the new timer_create profiler via GODEBUG=pproftimercreate=1:

$ time GODEBUG=pproftimercreate=1 ./35057_arm64.test -test.cpuprofile=./35057_arm64_1.profile
PASS

real	0m10.436s
user	0m20.255s
sys	0m0.010s

$ go tool pprof -top ./35057_arm64_1.profile
File: 35057_arm64.test
Type: cpu
Time: Jun 4, 2021 at 3:41pm (PDT)
Duration: 10.43s, Total samples = 20.06s (192.28%)
Showing nodes accounting for 20.05s, 100% of 20.06s total
Dropped 1 node (cum <= 0.10s)
      flat  flat%   sum%        cum   cum%
    10.06s 50.15% 50.15%     10.06s 50.15%  issue/golang/go/35057.cpuHog1
     9.99s 49.80%   100%        10s 49.85%  issue/golang/go/35057.cpuHog2
         0     0%   100%     20.06s   100%  issue/golang/go/35057.TestCPUProfile.func1
         0     0%   100%     20.06s   100%  issue/golang/go/35057.TestCPUProfile.func2.1
         0     0%   100%     20.06s   100%  issue/golang/go/35057.cpuHogger

Is it worth a follow-up to move the go118UseTimerCreateProfiler flag to gate the test too?

I don’t think we need to bother. It will be pretty obvious in tests if the flag is disabled.

@rhysh thanks for all the detailed analysis on this. I think there is quite a compelling argument, and that we probably want this. I plan to review your CLs this week (sorry for the delay! Just got back from vacation.).

A few notes/thoughts I have:

  • This problem should also be solved by #36821, if accepted. However, PMUs won’t ever be available on all systems, so we’ll need a fallback.
  • I’m curious about the performance cost of enabling timers on thread creation/discovery, as well as the kernel overhead of managing all these extra timers. I don’t expect it to be severe, but it would be good to have some measurements to verify; a rough starting point is sketched below.
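One cheap starting point for those measurements (a proxy only: it benchmarks reading the per-thread clock rather than the timer_create/timer_delete calls themselves, and again assumes golang.org/x/sys/unix):

package repro

import (
	"testing"

	"golang.org/x/sys/unix"
)

// BenchmarkThreadClockRead times a read of the per-thread CPU clock that
// the new profiler's timers use. It is a proxy measurement only; it says
// nothing about the cost of creating and deleting timers on thread start
// and exit.
func BenchmarkThreadClockRead(b *testing.B) {
	var ts unix.Timespec
	for i := 0; i < b.N; i++ {
		if err := unix.ClockGettime(unix.CLOCK_THREAD_CPUTIME_ID, &ts); err != nil {
			b.Fatal(err)
		}
	}
}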