go: runtime: FreeBSD nanotime performance regression with different CPU setups

The setup

$ go version
go version go1.17.6 freebsd/amd64
$ freebsd-version
12.2-RELEASE

1.17.6 is currently the latest release.

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="freebsd"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1526234500=/tmp/go-build -gno-record-gcc-switches"

The case

I’ve found that scheduler performance on FreeBSD depends heavily on the CPU setup. It is fine on a single-CPU VirtualBox instance running on my Mac, but it drops sharply on VirtualBox instances with more than one CPU and on DigitalOcean instances with any number of CPUs (as far as I know, DigitalOcean uses KVM as the hypervisor for its instances).

This benchmark can be used as a minimal reproducer:

$ cat scheduler_test.go

package whatever

import (
	"testing"
)

func BenchmarkScheduler(b *testing.B) {
	ch := make(chan struct{})

	go func() {
		for range ch {}
	}()

	for i := 0; i < b.N; i++ {
		ch <- struct{}{}
	}
}

$ go test -bench=. scheduler_test.go

When this benchmark runs on a single-CPU VirtualBox instance, it performs reasonably well.

$ sysctl hw.ncpu
hw.ncpu: 1

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler 	 7658449	       152.8 ns/op
PASS
ok  	command-line-arguments	1.331s

However, when the benchmark runs on a VirtualBox instance with more than one CPU, or on a DigitalOcean instance with any number of CPUs, performance drops significantly (roughly 8x to 13x slower in the runs below).

$ sysctl hw.ncpu
hw.ncpu: 2

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler-2   	 1108549	      1163 ns/op
PASS
ok  	command-line-arguments	2.213s

$ sysctl hw.ncpu
hw.ncpu: 1

$ go test -bench=. scheduler_test.go

goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	  706951	      1949 ns/op
PASS
ok  	command-line-arguments	1.405s

Profiling shows that the scheduler code on FreeBSD relies heavily on the time machinery, so the performance of runtime.nanotime() strongly affects the scheduler.

[Figures not included: execution flow for 1-CPU VirtualBox, for 2-CPU VirtualBox, and for 1-CPU DigitalOcean]

I’ve also benchmarked the time.Now() function in the same setups. Those results correlate with the scheduler ones.
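The time.Now() measurement is not shown in the issue, so the following is only a minimal sketch of how such a check could look, not the reporter's actual benchmark. The helper name avgNowCost and the call count are assumptions made for illustration. It times a tight loop of time.Now() calls and reports the mean cost per call.

```go
package main

import (
	"fmt"
	"time"
)

// avgNowCost (hypothetical helper) calls time.Now() n times and
// returns the mean wall-clock cost of a single call.
func avgNowCost(n int) time.Duration {
	var sink time.Time // keeps the result of each call alive
	start := time.Now()
	for i := 0; i < n; i++ {
		sink = time.Now()
	}
	elapsed := time.Since(start)
	_ = sink
	return elapsed / time.Duration(n)
}

func main() {
	const calls = 1000000 // arbitrary sample size for this sketch
	fmt.Printf("time.Now(): ~%v per call over %d calls\n", avgNowCost(calls), calls)
}
```

Running this on each setup (1-CPU VirtualBox, 2-CPU VirtualBox, DigitalOcean) should make it possible to compare clock-read cost directly, without the scheduler involved.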

About this issue

State: open
Created 2 years ago
Comments: 24 (11 by maintainers)

Commits related to this issue

runtime: fast clock_gettime on FreeBSD, split getHPETTimecounter Call only initHPETTimecounter on the system stack. Use O_CLOEXEC flag when opening the HPET device. FreeBSD 12.3-RELEASE-p2, AMD FX-8... — committed to golang/go by paulzhol 2 years ago
runtime: fast clock_gettime on FreeBSD, use physical/virtual ARM timer as setup by the kernel on GOARCH=arm64. Update #50947 Change-Id: I2f44be9b36e9ce8d264eccc0aa3df10825c5f4f9 Reviewed-on: https:... — committed to golang/go by paulzhol 2 years ago

Most upvoted comments

I’ve sent a small improvement above for the HPET timecounter path, so that it switches to the system stack only once, on the first call. My system (AMD FX-8300 hardware) shows a small improvement in BenchmarkNow. However, it behaves very differently from yours.

kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 1
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-safe(850) HPET(950) i8254(0) TSC-low(1000) dummy(-1000000)
kern.timecounter.hardware: TSC-low
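To make the kern.timecounter.choice line easier to read, here is a small sketch that parses it and picks the entry with the largest number in parentheses. It assumes that number is the kernel's quality rating for each timecounter, which is consistent with kern.timecounter.hardware being TSC-low in the output above. The type and function names (timecounter, parseChoice, best) are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// timecounter is one entry of kern.timecounter.choice, e.g. "HPET(950)".
// Quality is assumed to be the rating shown in parentheses.
type timecounter struct {
	Name    string
	Quality int
}

// parseChoice splits a kern.timecounter.choice value into its entries.
func parseChoice(s string) ([]timecounter, error) {
	var out []timecounter
	for _, f := range strings.Fields(s) {
		open := strings.LastIndex(f, "(")
		if open < 0 || !strings.HasSuffix(f, ")") {
			return nil, fmt.Errorf("malformed entry %q", f)
		}
		q, err := strconv.Atoi(f[open+1 : len(f)-1])
		if err != nil {
			return nil, fmt.Errorf("bad quality in %q: %w", f, err)
		}
		out = append(out, timecounter{Name: f[:open], Quality: q})
	}
	return out, nil
}

// best returns the entry with the highest quality, if there is one.
func best(tcs []timecounter) (timecounter, bool) {
	if len(tcs) == 0 {
		return timecounter{}, false
	}
	b := tcs[0]
	for _, tc := range tcs[1:] {
		if tc.Quality > b.Quality {
			b = tc
		}
	}
	return b, true
}

func main() {
	// Value copied from the sysctl output above.
	const choice = "ACPI-safe(850) HPET(950) i8254(0) TSC-low(1000) dummy(-1000000)"
	tcs, err := parseChoice(choice)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if b, ok := best(tcs); ok {
		fmt.Printf("highest quality: %s (%d)\n", b.Name, b.Quality)
	}
}
```

For the line above this prints "highest quality: TSC-low (1000)".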

For example, forcing HPET and then disabling kern.timecounter.fast_gettime (which forces Go to fall back to the regular syscall, see https://github.com/golang/go/blob/7b1ba972dc5687f6746b2299b047f44e38bc6686/src/runtime/vdso_freebsd.go#L52-L54 and https://github.com/golang/go/blob/7b1ba972dc5687f6746b2299b047f44e38bc6686/src/runtime/vdso_freebsd.go#L99-L114) makes a noticeable difference:

root@relic:~ # sysctl -w kern.timecounter.hardware=HPET
kern.timecounter.hardware: TSC-low -> HPET

paulzhol@relic:~/go/src/time % ../../bin/go test -run=NONE -bench=BenchmarkNow ./... > old_hpet.txt

root@relic:/tmp # sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0

paulzhol@relic:~/go/src/time % ../../bin/go test -run=NONE -bench=BenchmarkNow ./... > old_hpet_no_fast.txt

paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp old_hpet_no_fast.txt old_hpet.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                   old ns/op     new ns/op     delta
BenchmarkNow-8              1832          1420          -22.49%
BenchmarkNowUnixNano-8      1834          1421          -22.52%
BenchmarkNowUnixMilli-8     1835          1423          -22.45%
BenchmarkNowUnixMicro-8     1833          1423          -22.37%

ACPI-safe in particular is much slower:

paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp baseline_acpi.txt old_hpet.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                   old ns/op     new ns/op     delta
BenchmarkNow-8              6843          1420          -79.25%
BenchmarkNowUnixNano-8      6946          1421          -79.54%
BenchmarkNowUnixMilli-8     6914          1423          -79.42%
BenchmarkNowUnixMicro-8     6923          1423          -79.45%
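The delta columns in the two tables can be reproduced by hand. The sketch below assumes the delta is the relative change (new - old) / old, expressed in percent; that assumption matches the printed values. The function names delta and formatDelta are hypothetical.

```go
package main

import "fmt"

// delta returns the relative change from oldNs to newNs in percent.
// Assumption: this is how the delta column above is derived.
func delta(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

// formatDelta renders the change with a sign and two decimals.
func formatDelta(oldNs, newNs float64) string {
	return fmt.Sprintf("%+.2f%%", delta(oldNs, newNs))
}

func main() {
	// BenchmarkNow-8 values taken from the two tables above.
	fmt.Println("HPET, syscall vs fast_gettime:", formatDelta(1832, 1420))
	fmt.Println("ACPI-safe vs HPET:", formatDelta(6843, 1420))
}
```

For BenchmarkNow-8 this gives -22.49% and -79.25%, matching the tables. Put differently, ACPI-safe is about 4.8 times slower per call than HPET on this machine (6843 / 1420).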

FreeBSD 12.3 added support for kvmclock. I believe it should be much faster than HPET (which is essentially a non-paravirtualized, emulated hardware device). I’ll try to prepare a PR for it when I have the time.

paulzhol on Mar 11, 2022