go: runtime: FreeBSD nanotime performance regression with different CPU setups
The setup
$ go version
go version go1.17.6 freebsd/amd64
$ freebsd-version
12.2-RELEASE
1.17.6 is currently the latest release.
go env output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="freebsd"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1526234500=/tmp/go-build -gno-record-gcc-switches"
The case
I’ve found that the performance of the scheduler on FreeBSD depends heavily on the CPU setup. While it is fine on a single-CPU VirtualBox instance running on my Mac, it degrades a lot on a VirtualBox instance with more than one CPU, and on DigitalOcean instances with any number of CPUs (as far as I know, DigitalOcean uses KVM as the hypervisor for their instances).
This benchmark can be used as a minimal reproducer:
$ cat scheduler_test.go
package whatever

import (
	"testing"
)

func BenchmarkScheduler(b *testing.B) {
	ch := make(chan struct{})
	go func() {
		for range ch {
		}
	}()
	for i := 0; i < b.N; i++ {
		ch <- struct{}{}
	}
}
$ go test -bench=. scheduler_test.go
When this benchmark is run on a single-CPU VirtualBox instance, it performs reasonably well.
$ sysctl hw.ncpu
hw.ncpu: 1
$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler 7658449 152.8 ns/op
PASS
ok command-line-arguments 1.331s
However, if the benchmark is run on a VirtualBox instance with more than one CPU, or on a DigitalOcean instance with any number of CPUs, a significant decrease in performance emerges.
$ sysctl hw.ncpu
hw.ncpu: 2
$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler-2 1108549 1163 ns/op
PASS
ok command-line-arguments 2.213s
$ sysctl hw.ncpu
hw.ncpu: 1
$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 706951 1949 ns/op
PASS
ok command-line-arguments 1.405s
Profiling shows that the scheduler code for FreeBSD relies heavily on the time machinery, and the performance of runtime.nanotime() strongly affects the scheduler.
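For reference, a CPU profile of the reproducer can also be collected outside of go test with a standalone program along the following lines (a minimal sketch; the output file name and iteration count are arbitrary). Within the testing framework the equivalent is go test -bench=. -cpuprofile=cpu.out scheduler_test.go, inspected with go tool pprof.

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Write the CPU profile to cpu.out so it can be inspected with
	// `go tool pprof cpu.out`.
	f, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// Same channel ping-pong as the benchmark, with a fixed iteration count.
	ch := make(chan struct{})
	go func() {
		for range ch {
		}
	}()
	for i := 0; i < 1000000; i++ {
		ch <- struct{}{}
	}
	close(ch)
}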
Execution flow for 1-CPU VirtualBox (profile graph)
Execution flow for 2-CPU VirtualBox (profile graph)
Execution flow for 1-CPU DigitalOcean (profile graph)
I’ve also benchmarked the time.Now() function in the same setups; its results correlate with the scheduler ones.
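A minimal benchmark of time.Now() for this purpose can look like this (a sketch; the file can live next to scheduler_test.go):

package whatever

import (
	"testing"
	"time"
)

// BenchmarkNow measures the cost of time.Now(), which on FreeBSD goes through
// the same timecounter machinery as runtime.nanotime().
func BenchmarkNow(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = time.Now()
	}
}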
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 24 (11 by maintainers)
Commits related to this issue
- runtime: fast clock_gettime on FreeBSD, split getHPETTimecounter Call only initHPETTimecounter on the system stack. Use O_CLOEXEC flag when opening the HPET device. FreeBSD 12.3-RELEASE-p2, AMD FX-8... — committed to golang/go by paulzhol 2 years ago
- runtime: fast clock_gettime on FreeBSD, use physical/virtual ARM timer as setup by the kernel on GOARCH=arm64. Update #50947 Change-Id: I2f44be9b36e9ce8d264eccc0aa3df10825c5f4f9 Reviewed-on: https:... — committed to golang/go by paulzhol 2 years ago
I’ve sent a small improvement above, making the HPET timecounter path switch to the system stack only once, on the first call. My system (AMD FX-8300 hardware) shows a small improvement in BenchmarkNow. However, it behaves very differently from yours. For example, when forcing HPET and then disabling kern.timecounter.fast_gettime (which forces Go to fall back to the regular syscall):
https://github.com/golang/go/blob/7b1ba972dc5687f6746b2299b047f44e38bc6686/src/runtime/vdso_freebsd.go#L52-L54
https://github.com/golang/go/blob/7b1ba972dc5687f6746b2299b047f44e38bc6686/src/runtime/vdso_freebsd.go#L99-L114
there’s a noticeable difference:
Specifically, the ACPI-safe timecounter is much slower:
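For anyone reproducing this comparison: the timecounter is selected with sysctl(8) as root (e.g. sysctl kern.timecounter.hardware=HPET, sysctl kern.timecounter.fast_gettime=0). A minimal sketch that reports the current settings from Go, using golang.org/x/sys/unix, could look like this:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Active timecounter, e.g. "HPET", "ACPI-safe", "TSC-low", "kvmclock".
	hw, err := unix.Sysctl("kern.timecounter.hardware")
	if err != nil {
		panic(err)
	}
	// Timecounters the kernel offers, with their quality values.
	choice, err := unix.Sysctl("kern.timecounter.choice")
	if err != nil {
		panic(err)
	}
	// 1 means the fast gettime path is allowed, 0 forces the regular syscall.
	fast, err := unix.SysctlUint32("kern.timecounter.fast_gettime")
	if err != nil {
		panic(err)
	}
	fmt.Printf("hardware=%s\nchoice=%s\nfast_gettime=%d\n", hw, choice, fast)
}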
FreeBSD 12.3 added support for kvmclock; I believe it should be much faster than using HPET (which is essentially a non-paravirtualized, emulated hardware device). I’ll try to prepare a PR for it when I have the time.