go: cmd/go: slow "native" performance with Mac OS X 10.14.1 and 10.12.6

Following a debug session with @fatih

What version of Go are you using (go version)?

go version go1.11.2 darwin/amd64

and

go version go1.11.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
GOARCH="amd64"
GOBIN="/Users/fatih/go/bin"
GOCACHE="/Users/fatih/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/fatih/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.11.2/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.11.2/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/k_/87syx3r50m93m72hvqj2qqlw0000gn/T/go-build130848285=/tmp/go-build -gno-record-gcc-switches -fno-common"

and

GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build416231211=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The symptom that @fatih and I were investigating was godef performing ~2x more slowly on his fast machine than on my slower machine.

@fatih’s machine specs for reference:

  Model Name:    iMac Pro
  Model Identifier:    iMacPro1,1
  Processor Name:    Intel Xeon W
  Processor Speed:    3 GHz
  Number of Processors:    1
  Total Number of Cores:    10
  L2 Cache (per Core):    1 MB
  L3 Cache:    13.8 MB
  Memory:    64 GB

OS X 10.14.1 (18B75)

We think we have a smaller reproduction that demonstrates what appears to be an OS X 10.14.n issue that may or may not be related to Go. But go list appears to be a good way to demonstrate the problem and hopefully therefore a good place to start further investigation.

The following was run in a Terminal on @fatih’s setup, and then within a Docker container on the same machine. Run-times of the go list commands were compared:

/bin/bash
(
set -eux
export GOPATH=$(mktemp -d)
command cd $(mktemp -d)
git clone https://github.com/digitalocean/csi-digitalocean
command cd csi-digitalocean/
git checkout b9ed6c2ea9ad85c5e4956bdfe418940bb5aa883a
rm go.sum
go mod tidy
command cd driver
time go list -export -deps -json -e . > /dev/null 2>&1
time go list -export -deps -json -e . > /dev/null 2>&1
time go list -export -deps -json -e . > /dev/null 2>&1
time go list -export -deps -json -e . > /dev/null 2>&1
) 2>&1 | tee output.txt

The “native” terminal run shows “average” speeds of ~300ms:

+ go list -export -deps -json -e .
real    0m1.524s
user    0m7.224s
sys     0m2.391s
+ go list -export -deps -json -e .
real    0m0.297s
user    0m0.420s
sys     0m0.898s
+ go list -export -deps -json -e .
real    0m0.300s
user    0m0.403s
sys     0m0.919s
+ go list -export -deps -json -e .
real    0m0.308s
user    0m0.402s
sys     0m0.968s

Whereas the Docker run shows “average” speeds of ~180ms:

+ go list -export -deps -json -e .

real    0m4.837s
user    0m12.070s
sys     0m5.030s
+ go list -export -deps -json -e .

real    0m0.186s
user    0m0.180s
sys     0m0.110s
+ go list -export -deps -json -e .

real    0m0.165s
user    0m0.160s
sys     0m0.120s
+ go list -export -deps -json -e .

real    0m0.187s
user    0m0.190s
sys     0m0.100s

What did you expect to see?

Similar run-times in each, possibly even faster times in the “native” OS X environment.

What did you see instead?

The “native” run-times are almost 70% longer.

cc @bcmills (and FYI @ianthehat for any go/packages side-effects)

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 19 (17 by maintainers)

Most upvoted comments

For the benefit of others, @marcan also posted this fantastic thread: https://twitter.com/marcan42/status/1494213855387734019

Just FWIW, this is a problem with Apple NVMe drives (T2 and M1 Macs alike). Their cache flush performance is abysmal. This affects both macOS and (native) Linux (presumably existing VM solutions fail to forward this as F_FULLFSYNC if they run faster). You get about 46 IOPS of F_FULLFSYNC on macOS on an M1 machine, and about the same on native Linux with fsync().

So there isn’t much to be done on the software side other than deciding whether you care about data integrity on macOS or not; on Linux there is no way for user processes to control this as far as I know (i.e. to flush data to the drive’s cache without also flushing the cache itself), though it can be emulated by setting the drive write cache type to “write through” in sysfs (which tells the kernel to never issue flush commands, so fsync() behaves like it does on macOS).

I don’t have an answer to why macOS 10.12 and 10.14 performance would be different, but I think I know why Docker and macOS are different.

As far as I understand, Docker on macOS actually runs a Linux kernel and userspace under a hypervisor. That means overhead for I/O and system calls is more similar to native Linux overhead than it is to macOS overhead. “go list” does a lot of stat and read system calls on small files, so the system call overhead matters a lot.

Here are the results from a quick benchmark I ran. The Read and Write tests are reading and writing slices of various sizes to a file (the file is almost certainly in the kernel file cache, so disk speed doesn’t matter). The Import test is measuring the time to call go/build.Context.Import on "fmt".

macOS native:

BenchmarkStat-8            	  300000	      4470 ns/op
BenchmarkRead/1K-8         	   50000	     27366 ns/op
BenchmarkRead/50K-8        	   50000	     30980 ns/op
BenchmarkRead/1024K-8      	   10000	    104866 ns/op
BenchmarkWrite/1K-8        	   10000	    151379 ns/op
BenchmarkWrite/50K-8       	    5000	    311861 ns/op
BenchmarkWrite/1024K-8     	    1000	   2020063 ns/op
BenchmarkImport-8          	    2000	    919822 ns/op

macOS Docker:

BenchmarkStat-4            	 1000000	      1222 ns/op
BenchmarkRead/1K-4         	  300000	      4101 ns/op
BenchmarkRead/50K-4        	  200000	      7405 ns/op
BenchmarkRead/1024K-4      	   20000	     92996 ns/op
BenchmarkWrite/1K-4        	   10000	    181860 ns/op
BenchmarkWrite/50K-4       	   10000	    341137 ns/op
BenchmarkWrite/1024K-4     	    1000	   2404169 ns/op
BenchmarkImport-4          	    2000	    580594 ns/op

Notably, the Stat and Read/1K tests are much slower on native macOS: roughly 3.7x for Stat (4470 vs 1222 ns/op) and 6.7x for Read/1K (27366 vs 4101 ns/op). Windows is quite a bit slower than this.

@randall77 is there some indication that this situation will improve in the future? Is Apple addressing the fsync issue? I’m now at 15 sec build times on Darwin vs 4.5 sec in a Fedora VM running on the same super-beefy hardware.