go: cmd/compile: CMOV instruction slower than branch
What version of Go are you using (go version)?
go1.11beta1 linux/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rk/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/rk/GoSpace/Projects"
GOPROXY=""
GORACE=""
GOROOT="/home/rk/GoSpace/GO"
GOTMPDIR=""
GOTOOLDIR="/home/rk/GoSpace/GO/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473408654=/tmp/go-build -gno-record-gcc-switches"
VGOMODROOT=""
What did you do?
I built the code below with go 1.10 and go 1.11. https://play.golang.org/p/22MEbiXFpzo
What did you expect to see?
The binary built by go 1.11 is as fast as that built by go 1.10.
What did you see instead?
The binary built by go 1.11 is much slower than the one built by go 1.10.
Go 1.11 compiles the function "down" to assembly like this:
MOVQ pos+32(SP), AX
MOVQ list+8(SP), CX
MOVL (CX)(AX*4), DX
LEAQ 1(AX)(AX*1), BX
MOVQ list+16(SP), SI
DECQ SI
JMP L42
L28:
MOVL DI, (CX)(AX*4)
LEAQ 1(BX)(BX*1), DI
MOVQ BX, AX
MOVQ DI, BX
L42:
CMPQ BX, SI
JGE L73
MOVL 4(CX)(BX*4), DI
LEAQ 1(BX), R8
MOVL (CX)(BX*4), R9
CMPL DI, R9
CMOVQHI R8, BX
// JHI L100
L66:
MOVL (CX)(BX*4), DI
CMPL DX, DI
JCS L28
L73:
CMPQ BX, SI
JNE L97
MOVL (CX)(BX*4), SI
CMPL DX, SI
JCC L92
MOVL SI, (CX)(AX*4)
L88:
MOVL DX, (CX)(BX*4)
RET
L92:
MOVQ AX, BX
JMP L88
L97:
MOVQ AX, BX
JMP L88
L100:
// MOVQ R8, BX
// JMP L66
If the CMOV instruction is replaced with a branch (the commented-out instructions above), the function runs about 80% faster.
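The source behind the playground link above is not reproduced in this report, so for reference, here is a sketch of a sift-down reconstructed from the assembly; the element type, function signature, and exact tail handling are assumptions, but it shows the compare that go1.11 turns into CMOVQHI:

func down(list []uint32, pos int) {
	val := list[pos]      // DX
	kid := 2*pos + 1      // BX
	last := len(list) - 1 // SI
	for kid < last {
		// Child selection: this compare/increment is what go1.11
		// compiles to CMOVQHI instead of a branch.
		if list[kid+1] > list[kid] {
			kid++
		}
		if val >= list[kid] {
			list[pos] = val
			return
		}
		list[pos] = list[kid] // move the larger child up
		pos = kid
		kid = 2*kid + 1
	}
	if kid == last && val < list[kid] { // only one child left at the end
		list[pos] = list[kid]
		pos = kid
	}
	list[pos] = val
}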
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 18 (14 by maintainers)
I see similar things. I suspect there are two different issues, one with N=100 and one with large N. I'm going to focus on the large-N case for now.
With a few more data points (old=tip with cmov generation turned off, new=tip):
As the array gets larger, cmov gets relatively slower. My L3 cache is 10MB, which corresponds to N=1.25e6, just about at the inflection point. The cmov is only being used to select the correct child, like this:
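The snippet this comment refers to is not included in this copy of the issue; judging from the assembly, the child selection in question is presumably the compare from the reconstruction above:

	// Pick the larger of the two children; this is the only use of the CMOV.
	if list[kid+1] > list[kid] {
		kid++
	}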
The array size dependence seems an awful lot like cache effects. If the array we’re sorting fits in L3 cache, it is fast. If the array does not fit, it’s slower. My suspicion is that when using branches, if the branch is predicted correctly (probably 50% of the time) we can issue the next read (the grandchild in the heap) in parallel with the current read. Using cmov, there’s a data dependence from one load to the next, so we never issue two of them in parallel. That might be enough of an effect to explain the result - the cmov serializes the reads, and the cache miss is so slow that the branch mispredict & restart can all be hidden in the cache miss.
It seems a pretty niche case where this effect comes into play. I wouldn’t want to remove cmov generation because of it.
The benchmark code has the problem that after the first iteration the slice a is already sorted, so later iterations measure sorting already-sorted input. Here is a fix: https://play.golang.org/p/b_-IC1OSEWq
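The actual fix is behind the playground link; as a rough sketch of the general approach (the names sortFunc and n are placeholders, not from the original), the unsorted input can be restored outside the timed region on every iteration:

package heap_test

import (
	"math/rand"
	"testing"
)

const n = 1 << 20 // illustrative slice size

// sortFunc stands in for the heapsort under test (which calls down).
var sortFunc = func(a []uint32) { /* ... */ }

func BenchmarkDown(b *testing.B) {
	data := make([]uint32, n)
	work := make([]uint32, n)
	for i := range data {
		data[i] = rand.Uint32()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		copy(work, data) // restore the unsorted input each iteration
		b.StartTimer()
		sortFunc(work)
	}
}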
The new benchstat looks like this:
Apparently branch prediction and speculative execution (of loads) are faster than CMOV once the slice size exceeds the level-2 cache.
I can somewhat verify my claim. I can add a prefetch to the benchmark, and that mostly fixes the large-N behavior. At the head of the loop in down, issue a prefetch (see the sketch below). Now old=nocmov, new=tip:
There is still some effect, but it is much reduced.
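The prefetch code from the comment is not shown in this copy. One way to get a similar effect in pure Go (an assumption about the approach, not necessarily what was actually benchmarked) is a dummy read of the grandchild area at the head of the loop, so that cache line starts loading before the data-dependent child selection:

// sink is a package-level variable so the compiler cannot drop the read below.
var sink uint32

// Inside down, at the head of the loop (the grandchild index is an assumption):
for kid < last {
	if g := 2*kid + 1; g < last {
		sink = list[g] // start pulling the grandchildren's cache line in early
	}
	if list[kid+1] > list[kid] {
		kid++
	}
	// ... rest of the loop unchanged ...
}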