cmd/compile: CMOV instruction slower than branch

What version of Go are you using (go version)?

go1.11beta1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rk/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/rk/GoSpace/Projects"
GOPROXY=""
GORACE=""
GOROOT="/home/rk/GoSpace/GO"
GOTMPDIR=""
GOTOOLDIR="/home/rk/GoSpace/GO/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473408654=/tmp/go-build -gno-record-gcc-switches"
VGOMODROOT=""

What did you do?

I built the code below with Go 1.10 and Go 1.11: https://play.golang.org/p/22MEbiXFpzo

What did you expect to see?

The binary built by Go 1.11 is as fast as the one built by Go 1.10.

What did you see instead?

The binary built by Go 1.11 is much slower than the one built by Go 1.10.

Go 1.11 compiles the function "down" to assembly like this:

    MOVQ pos+32(SP), AX
    MOVQ list+8(SP), CX
    MOVL (CX)(AX*4), DX
    LEAQ 1(AX)(AX*1), BX
    MOVQ list+16(SP), SI
    DECQ SI
    JMP L42
L28:
    MOVL DI, (CX)(AX*4)
    LEAQ 1(BX)(BX*1), DI
    MOVQ BX, AX
    MOVQ DI, BX
L42:
    CMPQ BX, SI
    JGE L73
    MOVL 4(CX)(BX*4), DI
    LEAQ 1(BX), R8
    MOVL (CX)(BX*4), R9
    CMPL DI, R9
    CMOVQHI R8, BX
//  JHI L100
L66:
    MOVL (CX)(BX*4), DI
    CMPL DX, DI
    JCS L28
L73:    
    CMPQ BX, SI
    JNE L97
    MOVL (CX)(BX*4), SI
    CMPL DX, SI
    JCC L92
    MOVL SI, (CX)(AX*4)
L88:    
    MOVL DX, (CX)(BX*4)
    RET
L92:
    MOVQ AX, BX
    JMP L88
L97:    
    MOVQ AX, BX
    JMP L88
L100:
//  MOVQ R8, BX
//  JMP L66

Replacing the CMOV instruction with a branch (the commented-out JHI path above) makes the function about 80% faster.
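The Go source is only behind the playground link, but its shape can be reconstructed from the assembly above. A sketch (the function and variable names, and the uint32 element type, are guesses based on the MOVL/*4 addressing):

// Sift the value at pos down a binary max-heap stored in list.
// Reconstructed from the assembly; the playground code may differ in detail.
func down(list []uint32, pos int) {
	val := list[pos]
	kid := 2*pos + 1      // left child of pos
	last := len(list) - 1 // index of the last element
	for kid < last {
		if list[kid+1] > list[kid] { // the compare Go 1.11 turns into CMOVQHI
			kid++ // pick the larger child
		}
		if val >= list[kid] { // heap property already holds
			list[pos] = val
			return
		}
		list[pos] = list[kid] // pull the larger child up
		pos = kid
		kid = 2*kid + 1
	}
	if kid == last && val < list[kid] { // only a left child remains
		list[pos] = list[kid]
		pos = kid
	}
	list[pos] = val
}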

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 18 (14 by maintainers)

Most upvoted comments

I see similar things. I suspect there are two different issues - one with N=100 and one with large N. I’m going to focus on the large N for now.

With a few more data points (old=tip with cmov generation turned off, new=tip):

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          1752           2022           +15.41%
BenchmarkHeapSort/1000-8         53765          35192          -34.54%
BenchmarkHeapSort/10000-8        760440         502731         -33.89%
BenchmarkHeapSort/100000-8       9790058        7496200        -23.43%
BenchmarkHeapSort/1000000-8      131420794      130939404      -0.37%
BenchmarkHeapSort/2000000-8      294149147      333858182      +13.50%
BenchmarkHeapSort/10000000-8     2271577039     3583052169     +57.73%
BenchmarkHeapSort/20000000-8     5213875449     9264420912     +77.69%

As the array gets larger, cmov gets relatively slower. My L3 cache is 10MB, which is N=1.25e6, just about at the inflection point. The cmov is only being used to select the correct child, like this:

if list[kid+1] > list[kid] {
	kid++
}

The array size dependence seems an awful lot like cache effects. If the array we’re sorting fits in L3 cache, it is fast. If the array does not fit, it’s slower. My suspicion is that when using branches, if the branch is predicted correctly (probably 50% of the time) we can issue the next read (the grandchild in the heap) in parallel with the current read. Using cmov, there’s a data dependence from one load to the next, so we never issue two of them in parallel. That might be enough of an effect to explain the result - the cmov serializes the reads, and the cache miss is so slow that the branch mispredict & restart can all be hidden in the cache miss.

It seems a pretty niche case where this effect comes into play. I wouldn’t want to remove cmov generation because of it.

The benchmark code has a problem: after the first iteration, the slice a is already sorted. Here is a fix: https://play.golang.org/p/b_-IC1OSEWq
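The fixed benchmark is also only behind a playground link; the usual shape of such a fix is to restore the unsorted input before every sort. A sketch, assuming a heapSort function and this set of sizes:

// heap_test.go (sketch): regenerate the unsorted input on every iteration,
// so iterations after the first don't sort an already-sorted slice.
package heap

import (
	"math/rand"
	"strconv"
	"testing"
)

func BenchmarkHeapSort(b *testing.B) {
	for _, n := range []int{100, 1000, 10000, 100000, 1000000, 10000000} {
		b.Run(strconv.Itoa(n), func(b *testing.B) {
			rng := rand.New(rand.NewSource(1)) // fixed seed for comparable runs
			orig := make([]uint32, n)
			for i := range orig {
				orig[i] = rng.Uint32()
			}
			a := make([]uint32, n)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				copy(a, orig) // restore the unsorted input
				heapSort(a)
			}
		})
	}
}

The copy is measured along with the sort, but it is linear and cheap next to the O(n log n) sort, so the deltas below remain meaningful.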

The new benchstat looks like this:

name                 old time/op  new time/op  delta
HeapSort/100-4       5.59µs ± 3%  3.74µs ± 4%  -33.14%  (p=0.000 n=10+10)
HeapSort/1000-4      75.0µs ± 0%  50.1µs ± 1%  -33.26%   (p=0.000 n=7+10)
HeapSort/10000-4      879µs ± 2%   610µs ± 0%  -30.61%   (p=0.000 n=10+8)
HeapSort/100000-4    10.8ms ± 0%   8.8ms ± 1%  -18.80%   (p=0.000 n=8+10)
HeapSort/1000000-4    141ms ± 1%   146ms ± 1%   +3.62%  (p=0.000 n=10+10)
HeapSort/10000000-4   2.54s ± 3%   4.23s ± 6%  +66.86%  (p=0.000 n=10+10)

Apparently branch prediction plus speculative execution of loads beats CMOV once the slice size exceeds the L2 cache.

I can somewhat verify my claim: adding a prefetch to the benchmark mostly fixes the large-N behavior. At the head of the loop in down, do:

if kid*2 < last {
	_ = atomic.LoadUint32(&list[kid*2]) // atomic here is sync/atomic
}
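In context, a sketch combining this with the down reconstructed earlier (an assumption, not quoted from the thread; kid*2 only approximates the grandchildren at 2*kid+1 and 2*kid+2, but it lands on or near their cache line, and the atomic load presumably keeps the compiler from dropping the otherwise-dead read):

import "sync/atomic"

func down(list []uint32, pos int) {
	val := list[pos]
	kid := 2*pos + 1
	last := len(list) - 1
	for kid < last {
		if kid*2 < last {
			// Issue the grandchild load early, so the memory access
			// overlaps with the CMOV-dependent load of list[kid].
			_ = atomic.LoadUint32(&list[kid*2])
		}
		if list[kid+1] > list[kid] { // still compiled to a CMOV
			kid++
		}
		if val >= list[kid] {
			list[pos] = val
			return
		}
		list[pos] = list[kid]
		pos = kid
		kid = 2*kid + 1
	}
	if kid == last && val < list[kid] {
		list[pos] = list[kid]
		pos = kid
	}
	list[pos] = val
}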

Now old=nocmov, new=tip:

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          2029           2262           +11.48%
BenchmarkHeapSort/1000-8         56825          37950          -33.22%
BenchmarkHeapSort/10000-8        807093         526252         -34.80%
BenchmarkHeapSort/100000-8       10113466       6808430        -32.68%
BenchmarkHeapSort/1000000-8      130448267      101057677      -22.53%
BenchmarkHeapSort/2000000-8      295489785      243496350      -17.60%
BenchmarkHeapSort/4000000-8      721814370      646561035      -10.43%
BenchmarkHeapSort/6000000-8      1202974592     1161344992     -3.46%
BenchmarkHeapSort/8000000-8      1729255015     1735633353     +0.37%
BenchmarkHeapSort/10000000-8     2274368639     2370006748     +4.21%
BenchmarkHeapSort/20000000-8     5265656303     5989478005     +13.75%

There is still some effect, but it is much reduced.