go: cmd/compile: go1.7beta2 performance regression

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

    go version go1.6.2 darwin/amd64
    go version go1.7beta2 darwin/amd64

What operating system and processor architecture are you using (go env)?

Intel i7-3540M (also tried on i7-2677M but didn’t see the same regression)

    GOARCH="amd64"
    GOHOSTARCH="amd64"
    GOHOSTOS="darwin"
    GOOS="darwin"
    CC="clang"
    GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/30/6hyj4x_x783f12hmbcmn5tt00000gn/T/go-build183604977=/tmp/go-build -gno-record-gcc-switches -fno-common"
    CXX="clang++"
    CGO_ENABLED="1"

What did you do?

https://play.golang.org/p/3WDEr-_QZR

Disassembly (-gcflags -S): https://gist.github.com/samuel/3053bafe149a0459322f6eeaf8bd5ae5

What did you expect to see?

Go 1.7beta2 to have the same performance as Go 1.6, or better.

What did you see instead?

    benchmark                old ns/op     new ns/op     delta
    BenchmarkVScaleF32-4     2906          3801          +30.80%
    BenchmarkVMaxF32-4       2682          3951          +47.32%

    benchmark                old MB/s     new MB/s     speedup
    BenchmarkVScaleF32-4     1409.19      1077.37      0.76x
    BenchmarkVMaxF32-4       1527.02      1036.64      0.68x

About this issue

State: open
Created 8 years ago
Comments: 15 (13 by maintainers)

Commits related to this issue

cmd/compile: missing float indexed loads/stores on amd64

Update #16141
Change-Id: I7d32c5cdc197d86491a67ea579fa16cb3d675b51
Reviewed-on: https://go-review.googlesource.com/28273
Run-TryBot: Keith Ra...

Committed to golang/go by randall77 8 years ago

Most upvoted comments

I can’t access the playground link (Forbidden), so I couldn’t build it myself to play with the code or check perf counters. At a glance, it looks like the tip version makes ucomiss always depend on the previous iteration (X0 is always modified).

TocarIP on Sep 1, 2016

I confess I have no idea what is going on here. @TocarIP, here is the 1.6 disassembly:

    0x0045 00069 (tmp1.go:10)   MOVSS   (CX), X0
    0x0049 00073 (tmp1.go:11)   CMPQ    AX, DI
    0x004c 00076 (tmp1.go:11)   JCC $1, 103
    0x004e 00078 (tmp1.go:11)   LEAQ    (R9)(AX*4), BX
    0x0052 00082 (tmp1.go:11)   MULSS   X2, X0
    0x0056 00086 (tmp1.go:11)   MOVSS   X0, (BX)
    0x005a 00090 (tmp1.go:10)   ADDQ    $4, CX
    0x005e 00094 (tmp1.go:10)   INCQ    AX
    0x0061 00097 (tmp1.go:10)   CMPQ    AX, SI
    0x0064 00100 (tmp1.go:10)   JLT $0, 69

tip disassembly:

    0x0044 00068 (tmp1.go:10)   MOVSS   (BX), X1
    0x0048 00072 (tmp1.go:11)   MULSS   X0, X1
    0x004c 00076 (tmp1.go:11)   CMPQ    SI, DX
    0x004f 00079 (tmp1.go:11)   JCC $0, 99
    0x0051 00081 (tmp1.go:11)   MOVSS   X1, (CX)(SI*4)
    0x0056 00086 (tmp1.go:10)   ADDQ    $4, BX
    0x005a 00090 (tmp1.go:10)   INCQ    SI
    0x005d 00093 (tmp1.go:10)   CMPQ    SI, AX
    0x0060 00096 (tmp1.go:10)   JLT $0, 68

Yet tip is 28% slower, for no reason I can discern.

Max is even worse. go1.6:

    0x004a 00074 (tmp1.go:17)   MOVSS   (AX), X2
    0x004e 00078 (tmp1.go:18)   UCOMISS X3, X2
    0x0051 00081 (tmp1.go:18)   JHI 106
    0x0053 00083 (tmp1.go:17)   ADDQ    $4, AX
    0x0057 00087 (tmp1.go:17)   INCQ    CX
    0x005a 00090 (tmp1.go:17)   CMPQ    CX, DX
    0x005d 00093 (tmp1.go:17)   JLT $0, 74

    0x006a 00106 (tmp1.go:19)   MOVSS   X2, X3
    0x006e 00110 (tmp1.go:17)   JMP 83

tip:

    0x0036 00054 (tmp1.go:17)   MOVSS   (CX), X1
    0x003a 00058 (tmp1.go:18)   UCOMISS X0, X1
    0x003d 00061 (tmp1.go:18)   JLS 94
    0x003f 00063 (tmp1.go:17)   ADDQ    $4, CX
    0x0043 00067 (tmp1.go:17)   INCQ    DX
    0x0046 00070 (tmp1.go:22)   MOVUPS  X1, X0
    0x0049 00073 (tmp1.go:17)   CMPQ    DX, AX
    0x004c 00076 (tmp1.go:17)   JLT $0, 54

    0x005e 00094 (tmp1.go:22)   MOVUPS  X0, X1
    0x0061 00097 (tmp1.go:17)   JMP 63

Tip has one extra reg->reg move, yet it is 2.3x slower! 2.3x, people. You could compute Max twice using 1.6 and still have some time left over. What the heck is going on? (I tried MOVUPS->MOVSS, didn’t help.)

randall77 on Aug 31, 2016
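Since the playground link is reported as inaccessible, here is a rough sketch of Go source that would be consistent with the listings above: a range loop with a bounds-checked store for the scale case, and a compare-and-update loop for the max case. This is an assumption reconstructed from the disassembly and the benchmark names, not the reporter's actual code; the signatures, the empty-slice handling, and the initial value of the running maximum are all guesses.

```go
package main

import "fmt"

// VScaleF32 is a hypothetical reconstruction of the function under benchmark.
// It writes input[i]*scale into output[i]. The store into output is bounds
// checked on every iteration, which would match the CMPQ/JCC pair in the
// scale listings.
func VScaleF32(input, output []float32, scale float32) {
	for i, v := range input {
		output[i] = v * scale
	}
}

// VMaxF32 is a hypothetical reconstruction as well. It returns the largest
// element of input (0 for an empty slice, which is an assumption). The
// `v > m` test would correspond to the UCOMISS plus conditional jump in the
// max listings, and `m = v` to the register-to-register move.
func VMaxF32(input []float32) float32 {
	if len(input) == 0 {
		return 0
	}
	m := input[0]
	for _, v := range input {
		if v > m {
			m = v
		}
	}
	return m
}

func main() {
	in := []float32{1, 5, 3}
	out := make([]float32, len(in))
	VScaleF32(in, out, 2)
	fmt.Println(out, VMaxF32(in))
}
```

Saving this to a file and building it with `go build -gcflags -S` under each Go version should produce listings that can be compared with the ones quoted in the comment, though register allocation and offsets will likely differ.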