go: cmd/compile: unexpected performance difference accessing slices with different caps
Using tip (go version devel +b794ca64d2 Mon Sep 3 07:14:25 2018 +0000 linux/amd64)
While implementing some bounds check optimizations in image/draw
CL 136935 I came across an unexpected performance difference between accessing elements of a slice s created like this
s := spix[i : i+4 : len(spix)]
and
s := spix[i : i+4 : i+4]
Both forms eliminate bound checks for accessing elements 0-3 of s but the second form has a significant performance improvement over the first (see CL for benchmarks).
Building with go build -gcflags="-d=ssa/check_bce/debug=1"
shows no difference in bounds checks between the two forms so presumably it has something to do with the resulting size of the slice.
Attached is a simple program and disassembly derived from the image/draw
code that demonstrates the effect. You can see that the assembly generated for the second form is quite a bit shorter.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Comments: 17 (13 by maintainers)
Commits related to this issue
- image/draw: optimize bounds checks in loops Use subslices with known length and cap to give bounds checking hints to the compiler. Improves over the earlier pointer based optimizations in https://go-... — committed to golang/go by iand 6 years ago
- image: optimize bounds checking for At and Set methods Use a subslice of the pixel data to give the compiler hints for bounds checking. Only do this for image formats that require 4 or more slice rea... — committed to golang/go by iand 6 years ago
Go has a special case when slicing and generating the empty slice. This has the potential to generate a pointer to the next object in memory, like this:
The resulting
x
has capacity 0, and if we just computed its base pointer asx.ptr += 4*sizeof(int)
thenx.ptr
would point to the next object in memory after the[4]int
that was allocated. This is bad because it could mistakenly retain a random object in the heap.So when slicing, we need to somehow avoid contructing such a pointer. Currently we do this by avoiding the add when the capacity is 0. So we do
x.ptr += newcap == 0 ? 0 : 4*sizeof(int)
. And we do it without a conditional, using shift and mask tricks.When you do
spix[i:i+4:i+4]
the new capacity is 4, and the compiler knows that. So the conditional in the expression above constant folds, and we just getx.ptr += i*sizeof(int)
. When you dospix[i:i+4:len(spix)]
, it’s not obvious whether the resulting slice has capacity 0 or not, so the arithmetic is still all there.The optimization that would fix this is that if the resulting slice is known to have nonzero length (we know that here because i+4-i > 0), we don’t need to guard against next object pointers.
Interesting. How often do we slice in a loop as opposed to straight line code? Seems worth exploring.
Another, crazier, idea is to change the underlying representation in memory of all slices to be {ptr, len, cap-len/extra/slack}. This would make calling
cap
involve addition, but I think that callingcap
is much rarer than slicing, appending, etc. This would probably break so much code as to be impossible, though.Conflict, I think. The idea behind OpReslice is to bundle a bunch of these calculations into a single op (easier to deal with during prove, etc.) and do the complex decomposition on a per-arch basis. I’m quite content to let you experiment with your ideas first, since they seem quite promising. And if you end up not liking the results, we can experiment some with OpReslice etc.
@mundaym if you want a real life test case, the code in https://github.com/golang/go/issues/31586#issuecomment-486469682 spends about 50% of its time reslicing in a loop. TL;DR:
go get nhooyr.io/websocket
,git checkout 40b4
,go test -bench=Mask/4096//nhooyr
, line 74 (b = b[8:]
) is >50% of execution time. (cc @nhooyr)One argument in favor of writing out the if statement: For code like this
Then we can entirely avoid adjusting ptr, whereas with masking tricks we always end up masking. (We could add a special check for boundedness, like we do with shifts, and perhaps should, but with a branch, it falls out for free.) Code example from discussion at https://go-review.googlesource.com/c/go/+/169518