go: cmd/compile: generic functions are significantly slower than identical non-generic functions in some cases

What version of Go are you using (go version)?

$ go version
go version go1.18beta1 windows/amd64

Does this issue reproduce with the latest release?

It reproduces on the 1.18 Beta 1 release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\argusdusty\AppData\Local\go-build
set GOENV=C:\Users\argusdusty\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\argusdusty\Code\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\argusdusty\Code
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.18beta1
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\argusdusty\Code\src\foo\go.mod
set GOWORK=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\ARGUSD~1\AppData\Local\Temp\go-build3090171098=/tmp/go-build -gno-record-gcc-switches

What did you do?

sort_test.go:

package sort_test

import (
	"sort"
	"testing"
)

func IsSortedGeneric[T sort.Interface](data T) bool {
	n := data.Len()
	for i := n - 1; i > 0; i-- {
		if data.Less(i, i-1) {
			return false
		}
	}
	return true
}

var data = sort.IntSlice{-10, -5, 0, 1, 2, 3, 5, 7, 11, 100, 100, 100, 1000, 10000}

func BenchmarkIsSorted(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sort.IsSorted(data)
	}
}

func BenchmarkIsSortedGeneric(b *testing.B) {
	for i := 0; i < b.N; i++ {
		IsSortedGeneric[sort.IntSlice](data)
	}
}

go test -bench=. -cpu=1 -benchmem sort_test.go

What did you expect to see?

IsSortedGeneric to be faster than, or at least very similar in runtime to, the otherwise identical sort.IsSorted.

What did you see instead?

goos: windows
goarch: amd64
cpu: AMD Ryzen 5 3600X 6-Core Processor
BenchmarkIsSorted               21008770                55.26 ns/op           24 B/op          1 allocs/op
BenchmarkIsSortedGeneric         2859728               421.3 ns/op           336 B/op         14 allocs/op
PASS
ok      command-line-arguments  3.003s

This appears to be a problem specifically when methods are part of the type constraint - and you may note above that the number of allocs (14) is exactly equal to the number of method calls (1 .Len, 13 .Less calls). Compiler output points to runtime.convTslice triggered at each method call as the cause of the allocs.

A generic IsSorted implementation that restricts to a slice of constraints.Ordered does not have this slowdown. Experimentation also finds that supplying a pointer type instead of a value does not have this issue:

type PtrIntSlice []int

func (x *PtrIntSlice) Len() int           { return len(*x) }
func (x *PtrIntSlice) Less(i, j int) bool { return (*x)[i] < (*x)[j] }
func (x *PtrIntSlice) Swap(i, j int)      { (*x)[i], (*x)[j] = (*x)[j], (*x)[i] }

var ptrdata = PtrIntSlice{-10, -5, 0, 1, 2, 3, 5, 7, 11, 100, 100, 100, 1000, 10000}

func BenchmarkIsSortedPtrGeneric(b *testing.B) {
	for i := 0; i < b.N; i++ {
		IsSortedGeneric[*PtrIntSlice](&ptrdata)
	}
}

BenchmarkIsSortedPtrGeneric 44432265 27.07 ns/op 0 B/op 0 allocs/op

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 22 (14 by maintainers)

Commits related to this issue

Most upvoted comments

Just wanted to share some benchmarks showing the improvement for the example at the top of this issue.

  • The first change in CL 378178 that was merged in January reduced the time by ~90% (compared to tip as of 4f6f68e, which was before any change for this issue).

  • The pending CL 385274 reduces the time by another ~20% (and also simplifies the implementation).

$ benchstat tip-4f6f68e cl-378178 cl-385274 

name \ time/op      tip-4f6f68e  cl-378178    cl-385274
IsSortedGeneric-32   461ns ± 1%    42ns ± 0%    34ns ± 0%
IsSorted-32         65.4ns ± 1%  65.3ns ± 1%  65.5ns ± 1%

name \ alloc/op     tip-4f6f68e  cl-378178    cl-385274
IsSortedGeneric-32    336B ± 0%      0B           0B     
IsSorted-32          24.0B ± 0%   24.0B ± 0%   24.0B ± 0%

name \ allocs/op    tip-4f6f68e  cl-378178    cl-385274
IsSortedGeneric-32    14.0 ± 0%     0.0          0.0     
IsSorted-32           1.00 ± 0%    1.00 ± 0%    1.00 ± 0%

Change https://golang.org/cl/378178 mentions this issue: cmd/compile: stop interface conversions for generic method calls from allocating

If you use -gcflags=all='-G=3 -d=unified=1', the generic version becomes much faster than the non-generic one.

gotip version
go version devel go1.18-0f05ed3b78 Tue Dec 14 20:25:03 2021 +0000 linux/amd64

gotip test -bench=. -cpu=1 -benchmem ./...

goos: linux
goarch: amd64
pkg: cool
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkIsSorted        	23439408	        51.01 ns/op	      24 B/op	       1 allocs/op
BenchmarkIsSortedGeneric 	 3412116	       341.3 ns/op	     336 B/op	      14 allocs/op

gotip test -gcflags=all='-G=3 -d=unified=1' -bench=. -cpu=1 -benchmem ./...

goos: linux
goarch: amd64
pkg: cool
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkIsSorted        	24290271	        49.23 ns/op	      24 B/op	       1 allocs/op
BenchmarkIsSortedGeneric 	232407577	         5.210 ns/op	       0 B/op	       0 allocs/op

I still do not know what the unified flag does. I saw it once on @mdempsky’s twitch livestream

@g-talbot The goal of generics is not to provide faster execution time. The goal is to make it easier to write certain kinds of code with less repetitive boilerplate and more compile-time type checking. Rewriting code to use generics may sometimes make it faster. It may sometimes make it slower (as in this example). While we will continue to improve the compiler balancing the needs of fast builds and fast runtimes, it will never be the case that generic code will always be faster than non-generic code. That is not a goal.

@komuw It was interesting to me, so I took a dive, and what I found was: as described here unified IR doesn’t currently do dictionary (shape like) approach for generics, and instead does stenciling (aka separate code for each type no matter memory representation). So resulting binary is indeed faster, but ALOT bigger. That also explains why there was no allocations with this flag.

Further reading: [dev.typeparams] all: add GOEXPERIMENT=unified knob https://github.com/golang/go/issues/46786 [go/dev.typeparams] [dev.typeparams] cmd/compile: unified IR construction

IsSortedGeneric[sort.IntSlice](data)
IsSortedGeneric[sort.IntSlice](data)

@randall77 But you do know the exact type. It’s passed right here in the brackets. The compiler should be able to deduce that it can short-circuit the type here and that only sort.IntSlice’s Less() implementation is being called. This is a real issue. What’s the point of generics if the compiler can’t short-circuit the interface machinery when it knows the precise types? You lose the performance benefit via inlining if you can’t do that.

@randall77 , why does that lead to 14 allocations per iteration, though? I would expect that to result in 1 additional allocation per iteration. Instead, we seem to be allocating for each element of the slice.

@randall77 But you do know the exact type. It’s passed right here in the brackets.

We don’t know it when compiling IsSortedGeneric. We don’t compile IsSortedGeneric[T] for every T that instantiates it. We compile it only for each distinct shape of the instantiating T. See https://github.com/golang/proposal/blob/master/design/generics-implementation-gcshape.md , and in particular the second bullet in the risks section.