go: runtime: make([]byte, n) becomes much slower compared with go 1.11.1
What version of Go are you using (go version)?
go version devel +2e9f081 Tue Oct 30 04:39:53 2018 +0000 linux/amd64
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/lni/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/lni/golang_ws"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build682632202=/tmp/go-build -gno-record-gcc-switches"
What did you do?
I tried the devel version of Go (both devel +80b8377 and devel +2e9f081) and my Go program became slower. When checking the benchmarks, I noticed that make([]byte, n) called in parallel is much slower compared to go 1.11.1:
package makebenchmark

import (
	"testing"
)

func benchmarkMakeByteSliceInParalell(b *testing.B, sz int) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			m := make([]byte, sz)
			if len(m) != sz {
				b.Errorf("unexpected len")
			}
		}
	})
}

func BenchmarkMakeByteSliceInParalell512(b *testing.B) {
	benchmarkMakeByteSliceInParalell(b, 512)
}

func BenchmarkMakeByteSliceInParalell8(b *testing.B) {
	benchmarkMakeByteSliceInParalell(b, 8)
}
go 1.11.1:
go test -count=1 -v -benchmem -run ^$ -bench=BenchmarkMakeByteSliceInParalell
goos: linux
goarch: amd64
pkg: github.com/lni/goplayground/makebenchmark
BenchmarkMakeByteSliceInParalell512-40 10000000 239 ns/op 512 B/op 1 allocs/op
BenchmarkMakeByteSliceInParalell8-40 300000000 4.08 ns/op 8 B/op 1 allocs/op
PASS
go version
go version go1.11.1 linux/amd64
Both devel +2e9f081 and devel +80b8377 reported similar results:
/home/lni/src/go/bin/go test -count=1 -v -benchmem -run ^$ -bench=BenchmarkMakeByteSliceInParalell
goos: linux
goarch: amd64
pkg: github.com/lni/goplayground/makebenchmark
BenchmarkMakeByteSliceInParalell512-40 3000000 564 ns/op 512 B/op 1 allocs/op
BenchmarkMakeByteSliceInParalell8-40 300000000 5.09 ns/op 8 B/op 1 allocs/op
PASS
/home/lni/src/go/bin/go version
go version devel +2e9f081 Tue Oct 30 04:39:53 2018 +0000 linux/amd64
What did you expect to see?
make([]byte, n) to have similar ns/op to go 1.11.1
What did you see instead?
make([]byte, n) is much slower
About this issue
- State: closed
- Created 6 years ago
- Reactions: 7
- Comments: 22 (13 by maintainers)
Commits related to this issue
- runtime: add iterator abstraction for mTreap This change adds the treapIter type which provides an iterator abstraction for walking over an mTreap. In particular, the mTreap type now has iter() and r... — committed to golang/go by mknyszek 6 years ago
- runtime: make mTreap iterator bidirectional This change makes mTreap's iterator type, treapIter, bidirectional instead of unidirectional. This change helps support moving the find operation on a trea... — committed to golang/go by mknyszek 5 years ago
@lni do you mind trying out this patch with your application/library? (https://go-review.googlesource.com/c/go/+/151538)
The original numbers are a little harder to reproduce now as it seems other runtime/compiler improvements are hiding the original performance regression a little bit, but in some preliminary benchmarking it looked like the change above helped with your original benchmark.
I still need to start some longer runs for better statistical significance so I’ll update here with those soon.
As reusee@ sleuthed out, it's probably my commit (07e738e). Thank you everyone for looking into this.

I took some time to dig in and understand exactly what was going on. This benchmark exercises the case where we're allocating size-segregated spans with object size 512, which we then free, and re-allocate from. At first, I thought that the regression might have been the result of span allocation having to traverse the treap, instead of walking over a fixed-size array. But CPU profiling shows that it's more likely a result of freeing spans.

Before my change, freeing these size-segregated spans was a very fast operation: indexing into an array followed by a linked-list insertion at head. Now, one needs to traverse a treap, allocate a treap node out of a SLAB, and rebalance it. While all of these operations should be reasonably fast, it's definitely more work than what was being done before. I think the regression in performance we see as a result of a higher GOMAXPROCS arises out of the fact that the heap lock is held when a span is freed, and my change causes this benchmark to spend more time holding the heap lock.
So, for a microbenchmark like this which really hammers on the GC and the allocator, it makes sense to me why my change had such an impact. Running other benchmarks (such as the go1 benchmarks, or the garbage benchmark, whose numbers are in my change’s commit message) the actual performance hit isn’t quite so dramatic. I quickly ran github.com/dr2chase/bent (which runs a number of benchmarks from actual Go programs in the wild) with my change and without it and I haven’t yet seen any significant change there either. However, I will run those benchmarks with more iterations and report back with actual numbers.
The motivation behind my change was that it really simplified a number of changes I just landed today for #14045. I’m erring on the side that it was worth it, as it seems as though in practice the performance impact of my change appears to be relatively small.
Bisected to https://github.com/golang/go/commit/07e738ec32025da458cdf968e4f991972471e6e9