go: strings: 10-30% speed regression in Contains from 1.13 to tip

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64
go version devel +8cf5293c Tue Nov 19 06:10:03 2019 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jake/.cache/go-build"
GOENV="/home/jake/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/jake/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/jake/testproj/strtest/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build998582668=/tmp/go-build -gno-record-gcc-switches"

What did you do?

package strtest

import (
	"strconv"
	"strings"
	"testing"
)

var sink1 bool // package-level sink so the Contains result isn't optimized away

func BenchmarkContains(b *testing.B) {
	tests := []string{
		"This is just a cool test.",
		"This is just a cool test. (Or is it?)",
		"Hello, (_USER_)!",
		"RIP count reset to (_VARS_RIP-(_GAME_CLEAN_)_SET_0_), neat.",
		"(_USER_) rubs on (_PARAMETER_) 's booty! PogChamp / (_(_|",
	}

	for i, test := range tests {
		b.Run(strconv.Itoa(i), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink1 = strings.Contains(test, "(_")
			}
		})
	}
}

Benchmark this against 1.13 and tip.

What did you expect to see?

No difference in performance (or hopefully better).

What did you see instead?

Contains is consistently slower. Comparing the two with benchstat, run via go test -run=- -bench . -cpu=1,4 -count=10 .:

name          old time/op  new time/op   delta
Contains/0    10.1ns ± 2%   11.4ns ± 1%  +12.94%  (p=0.000 n=10+10)
Contains/0-4  10.1ns ± 0%   11.4ns ± 2%  +12.57%  (p=0.000 n=7+10)
Contains/1    12.1ns ± 1%   13.0ns ± 1%   +7.56%  (p=0.000 n=9+9)
Contains/1-4  12.1ns ± 2%   12.8ns ± 2%   +5.48%  (p=0.000 n=8+10)
Contains/2    7.70ns ± 3%   9.82ns ± 6%  +27.52%  (p=0.000 n=10+10)
Contains/2-4  7.82ns ± 3%   9.64ns ± 2%  +23.25%  (p=0.000 n=9+10)
Contains/3    9.31ns ± 1%  11.01ns ± 1%  +18.22%  (p=0.000 n=9+10)
Contains/3-4  9.45ns ± 3%  11.11ns ± 2%  +17.56%  (p=0.000 n=9+9)
Contains/4    7.61ns ± 1%   9.87ns ± 1%  +29.60%  (p=0.000 n=8+9)
Contains/4-4  7.81ns ± 3%   9.73ns ± 3%  +24.69%  (p=0.000 n=10+9)

This came out of some performance testing between 1.13 and tip in a project I’m working on, which shows a consistent 6% regression for a hand-written parser (https://github.com/hortbot/hortbot/blob/master/internal/cbp/cbp_test.go); this was the first thing I noticed while comparing the profiles. The first thing the parser does is check for one of two tokens before doing any work, on the assumption that if neither is present, there’s no work to be done.
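For context, the early-exit pattern in question looks roughly like this (a hypothetical sketch, not the actual cbp code; parse and the token choices are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// parse sketches the early-exit check described above: if the opening
// token never appears, the input is returned unchanged without running
// the full parser, so strings.Contains sits directly on the hot path.
func parse(s string) (string, bool) {
	if !strings.Contains(s, "(_") {
		return s, false // no tokens, nothing to expand
	}
	// ... full token parse would happen here ...
	return s, true
}

func main() {
	fmt.Println(parse("plain message"))    // plain message false
	fmt.Println(parse("Hello, (_USER_)!")) // Hello, (_USER_)! true
}
```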

In the grand scheme of things, a few extra nanoseconds may not be a big deal, but I haven’t yet tested other functions in the strings package.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 21 (20 by maintainers)

Most upvoted comments

What CPU does your dev machine run on, and at what CL are you building tip?

Any CL can potentially cause the benchmark code to align differently, in which case you end up benchmarking the effects of branch alignment. I have seen in the past that where the benchmark loop is placed in the file can matter a lot, since it changes the alignment.

With all the side channel attacks on caching/branch prediction there can also be benchmark differences due to different microcode versions of the same CPU: https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode&num=1