go: runtime: unaligned jumps causing performance regression on Intel

What version of Go are you using (`go version`)?

λ go version
go version go1.13 windows/amd64

And Go 1.14-RC1.

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

go env Output

λ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\klaus\AppData\Local\go-build
set GOENV=C:\Users\klaus\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=c:\gopath
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=c:\temp\wintemp\go-build453787042=/tmp/go-build -gno-record-gcc-switches

What did you do?

Isolated code: reproducer.zip

go test -bench=. -test.benchtime=10s

Most of the code is needed for the test setup, only (*tokens).EstimatedBits and mFastLog2 is run during the benchmark.

λ benchcmp go113.txt go114.txt
benchmark                             old ns/op     new ns/op     delta
Benchmark_tokens_EstimatedBits-12     663           716           +7.99%

benchmark                             old MB/s     new MB/s     speedup
Benchmark_tokens_EstimatedBits-12     1.51         1.40         0.93x

What did you expect to see?

Equivalent performance.

What did you see instead?

8% performance regression.

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 20 (19 by maintainers)

Commits related to this issue

Added klauspost https://github.com/golang/go/issues/37121 — committed to dr2chase/benchmarks by dr2chase 4 years ago

Most upvoted comments

The remaining performance issue is related to unaligned jumps. We will not be making changes to our jumps before the 1.14 release, but likely will address it in a point release.

I’ll remove the release-blocker label, and update the milestone to 1.15. Please comment if I am mistaken.

toothrot on Feb 20, 2020

For convenience, here’s the benchmark (without testdata and other files in .zip):

package flate

import (
	"io/ioutil"
	"testing"
)

func Benchmark_tokens_EstimatedBits(b *testing.B) {
	tok := loadTestTokens(b)
	b.ResetTimer()
	// One "byte", one token iteration.
	b.SetBytes(1)
	for i := 0; i < b.N; i++ {
		_ = tok.EstimatedBits()
	}
}

// ...

token.go (with tokens.EstimatedBits code)

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package flate

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"math"
)


// EstimatedBits will return an minimum size estimated by an *optimal*
// compression of the block.
// The size of the block
func (t *tokens) EstimatedBits() int {
	shannon := float32(0)
	bits := int(0)
	nMatches := 0
	if t.nLits > 0 {
		invTotal := 1.0 / float32(t.nLits)
		for _, v := range t.litHist[:] {
			if v > 0 {
				n := float32(v)
				shannon += -mFastLog2(n*invTotal) * n
			}
		}
		// Just add 15 for EOB
		shannon += 15
		for i, v := range t.extraHist[1 : literalCount-256] {
			if v > 0 {
				n := float32(v)
				shannon += -mFastLog2(n*invTotal) * n
				bits += int(lengthExtraBits[i&31]) * int(v)
				nMatches += int(v)
			}
		}
	}
	if nMatches > 0 {
		invTotal := 1.0 / float32(nMatches)
		for i, v := range t.offHist[:offsetCodeCount] {
			if v > 0 {
				n := float32(v)
				shannon += -mFastLog2(n*invTotal) * n
				bits += int(offsetExtraBits[i&31]) * int(v)
			}
		}
	}
	return int(shannon) + bits
}

// from https://stackoverflow.com/a/28730362
func mFastLog2(val float32) float32 {
	ux := int32(math.Float32bits(val))
	log2 := (float32)(((ux >> 23) & 255) - 128)
	ux &= -0x7f800001
	ux += 127 << 23
	uval := math.Float32frombits(uint32(ux))
	log2 += ((-0.34484843)*uval+2.02466578)*uval - 0.67487759
	return log2
}

const (
	lengthShift = 22
	offsetMask  = 1<<lengthShift - 1
	typeMask    = 3 << 30
	matchType   = 1 << 30
	literalCount = 286
	maxMatchLength   = 258 // The longest match for the compressor
	maxStoreBlockSize   = 65535
	debugDecode = false
	baseMatchOffset = 1              // The smallest match offset
	baseMatchLength = 3              // The smallest match length per the RFC section 3.2.5
	maxMatchOffset  = 1 << 15        // The largest match offset
	offsetCodeCount = 30
	endBlockMarker = 256
)


// The length code for length X (MIN_MATCH_LENGTH <= X <= MAX_MATCH_LENGTH)
// is lengthCodes[length - MIN_MATCH_LENGTH]
var lengthCodes = [256]uint8{
	0, 1, 2, 3, 4, 5, 6, 7, 8, 8,
	9, 9, 10, 10, 11, 11, 12, 12, 12, 12,
	13, 13, 13, 13, 14, 14, 14, 14, 15, 15,
	15, 15, 16, 16, 16, 16, 16, 16, 16, 16,
	17, 17, 17, 17, 17, 17, 17, 17, 18, 18,
	18, 18, 18, 18, 18, 18, 19, 19, 19, 19,
	19, 19, 19, 19, 20, 20, 20, 20, 20, 20,
	20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
	21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
	21, 21, 21, 21, 21, 21, 22, 22, 22, 22,
	22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
	22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
	23, 23, 23, 23, 23, 23, 23, 23, 24, 24,
	24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
	24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
	24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
	25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
	25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
	25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
	25, 25, 26, 26, 26, 26, 26, 26, 26, 26,
	26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
	26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
	26, 26, 26, 26, 27, 27, 27, 27, 27, 27,
	27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
	27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
	27, 27, 27, 27, 27, 28,
}

// lengthCodes1 is length codes, but starting at 1.
var lengthCodes1 = [256]uint8{
	1, 2, 3, 4, 5, 6, 7, 8, 9, 9,
	10, 10, 11, 11, 12, 12, 13, 13, 13, 13,
	14, 14, 14, 14, 15, 15, 15, 15, 16, 16,
	16, 16, 17, 17, 17, 17, 17, 17, 17, 17,
	18, 18, 18, 18, 18, 18, 18, 18, 19, 19,
	19, 19, 19, 19, 19, 19, 20, 20, 20, 20,
	20, 20, 20, 20, 21, 21, 21, 21, 21, 21,
	21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
	22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
	22, 22, 22, 22, 22, 22, 23, 23, 23, 23,
	23, 23, 23, 23, 23, 23, 23, 23, 23, 23,
	23, 23, 24, 24, 24, 24, 24, 24, 24, 24,
	24, 24, 24, 24, 24, 24, 24, 24, 25, 25,
	25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
	25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
	25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
	26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
	26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
	26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
	26, 26, 27, 27, 27, 27, 27, 27, 27, 27,
	27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
	27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
	27, 27, 27, 27, 28, 28, 28, 28, 28, 28,
	28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
	28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
	28, 28, 28, 28, 28, 29,
}

var offsetCodes = [256]uint32{
	0, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7,
	8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9,
	10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
	11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
	12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
	12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
	13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
	13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
	14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
	14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
	14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
	14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
	15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
	15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
	15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
	15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
}

// offsetCodes14 are offsetCodes, but with 14 added.
var offsetCodes14 = [256]uint32{
	14, 15, 16, 17, 18, 18, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21,
	22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
	24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
	25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
	26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
	26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
	27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
	27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
	28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
	28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
	28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
	28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
	29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
	29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
	29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
	29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
}


// The number of extra bits needed by length code X - LENGTH_CODES_START.
var lengthExtraBits = [32]int8{
	/* 257 */ 0, 0, 0,
	/* 260 */ 0, 0, 0, 0, 0, 1, 1, 1, 1, 2,
	/* 270 */ 2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
	/* 280 */ 4, 5, 5, 5, 5, 0,
}


// offset code word extra bits.
var offsetExtraBits = [64]int8{
	0, 0, 0, 0, 1, 1, 2, 2, 3, 3,
	4, 4, 5, 5, 6, 6, 7, 7, 8, 8,
	9, 9, 10, 10, 11, 11, 12, 12, 13, 13,
	/* extended window */
	14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20,
}

type token uint32

type tokens struct {
	nLits     int
	extraHist [32]uint16  // codes 256->maxnumlit
	offHist   [32]uint16  // offset codes
	litHist   [256]uint16 // codes 0->255
	n         uint16      // Must be able to contain maxStoreBlockSize
	tokens    [maxStoreBlockSize + 1]token
}

func (t *tokens) Reset() {
	if t.n == 0 {
		return
	}
	t.n = 0
	t.nLits = 0
	for i := range t.litHist[:] {
		t.litHist[i] = 0
	}
	for i := range t.extraHist[:] {
		t.extraHist[i] = 0
	}
	for i := range t.offHist[:] {
		t.offHist[i] = 0
	}
}

func (t *tokens) indexTokens(in []token) {
	t.Reset()
	for _, tok := range in {
		if tok < matchType {
			t.AddLiteral(tok.literal())
			continue
		}
		t.AddMatch(uint32(tok.length()), tok.offset())
	}
}


func (t *tokens) AddLiteral(lit byte) {
	t.tokens[t.n] = token(lit)
	t.litHist[lit]++
	t.n++
	t.nLits++
}

// AddMatch adds a match to the tokens.
// This function is very sensitive to inlining and right on the border.
func (t *tokens) AddMatch(xlength uint32, xoffset uint32) {
	if debugDecode {
		if xlength >= maxMatchLength+baseMatchLength {
			panic(fmt.Errorf("invalid length: %v", xlength))
		}
		if xoffset >= maxMatchOffset+baseMatchOffset {
			panic(fmt.Errorf("invalid offset: %v", xoffset))
		}
	}
	t.nLits++
	lengthCode := lengthCodes1[uint8(xlength)] & 31
	t.tokens[t.n] = token(matchType | xlength<<lengthShift | xoffset)
	t.extraHist[lengthCode]++
	t.offHist[offsetCode(xoffset)&31]++
	t.n++
}

// AddMatchLong adds a match to the tokens, potentially longer than max match length.
// Length should NOT have the base subtracted, only offset should.
func (t *tokens) AddMatchLong(xlength int32, xoffset uint32) {
	if debugDecode {
		if xoffset >= maxMatchOffset+baseMatchOffset {
			panic(fmt.Errorf("invalid offset: %v", xoffset))
		}
	}
	oc := offsetCode(xoffset) & 31
	for xlength > 0 {
		xl := xlength
		if xl > 258 {
			// We need to have at least baseMatchLength left over for next loop.
			xl = 258 - baseMatchLength
		}
		xlength -= xl
		xl -= 3
		t.nLits++
		lengthCode := lengthCodes1[uint8(xl)] & 31
		t.tokens[t.n] = token(matchType | uint32(xl)<<lengthShift | xoffset)
		t.extraHist[lengthCode]++
		t.offHist[oc]++
		t.n++
	}
}

func (t *tokens) AddEOB() {
	t.tokens[t.n] = token(endBlockMarker)
	t.extraHist[0]++
	t.n++
}

func (t *tokens) Slice() []token {
	return t.tokens[:t.n]
}

// VarInt returns the tokens as varint encoded bytes.
func (t *tokens) VarInt() []byte {
	var b = make([]byte, binary.MaxVarintLen32*int(t.n))
	var off int
	for _, v := range t.tokens[:t.n] {
		off += binary.PutUvarint(b[off:], uint64(v))
	}
	return b[:off]
}

// FromVarInt restores t to the varint encoded tokens provided.
// Any data in t is removed.
func (t *tokens) FromVarInt(b []byte) error {
	var buf = bytes.NewReader(b)
	var toks []token
	for {
		r, err := binary.ReadUvarint(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		toks = append(toks, token(r))
	}
	t.indexTokens(toks)
	return nil
}

// Returns the type of a token
func (t token) typ() uint32 { return uint32(t) & typeMask }

// Returns the literal of a literal token
func (t token) literal() uint8 { return uint8(t) }

// Returns the extra offset of a match token
func (t token) offset() uint32 { return uint32(t) & offsetMask }

func (t token) length() uint8 { return uint8(t >> lengthShift) }

// Returns the offset code corresponding to a specific offset
func offsetCode(off uint32) uint32 {
	if false {
		if off < uint32(len(offsetCodes)) {
			return offsetCodes[off&255]
		} else if off>>7 < uint32(len(offsetCodes)) {
			return offsetCodes[(off>>7)&255] + 14
		} else {
			return offsetCodes[(off>>14)&255] + 28
		}
	}
	if off < uint32(len(offsetCodes)) {
		return offsetCodes[uint8(off)]
	}
	return offsetCodes14[uint8(off>>7)]
}

/cc @aclements @randall77 @ianlancetaylor @mknyszek

dmitshur on Feb 7, 2020

Okay, now it’s closed. There’s a 0.3% slowdown (2 parts in 673), which is close enough.

benchstat -geomean go1.{13,15}.log
name \ time/op           go1.13         go1.15         delta
_tokens_EstimatedBits-4     673ns ± 2%     675ns ± 2%   ~     (p=0.166 n=24+24)

dr2chase on Jun 16, 2020

Oops, no, because of a finger-fumble that is not correct. Shut down Chrome and the IDE, try again.

dr2chase on Jun 16, 2020

It might be worthwhile quickly eliminating #35881 as the source of the regression. Is your CPU listed in the description of #35881? Are you running a recent version of the microcode? If so, you could try

Applying CL 206837 locally
export GOAMD64=“alignedjumps”
rebuilding your Go tool chain
executing $GOSRCREPO/bin/go env to check GOAMD64 is set correctly.
re-running your benchmark with the new tool chain.

markdryan on Feb 10, 2020

I get a startling regression on a Mac laptop – 70% slowdown – will look further to see if I am making some obvious mistake.

dr2chase on Feb 7, 2020