go: compress/flate: deflatefast produces corrupted output

What version of Go are you using (go version)?

$ go version

go version go1.15 linux/amd64

Does this issue reproduce with the latest release?

I’m able to reproduce this in every release since go1.15, including the latest, go1.15.2.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

GO111MODULE="off"
GOARCH="amd64"
GOBIN="/usr/local/google/home/yekuang/infra/go/bin"
GOCACHE="/usr/local/google/home/yekuang/infra/go/.cache"
GOENV="/usr/local/google/home/yekuang/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/usr/local/google/home/yekuang/infra/go/.vendor/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/usr/local/google/home/yekuang/infra/go/.vendor:/usr/local/google/home/yekuang/infra/go"
GOPRIVATE=""
GOPROXY="off"
GOROOT="/usr/local/google/home/yekuang/golang/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/google/home/yekuang/golang/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build699870689=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We compressed the same data using zlib.NewWriterLevel(writer, zlib.BestSpeed) and found that the output differed between go1.15 and go1.14.2. This difference eventually led to data corruption when we uploaded the compressed file to GCS.

Here’s the minimal reproducible example I have:

package main

import (
	"bufio"
	"compress/zlib"
	"flag"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func main() {
	var filename string
	flag.StringVar(&filename, "f", "", "filename")
	flag.Parse()

	fi, err := os.Open(filename)
	if err != nil {
		panic(err)
	}
	defer fi.Close()

	outname := filepath.Base(filename) + "-gzip"
	fo, err := os.Create(outname)
	if err != nil {
		panic(err)
	}
	defer fo.Close()

	fmt.Printf("%s -> %s\n", filename, outname)
	const outBufSize = 1024 * 1024
	foWr := bufio.NewWriterSize(fo, outBufSize)
	compressor, err := zlib.NewWriterLevel(foWr, zlib.BestSpeed)
	if err != nil {
		panic(err)
	}

	buf := make([]byte, outBufSize*3)
	if _, err := io.CopyBuffer(compressor, fi, buf); err != nil {
		compressor.Close()
		panic(err)
	}
	// compressor needs to be closed first to flush the rest of the data
	// into the bufio.Writer
	if err := compressor.Close(); err != nil {
		panic(err)
	}
	if err := foWr.Flush(); err != nil {
		panic(err)
	}
}

The input data was too big to be shared (2.5G). But I can share the data internally (FYI, my LDAP is yekuang@).

What did you expect to see?

No difference in the compressed data between go1.14.2 and go1.15.

What did you see instead?

Compressed data were different.

Size of the compressed data:

  • go1.14.2: 728571269
  • go1.15: 728570333

I also did a cmp, and they started to differ at byte 363266597.

Let me know if you need more information, thanks!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 39 (26 by maintainers)

Most upvoted comments

It is unclear to me whether this issue is about the compressed output simply being different or that the compressed output is actually invalid (i.e., cannot be decompressed). Can you please clarify?

@egonelbre No, that cannot be correct. The value may not matter, but that will only fix 16383 out of 16384 cases; it will still result in a false hit when hash(0)&16383 matches the index.

The offset must be enough to invalidate it completely. I will see if I can figure out what is causing the false match.
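One reading of the false-hit argument can be sketched with a toy table. The constants, the hash function, and the tableEntry layout below are modelled on compress/flate's deflatefast.go from memory and should be treated as assumptions; valueCheckPasses is a hypothetical helper that reproduces only the value comparison, not the offset check of the real encoder.

```go
package main

import "fmt"

// Assumed to mirror deflatefast.go: a 14-bit table gives 16384 slots.
const (
	tableBits  = 14
	tableSize  = 1 << tableBits
	tableMask  = tableSize - 1
	tableShift = 32 - tableBits
)

type tableEntry struct {
	val    uint32 // the 4 bytes that were hashed into this slot
	offset int32  // position at which those bytes were seen
}

// hash maps a 4-byte value to a table slot (assumed form of the real hash).
func hash(u uint32) uint32 { return (u * 0x1e35a7bd) >> tableShift }

// valueCheckPasses reports whether the entry stored in cv's slot carries
// the same value as cv, which is one of the conditions for accepting a match.
func valueCheckPasses(table *[tableSize]tableEntry, cv uint32) bool {
	candidate := table[hash(cv)&tableMask]
	return candidate.val == cv
}

func main() {
	// Every slot is reset to tableEntry{}, i.e. val 0 and offset 0.
	var table [tableSize]tableEntry
	for _, cv := range []uint32{0, 1, 0x61626364} {
		fmt.Printf("cv=%#08x slot=%d value check passes: %v\n",
			cv, hash(cv)&tableMask, valueCheckPasses(&table, cv))
	}
}
```

In this sketch only cv = 0 passes the value check against a zeroed table, and it does so in the single slot hash(0)&16383. A zeroed entry in that slot therefore still has to be rejected by its offset, which is why the offset itself must be pushed far enough out of range.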

@klauspost should lines https://github.com/golang/go/blob/master/src/compress/flate/deflatefast.go#L296 read:

	for i := range e.table[:] {
		v := e.table[i].offset - e.cur + maxMatchOffset
		if v < 0 {
			e.table[i] = tableEntry{}
			continue
		}
		e.table[i].offset = v
	}

At least this seems to fix the tests for the smaller bufferReset value.

If the output is valid (e.g., can be decompressed), then this is working as expected. The compression libraries make no guarantees that the output remains stable for all time. While that guarantee can be useful in some contexts, it unfortunately means that we can never make changes to the compression algorithm either to improve the speed or to improve the compression ratio, both of which are properties that are generally considered more important than stability.