go: archive/zip: Performance regression with reading ZIP files with many entries

What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/stanhu/.cache/go-build"
GOENV="/home/stanhu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/stanhu/.gvm/pkgsets/go1.17.1/global/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/stanhu/.gvm/pkgsets/go1.17.1/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/stanhu/.gvm/gos/go1.17.1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/stanhu/.gvm/gos/go1.17.1/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4184767189=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We use HTTP Range Requests to seek around an externally stored ZIP file to extract the directory listing. For example:

package main

import (
	"archive/zip"
	"fmt"
	"net/http"

	"github.com/jfbus/httprs"
)

func main() {
	url := "https://stanhu-s3-test-bucket.s3.us-west-1.amazonaws.com/test.zip"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}

	httpClient := &http.Client{}

	resp, err := httpClient.Do(req)
	if err != nil {
		panic(err)
	}
	if resp.StatusCode != http.StatusOK {
		// Don't panic with a nil err when only the status is wrong.
		panic(fmt.Errorf("unexpected status: %s", resp.Status))
	}

	rs := httprs.NewHttpReadSeeker(resp, httpClient)
	defer resp.Body.Close()
	defer rs.Close()

	archive, err := zip.NewReader(rs, resp.ContentLength)
	if err != nil {
		panic(err)
	}

	for _, entry := range archive.File {
		fmt.Println(entry.Name)
	}

	fmt.Printf("Range Requests: %d\n", rs.Requests)
}

What did you expect to see?

With Go v1.16.4:

# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/projects_json/pipelines_empty.json
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 2

real	0m0.277s
user	0m0.060s
sys	0m0.028s

What did you see instead?

# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 78

real	0m9.547s
user	0m0.239s
sys	0m0.154s

It appears the call to f.readDataDescriptor is causing these extra 76 HTTP Range Requests: https://github.com/golang/go/blob/ddb648fdf6c21e7e56a2252df3e3913a212ca4ab/src/archive/zip/reader.go#L120

This was added in ddb648fdf6c21e7e56a2252df3e3913a212ca4ab for https://github.com/golang/go/issues/34974.

Granted, not many people may use the library over HTTP Range Requests, but this usage is compatible with the io.ReaderAt interface. The problem caused a production incident: our Google Cloud Storage request rate jumped from 2,000/s to over 30,000/s (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5521).

The additional reads also slow down opening ZIP files, especially on a networked filesystem. The OpenRaw() interface doesn’t really work as a substitute because our HTTP client exposes an io.Reader, so we’d need to buffer the output to a file, which we’re trying to avoid.

@escholtz @ianlancetaylor I realize this parsing of the data descriptors might have been added to support the raw methods/ZIP64, but is there a way we can skip it when we’re only interested in the central directory header?

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 11
  • Comments: 16 (9 by maintainers)

Most upvoted comments

Would it be possible for your use case to read the file from GCS first and then open it from memory? Or does that violate your bandwidth and local storage requirements?

We really need to be operating from the HTTP range requests for the functionality impacted here, as it is directly related to our CI artifacts, as uploaded by GitLab Runner instances. There are many projects that generate rather large artifacts in public/private instances, and it would be best not to force the transit round trip when we really just need the metadata.