go: archive/zip: Performance regression with reading ZIP files with many entries
What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/stanhu/.cache/go-build"
GOENV="/home/stanhu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/stanhu/.gvm/pkgsets/go1.17.1/global/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/stanhu/.gvm/pkgsets/go1.17.1/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/stanhu/.gvm/gos/go1.17.1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/stanhu/.gvm/gos/go1.17.1/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4184767189=/tmp/go-build -gno-record-gcc-switches"
What did you do?
We use HTTP Range Requests to seek around an externally stored ZIP file to extract the directory listing. For example:
```go
package main

import (
	"archive/zip"
	"fmt"
	"net/http"

	"github.com/jfbus/httprs"
)

func main() {
	url := "https://stanhu-s3-test-bucket.s3.us-west-1.amazonaws.com/test.zip"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}

	httpClient := &http.Client{}
	resp, err := httpClient.Do(req)
	if err != nil {
		panic(err)
	}
	if resp.StatusCode != http.StatusOK {
		panic(fmt.Errorf("unexpected status: %s", resp.Status))
	}

	rs := httprs.NewHttpReadSeeker(resp, httpClient)
	defer resp.Body.Close()
	defer rs.Close()

	archive, err := zip.NewReader(rs, resp.ContentLength)
	if err != nil {
		panic(err)
	}

	for _, entry := range archive.File {
		fmt.Println(entry.Name)
	}

	fmt.Printf("Range Requests: %d\n", rs.Requests)
}
```
What did you expect to see?
With Go v1.16.4:
# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/projects_json/pipelines_empty.json
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 2
real 0m0.277s
user 0m0.060s
sys 0m0.028s
What did you see instead?
With Go v1.17.1:
# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 78
real 0m9.547s
user 0m0.239s
sys 0m0.154s
It appears the call to f.readDataDescriptor is causing these extra 76 HTTP Range Requests: https://github.com/golang/go/blob/ddb648fdf6c21e7e56a2252df3e3913a212ca4ab/src/archive/zip/reader.go#L120

This was added in ddb648fdf6c21e7e56a2252df3e3913a212ca4ab for https://github.com/golang/go/issues/34974.
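The extra reads are also straightforward to observe locally, without HTTP. Below is a sketch (countingReaderAt is an illustrative wrapper I made up, and "test.zip" is a placeholder for any archive with many entries) that counts the ReadAt calls zip.NewReader makes; each of those roughly corresponds to one Range request on an HTTP-backed reader:

```go
package main

import (
	"archive/zip"
	"fmt"
	"log"
	"os"
)

// countingReaderAt wraps an *os.File and counts ReadAt calls, which
// approximates the number of Range requests an HTTP-backed io.ReaderAt
// would issue for the same access pattern.
type countingReaderAt struct {
	r     *os.File
	calls int
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.calls++
	return c.r.ReadAt(p, off)
}

func main() {
	// "test.zip" is a placeholder; use any archive with many entries.
	f, err := os.Open("test.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	cr := &countingReaderAt{r: f}
	if _, err := zip.NewReader(cr, fi.Size()); err != nil {
		log.Fatal(err)
	}

	// Under go1.16 this count stays small; under go1.17 it should grow
	// with the number of entries that carry a data descriptor, matching
	// the jump from 2 to 78 Range requests shown above.
	fmt.Printf("ReadAt calls: %d\n", cr.calls)
}
```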
Granted, not many people may use the library with HTTP Range Requests, but this usage is compatible with the interface. This problem caused a production incident: our Google Cloud Storage request rate jumped from 2,000/s to over 30,000/s (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5521).
The additional reads also slow down opening ZIP files, especially on a networked filesystem. The OpenRaw() interface doesn't really work as a substitute because the HTTP client uses io.Reader, so we'd need to buffer the output to a file, which we're trying to avoid.
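For completeness, the buffering workaround we're trying to avoid would look roughly like the sketch below (openViaTempFile and the URL are hypothetical, not our production code); it defeats the purpose of Range requests because the whole archive transits the network:

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"net/http"
	"os"
)

// openViaTempFile buffers the entire HTTP response body into a temporary
// file so it can serve as the io.ReaderAt that zip.NewReader requires.
// This is the workaround we want to avoid: the whole archive is
// downloaded even though only the central directory is of interest.
func openViaTempFile(url string) (*zip.Reader, func(), error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	tmp, err := os.CreateTemp("", "archive-*.zip")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() {
		tmp.Close()
		os.Remove(tmp.Name())
	}

	// Full download: the expensive step we are trying to eliminate.
	size, err := io.Copy(tmp, resp.Body)
	if err != nil {
		cleanup()
		return nil, nil, err
	}

	r, err := zip.NewReader(tmp, size)
	if err != nil {
		cleanup()
		return nil, nil, err
	}
	return r, cleanup, nil
}

func main() {
	// Placeholder URL for illustration.
	r, cleanup, err := openViaTempFile("https://example.com/test.zip")
	if err != nil {
		panic(err)
	}
	defer cleanup()
	for _, entry := range r.File {
		fmt.Println(entry.Name)
	}
}
```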
@escholtz @ianlancetaylor I realize this parsing of the data descriptors might have been added to support the raw methods/ZIP64, but is there a way we can skip it, since we're primarily interested in extracting the central directory header?
About this issue
- State: closed
- Created 3 years ago
- Reactions: 11
- Comments: 16 (9 by maintainers)
Links to this issue
Commits related to this issue
- archive/zip: lazy load file data descriptor Fixes #48374 Signed-off-by: Vinicius Tinti <viniciustinti@gmail.com> — committed to tinti/go by tinti 3 years ago
- archive/zip: load file data descriptor on demand Read file data descriptor only when the file is open. Previously, during zip.Reader.init, each new file had its data descriptor read during the file ... — committed to tinti/go by tinti 3 years ago
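Per the commit messages above, the fix defers the data descriptor read until the file is actually opened. A minimal sketch of that lazy-loading pattern (hypothetical types for illustration, not the actual patch):

```go
package main

import (
	"fmt"
	"sync"
)

// file mimics the relevant shape of zip.File for illustration only.
// The data descriptor is read lazily, on first Open, instead of eagerly
// in the reader's init loop.
type file struct {
	name     string
	descOnce sync.Once
	descErr  error
}

// readDataDescriptor stands in for the real seek-and-parse step that
// costs one Range request on an HTTP-backed reader.
func (f *file) readDataDescriptor() error {
	fmt.Printf("reading data descriptor for %s\n", f.name)
	return nil
}

// Open triggers the descriptor read at most once, so merely listing
// entry names never touches the per-entry offsets at all.
func (f *file) Open() error {
	f.descOnce.Do(func() {
		f.descErr = f.readDataDescriptor()
	})
	return f.descErr
}

func main() {
	files := []*file{{name: "a.txt"}, {name: "b.txt"}}
	// Listing names is now free of per-entry reads...
	for _, f := range files {
		fmt.Println(f.name)
	}
	// ...and the descriptor is only read when a file is opened.
	if err := files[0].Open(); err != nil {
		fmt.Println(err)
	}
}
```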
We really need to be able to operate via HTTP Range Requests for the functionality impacted here, as it is directly related to our CI artifacts uploaded by GitLab Runner instances. Many projects on public and private instances generate rather large artifacts, and it would be best not to force a full transfer of the archive over the network when we really just need the metadata.