gsutil: gsutil cp -Z always force-adds `Cache-Control: no-transform` and `Content-Encoding: gzip`. Breaks the HTTP protocol

A curl client should be able to receive the uncompressed version, but GCS always returns `Content-Encoding: gzip`. This breaks the HTTP/1.1 protocol, since the client never sent an "Accept-Encoding: gzip, deflate, br" header.

$ gsutil -m -h "Cache-Control: public,max-age=31536000" cp -Z foo.txt gs://somebucket/foo.txt
Copying file://foo.txt [Content-Type=text/plain]...
- [1/1 files][   42.0 B/   12.0 B] 100% Done
Operation completed over 1 objects/12.0 B.

$ curl -v somebucket.io/foo.txt
> GET /foo1.txt HTTP/1.1
> User-Agent: curl/7.37.0
> Host: somebucket.io
> Accept: */*
> 
< HTTP/1.1 200 OK
< X-GUploader-UploadID: ...
< Date: Thu, 19 Oct 2017 18:04:05 GMT
< Expires: Fri, 19 Oct 2018 18:04:05 GMT
< Last-Modified: Thu, 19 Oct 2017 18:03:47 GMT
< ETag: "c35fdf2f0c2dcadc46333b0709c87e64"
< x-goog-generation: 1508436227151587
< x-goog-metageneration: 1
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 42
< Content-Type: text/plain
< Content-Encoding: gzip
< x-goog-hash: crc32c=V/9tDw==
< x-goog-hash: md5=w1/fLwwtytxGMzsHCch+ZA==
< x-goog-storage-class: MULTI_REGIONAL
< Accept-Ranges: bytes
< Content-Length: 42
< Access-Control-Allow-Origin: *
* Server UploadServer is not blacklisted
< Server: UploadServer
< Age: 2681
< Cache-Control: public,max-age=31536000,no-transform
< [binary gzip-compressed response body, rendered by curl as garbage bytes]
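
For contrast (not part of the original report), a client that does advertise gzip support gets a response it can handle - curl’s --compressed flag sends the Accept-Encoding header and transparently decompresses the body:

$ curl --compressed somebucket.io/foo.txt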

This seems to be happening here:

https://github.com/GoogleCloudPlatform/gsutil/blob/e8154bab37ad896b1e1ab01f452ac3284c7051d4/gslib/copy_helper.py#L1741-L1759
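
The forced metadata can also be confirmed directly on the stored object, without going through HTTP. This is a verification sketch, not part of the original report; the bucket/object names are the placeholders from above and the output is abridged:

$ gsutil stat gs://somebucket/foo.txt
gs://somebucket/foo.txt:
    ...
    Cache-Control:          public,max-age=31536000,no-transform
    Content-Encoding:       gzip
    ...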

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 1
  • Comments: 36 (12 by maintainers)

Most upvoted comments

@starsandskies thanks 😃

Yes, I see the problem on that nodejs-storage issue. I think, though, that it breaks down into two use cases:

  • using client libraries / gsutil to download files that have already been uploaded, where I can see decompressive transcoding is a problem for validating the checksums. I appreciate that’s probably blocked on a server-side fix.

  • using gsutil to upload files to a one-way bucket used for e.g. static website / asset hosting, where end clients are accessing over HTTP, so the checksum validation on download is not a problem but the forced override of cache headers at upload time is.

AFAICS the second use case was working without any problems until the gsutil behaviour was changed to fix the first case.

The key thing is that it’s obviously still valid to have gzipped files in the bucket with decompressive transcoding enabled - nothing stops you setting your own Cache-Control header after the initial upload. And that obviously fixes use case 2 but breaks use case 1. That being the case, I don’t think there’s any good reason why gsutil should silently prevent you from doing that in a single call, even if you want to keep the default behaviour as it is now.
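
In concrete terms, the two-step workaround described above looks like this (a sketch only; bucket and file names are placeholders, and note the report further down this thread that setmeta did not behave as expected for at least one user):

$ # upload gzipped; gsutil forces no-transform into Cache-Control
$ gsutil cp -Z foo.txt gs://somebucket/foo.txt
$ # overwrite Cache-Control afterwards to re-enable decompressive transcoding
$ gsutil setmeta -h "Cache-Control: public,max-age=31536000" gs://somebucket/foo.txt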

The Object Transcoding documentation and gsutil cp documentation should probably be modified to indicate that gsutil cp -z disables decompressive transcoding.

Thanks @dalbani - I think at this point the problem is well understood and we’re waiting for the Cloud Storage team to prioritize a fix (but to my knowledge it’s not currently prioritized).

@starsandskies thanks for the response - no, I can’t reopen; only core contributors/admins can reopen on GitHub.

I couldn’t see a branch / pull request relevant to the underlying behaviour that necessitates the -z behaviour - do you mean server-side on GCS as mentioned up the thread, or is there an issue / pull request open for that elsewhere that I could reference / add to?

I’m happy to make a new issue, though I think the issue description and first couple of comments here (e.g. https://github.com/GoogleCloudPlatform/gsutil/issues/480#issuecomment-338050378) capture the problem, and IMO there’s an advantage to keeping this issue alive since there are already people watching it.

But if you’d prefer a new issue I’ll open one and reference this.

I am seeing buggy behaviour too, where setmeta with Cache-Control overrides the gzip functionality:

gsutil cp -Z foo.min.js gs://cdn-bucket/foo.min.js

accept-ranges:bytes
access-control-allow-origin:*
alt-svc:clear
cache-control:no-transform <---- undesired
content-encoding:gzip <----- correct
content-language:en
content-length:7074
content-type:application/javascript
date:Wed, 21 Feb 2018 01:21:27 GMT

After gsutil setmeta -h "Cache-Control: public,max-age=31536000" gs://cdn-bucket/foo.min.js

accept-ranges:bytes
access-control-allow-origin:*
age:127807
alt-svc:clear
cache-control:public,max-age=31536000
content-language:en
content-length:31684 <------- No content encoding gzip :(
content-type:text/css
date:Mon, 19 Feb 2018 14:01:02 GMT
etag:"691cfcaa0eb97e1f3c7d4b1687b37834"
expires:Tue, 19 Feb 2019 14:01:02 GMT
last-modified:Tue, 24 Oct 2017 00:48:44 GMT
server:UploadServer
status:200

So @thobrla, it seems your recommendation of running setmeta afterwards does not work.
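
One way to tell whether the object has actually lost its gzip encoding, or whether GCS is now transcoding it on the fly, is to compare the stored metadata against a request that explicitly advertises gzip (a debugging sketch, not from the thread; the CDN hostname is a placeholder):

$ gsutil stat gs://cdn-bucket/foo.min.js    # shows the stored Content-Encoding
$ curl -sI -H "Accept-Encoding: gzip" https://cdn.example.com/foo.min.js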