rclone: Google Cloud Storage: Can't download files with Content-Encoding: gzip

What is the problem you are having with rclone?

rclone is unable to download a file from Google Cloud Storage that has Content-Encoding: gzip, failing with a size mismatch (or an MD5 mismatch when copying to Azure).

What is your rclone version (output from rclone version)

Reproduced with both the Debian-packaged version:

rclone v1.41
- os/arch: linux/amd64
- go version: go1.10.1

and built from current git master:

rclone v1.44-001-g67703a73-beta
- os/arch: linux/amd64
- go version: go1.10.4

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Debian GNU/Linux 4.18.0 x86_64

Which cloud storage system are you using? (eg Google Drive)

Google Cloud Storage

The command you were trying to run (eg rclone copy /tmp remote:tmp)

echo 'Example content.' > file.txt
gsutil cp -Z file.txt gs://$bucket/file.txt.gz
rclone -vv copy $gcs_remote:$bucket/file.txt.gz file.txt.gz

A log from the command with the -vv flag (eg output from rclone -vv copy /tmp remote:tmp)

2018/10/15 15:46:26 DEBUG : rclone: Version "v1.44-001-g67703a73-beta" starting with parameters ["./rclone" "-vv" "copy" "gcs:9718a7ca-c0d4-41ac-a0dc-46922b9d541d/file.txt.gz" "file.txt.gz"]
2018/10/15 15:46:26 DEBUG : Using config file from "/home/kevin/.config/rclone/rclone.conf"
2018/10/15 15:46:27 DEBUG : file.txt.gz: Couldn't find file - need to transfer
2018/10/15 15:46:27 ERROR : file.txt.gz: corrupted on transfer: sizes differ 47 vs 17
2018/10/15 15:46:27 INFO  : file.txt.gz: Removing failed copy
2018/10/15 15:46:27 ERROR : Attempt 1/3 failed with 1 errors and: corrupted on transfer: sizes differ 47 vs 17
2018/10/15 15:46:27 DEBUG : file.txt.gz: Couldn't find file - need to transfer
2018/10/15 15:46:28 ERROR : file.txt.gz: corrupted on transfer: sizes differ 47 vs 17
2018/10/15 15:46:28 INFO  : file.txt.gz: Removing failed copy
2018/10/15 15:46:28 ERROR : Attempt 2/3 failed with 1 errors and: corrupted on transfer: sizes differ 47 vs 17
2018/10/15 15:46:28 DEBUG : file.txt.gz: Couldn't find file - need to transfer
2018/10/15 15:46:28 ERROR : file.txt.gz: corrupted on transfer: sizes differ 47 vs 17
2018/10/15 15:46:28 INFO  : file.txt.gz: Removing failed copy
2018/10/15 15:46:28 ERROR : Attempt 3/3 failed with 1 errors and: corrupted on transfer: sizes differ 47 vs 17
2018/10/15 15:46:28 Failed to copy: corrupted on transfer: sizes differ 47 vs 17

I’m guessing that the problem is that GCS reports Content-Length: 47 (the length of the gzip-encoded content), while rclone receives the 17 decompressed bytes because the Google client library transparently decompresses the download. (Note: the compressed content is actually larger than the original here due to gzip format overhead.) Perhaps a call to ReadCompressed(true) to disable decompression by the Google client library would be appropriate?
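
For illustration, a minimal sketch of disabling that decompression with the Go client library (cloud.google.com/go/storage); the bucket and object names are placeholders:

package main

import (
    "context"
    "io"
    "log"
    "os"

    "cloud.google.com/go/storage"
)

func main() {
    ctx := context.Background()
    client, err := storage.NewClient(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // ReadCompressed(true) asks for the stored bytes as-is, so a
    // gzip-uploaded object is returned compressed (47 bytes here) and
    // matches the reported Content-Length, instead of being
    // transparently decompressed to the 17 original bytes.
    obj := client.Bucket("my-bucket").Object("file.txt.gz")
    r, err := obj.ReadCompressed(true).NewReader(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer r.Close()

    if _, err := io.Copy(os.Stdout, r); err != nil {
        log.Fatal(err)
    }
}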

Thanks, Kevin

Most upvoted comments

I’ve merged this to master now, which means it will be in the latest beta in 15-30 minutes and released in v1.59

@panthony if you’d like to see this change for s3 and/or azureblob then please make a new issue - thank you.

Passing --header-download "Accept-Encoding: gzip" works around the error

That is a good workaround and is equivalent to the patch above.
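
For reference, applied to the repro command above the workaround looks like:

rclone -vv copy --header-download "Accept-Encoding: gzip" $gcs_remote:$bucket/file.txt.gz file.txt.gz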

I had another idea about this here

v1.56.0-beta.5387.42a7efc4a.fix-2658-gcs-gzip-unknown-size on branch fix-2658-gcs-gzip-unknown-size (uploaded in 15-30 mins)

This modifies the patch above: if a gzipped object is detected then rclone will

  • clear the hash
  • mark the object as unknown size

This would let them be downloaded decompressed.
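
As a self-contained sketch of that logic (illustrative names only, not rclone’s actual code):

package main

import "fmt"

// Object stands in for a remote object as a backend might model it.
type Object struct {
    bytes           int64  // size in bytes, -1 when unknown
    md5sum          string // hex MD5, "" when unavailable
    contentEncoding string
}

// markIfGzipped applies the approach above: if the object is stored
// gzip-encoded (and so will be decompressed on download), clear the
// hash and mark the size as unknown.
func markIfGzipped(o *Object) {
    if o.contentEncoding == "gzip" {
        o.md5sum = "" // the stored MD5 is of the compressed bytes
        o.bytes = -1  // -1 conventionally means "size unknown"
    }
}

func main() {
    o := &Object{bytes: 47, md5sum: "0123456789abcdef0123456789abcdef", contentEncoding: "gzip"}
    markIfGzipped(o)
    fmt.Println(o.bytes, o.md5sum) // prints -1 and an empty hash
}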

So there are two approaches:

  1. download the files uncompressed - md5sum not checked, but crc in gzip is checked
  2. download the files compressed - md5sum is checked

Do you think rclone should engage 1) automatically with 2) being an option?

That sounds great! I tested the beta and it successfully downloaded the compressed file and preserved the MD5. Excellent! Thank you!

😃

However, I noticed that it doesn’t preserve the Content-Encoding (or custom metadata) when copying from GCS to Azure. Should I open a separate issue for that, or would you like to discuss in this one?

There are two sorts of metadata: general-purpose key/value storage, and what I’ll call HTTP metadata

  • Access control metadata
  • Cache-Control
  • Content-Disposition
  • Content-Encoding
  • Content-Language
  • Content-Type

Rclone deals with Content-Type already but it doesn’t deal with the other kinds of metadata.

There is already an issue about custom metadata: #111. That is quite an old issue but I think it would be much easier to implement nowadays. You’ll see various other issues linked from there.

It would be nice if rclone could copy metadata from cloud to cloud, and also set it on upload. This would be a reasonably big project though!

The way it would work is that I’d give each Object an optional ReadMetadata interface, which the backend would supply. On upload this would be read and set on the object.

In either case, I will write up more details and a test case.

That would be useful - can you put it in #111? I’ll move that issue up into the run queue since I think its time has come 😃

I think the only thing I’m not sure about is how to represent the HTTP metadata and the non-HTTP metadata in a cross-cloud sort of way. Perhaps ReadMetadata should return two dictionaries, or one dictionary and one http.Header.

Thinking aloud, this could also subsume the Content-Type mechanism.
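
Purely as a hypothetical sketch of that shape (none of these names exist in rclone):

package main

import (
    "context"
    "fmt"
    "net/http"
)

// ReadMetadataer is the optional interface an Object could implement.
// It returns the general-purpose key/value metadata and the HTTP
// metadata (Cache-Control, Content-Encoding, ...) separately: one
// dictionary and one http.Header, as suggested above.
type ReadMetadataer interface {
    ReadMetadata(ctx context.Context) (custom map[string]string, httpMeta http.Header, err error)
}

// uploadWithMetadata shows how an upload path might consume it: read
// the metadata if the source object supports the interface, then set
// it on the destination object.
func uploadWithMetadata(ctx context.Context, src interface{}) error {
    m, ok := src.(ReadMetadataer)
    if !ok {
        return nil // source backend has no metadata to copy
    }
    custom, httpMeta, err := m.ReadMetadata(ctx)
    if err != nil {
        return err
    }
    // A real implementation would set these on the destination object.
    fmt.Println(custom, httpMeta.Get("Content-Encoding"))
    return nil
}

func main() {}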

This could also (with a flag) be implemented as attributes on local files.