etcher: .gz doesn't return correct file size for content above 2^32 B = 4 GB

  • 1.0.0-beta13
  • Linux 64bit

Split out from #629: apparently the gzip file format cannot accurately report the uncompressed size of files above 4 GB (2^32 bytes); it only returns the size modulo 2^32.
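For reference, the size gzip "returns" is just the trailing 4-byte ISIZE field of the archive, which RFC 1952 defines as the uncompressed size modulo 2^32. A minimal Node.js sketch of reading it (not Etcher's actual code, and it assumes a single-member archive):

```ts
// Sketch only: read the trailing ISIZE field of a .gz file.
// RFC 1952 stores it as the uncompressed size modulo 2^32 (little-endian),
// so any image above 4 GB wraps around.
import * as fs from 'fs';

function gzipReportedSize(path: string): number {
  const fd = fs.openSync(path, 'r');
  try {
    const { size } = fs.fstatSync(fd);
    const isize = Buffer.alloc(4);
    // ISIZE is the last 4 bytes of the (single-member) archive.
    fs.readSync(fd, isize, 0, 4, size - 4);
    return isize.readUInt32LE(0);
  } finally {
    fs.closeSync(fd);
  }
}
```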

On the command line, people recommend something like zcat file.gz | wc -c or gzip -dc file | wc -c, which gives the correct value, though it means decompressing the file twice. We might have to do that for gzip in the end, since >4GB files are likely common for Etcher's use case.

In the worst case, this might let images start to be burned onto cards that are too small; it also affects the progress bar.

From testing with a 4100 MiB (> 4096 MiB) image, the .gz version indeed lets a 512 MB SD card be selected, while the .xz archive of the same file does not. For the progress bar, the MB/s reading is affected (it shows a very low speed, e.g. 0.01 MB/s) but the progress percentage is not (it displays correctly during the burn), so it's not too bad.
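That matches the wrap-around arithmetic: 4100 MiB = 4,299,161,600 bytes, and 4,299,161,600 mod 2^32 = 4,194,304 bytes = 4 MiB, so the .gz header effectively claims a ~4 MiB image, which of course "fits" on a 512 MB card.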

About this issue

  • State: open
  • Created 8 years ago
  • Comments: 42 (38 by maintainers)

Most upvoted comments

I know. So you could say gzip is not the best choice for files > 4 GB? I prefer xz over gzip.

Beyond all this though, we should have a heuristic that basically says this:

  • gzip compresses images within a certain range (e.g. 1.5x to 3x)
  • if an image claims to be far outside that range (e.g. it says it's 300 MB but the archive is 2.5 GB), we should assume it's actually ~4.3 GB instead. Essentially, we should add 2^32 bytes to the estimated size again and again until the compression ratio falls within a realistic range.

I think an algorithm like this, used only for gzip files (maybe bzip too?), should fix the vast majority of the cases. We should still fail well when we’re wrong, but we should try hard to be right 😃
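For illustration, a rough sketch of that heuristic (hypothetical names and thresholds, not Etcher's actual code):

```ts
// Hypothetical sketch of the heuristic above: keep adding 2^32 bytes to the
// gzip-reported size until the implied compression ratio looks realistic.
// The ratio bound is an assumption, not a measured value.
const WRAP = Math.pow(2, 32);
const MIN_RATIO = 1.5; // assumed lower bound for gzip on disk images

function estimateUncompressedSize(reportedSize: number, archiveSize: number): number {
  let estimate = reportedSize;
  // A claimed size far below archiveSize * MIN_RATIO implies the 32-bit
  // counter wrapped at least once, so unwrap until the ratio is plausible.
  while (estimate / archiveSize < MIN_RATIO) {
    estimate += WRAP;
  }
  return estimate;
}

// Example from the comment above: reported 300 MB, archive 2.5 GB
// -> one wrap gives 300 MB + 4.29 GB ≈ 4.6 GB, ratio ≈ 1.8, within range.
```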

Alexandros Marinos

Founder & CEO, Resin.io

+1 206-637-5498

@alexandrosm

On Fri, Mar 3, 2017 at 3:51 PM, Juan Cruz Viotti notifications@github.com wrote:

Yeah, I don’t know. Maybe it also depends on the image itself? I’m a compression noob, so I have no clue apart from what I saw on my experiments.


@jviotti I think your description is still mixing up the two issues that were uncovered; they were split into this one and #629. The other issue (the reproducible ENOSPC described there) happens regardless of compression.

In this issue:

  • ENOSPC would happen if there's a gz image with SIZE > 2^32 bytes and the user tries to burn it onto a card with CAPACITY < SIZE but CAPACITY > SIZE mod 2^32. That causes a problem because Etcher's initial capacity check cannot figure out the correct size.
  • If using a card where CAPACITY > SIZE, the only effect is that the "speed" reading is wrong; everything else works properly (including the progress bar), and the user won't run into ENOSPC.

Decompressing things twice might be a bad solution, but judging by the comments, gzip just wasn't designed for files this big (nor to return a correct size estimate for them), so I'm curious whether there's any solution other than running through the file twice. It shouldn't be too bad, especially if there's some UI indication such as "checking archive contents", so people know it hasn't just hung. I think doing things correctly for gzip is more important than taking a bit longer. Decompression itself, with no data storage and just byte counting, seems to be pretty fast.
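To illustrate the "run through the file once, only counting bytes" idea, here's a sketch using Node's built-in zlib (not a proposal for where this would hook into Etcher):

```ts
// Sketch: count the uncompressed size by streaming through the archive once,
// discarding the data. Only byte counting, no storage.
import * as fs from 'fs';
import * as zlib from 'zlib';

function countUncompressedBytes(path: string): Promise<number> {
  return new Promise((resolve, reject) => {
    let total = 0;
    fs.createReadStream(path)
      .pipe(zlib.createGunzip())
      .on('data', (chunk: Buffer) => {
        total += chunk.length; // count bytes, never keep the data
      })
      .on('end', () => resolve(total))
      .on('error', reject);
  });
}
```

The same 'data' handler could also drive a progress indicator to back a "checking archive contents" message in the UI.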