.gz doesn't return correct file size for content above 2^32 B = 4 GB #638

Open
imrehg opened this issue Aug 19, 2016 · 42 comments

@imrehg
Contributor

imrehg commented Aug 19, 2016

  • 1.0.0-beta13
  • Linux 64bit

Splitting this out from #629: the gzip file format cannot accurately report the uncompressed size of files above 4GB (2^32 bytes); it only stores the size modulo 2^32.

On the command line people recommend something like zcat file.gz | wc -c or gzip -dc file | wc -c, which give the correct value, though that means the file gets decompressed twice. We might have to do that for gzip in the end, since images over 4GB are likely common for Etcher's use case.

In the worst case this lets an image start being burned onto a card that is too small; it also affects the progress bar.

Testing with a 4100MiB (> 4096MiB) image confirms this: the .gz version lets me select a 512MB SD card, while the same file's .xz archive does not.
As for the progress bar, the MB/s reading is affected (it shows a very low speed, e.g. 0.01MB/s), but the progress percentage is not (it displays correctly during the burn), so that part is not too bad.
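
For reference, here is a minimal Node.js sketch (not Etcher code) of where that wrapped number comes from: per RFC 1952, gzip stores the uncompressed size in a 4-byte little-endian trailer, so anything above 2^32 bytes wraps around.

```js
// Minimal sketch, not Etcher's implementation: read the ISIZE trailer of a .gz file.
// The last 4 bytes hold the uncompressed size modulo 2^32 (little-endian), so a
// 4100 MiB image is reported as only 4 MiB.
const fs = require('fs');

function gzipReportedSize(path) {
  const fd = fs.openSync(path, 'r');
  const { size } = fs.fstatSync(fd);
  const trailer = Buffer.alloc(4);
  fs.readSync(fd, trailer, 0, 4, size - 4);
  fs.closeSync(fd);
  return trailer.readUInt32LE(0); // only the true size if the image is < 2^32 bytes
}
```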

@jviotti
Contributor

jviotti commented Aug 21, 2016

So as far as I understand, the user will eventually hit ENOSPC, without any further weird behaviour, right?

I would love to be able to reliably get the uncompressed image size (or at least a close estimate); however, decompressing the whole thing twice sounds like a very bad solution.

I'll research this and see if there is a way we can do it.

@imrehg
Contributor Author

imrehg commented Aug 22, 2016

@jviotti I think your description is still mixing up the two issues we uncovered, which were split into this one and #629. The other issue (the reproducible ENOSPC described there) happens regardless of compression.

In this issue:

  • ENOSPC would happen if there's a .gz image with SIZE > 2^32 bytes and the user tries to burn it onto a card with CAPACITY < SIZE but CAPACITY > SIZE mod 2^32. That causes a problem because the initial capacity check in Etcher cannot determine the correct size (a worked example follows at the end of this comment).
  • If using a card where CAPACITY > SIZE, the only effect is that the "speed" reading is wrong; everything else works properly (including the progress bar), and the user won't run into ENOSPC.

Decompressing things twice might be a bad solution, but judging by the comments, gzip simply wasn't designed for files this big (nor to return a correct size estimate for them), so I'm curious whether there's any solution other than running through the file twice. It shouldn't be too bad, especially with some UI indication such as "checking archive contents", so people know it hasn't just hung. I think handling gzip correctly is more important than taking a bit longer. Decompression that only counts bytes without storing any data seems to be pretty fast.
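
A quick worked example of the first bullet, using the numbers from the 4100 MiB test image above (plain Node.js arithmetic, nothing Etcher-specific):

```js
// Worked example of the failure mode described above (illustrative numbers only).
const MiB = 1024 * 1024;
const realSize = 4100 * MiB;                     // true uncompressed size
const reportedSize = realSize % Math.pow(2, 32); // what the gzip trailer stores: 4 MiB
const cardCapacity = 512 * 1000 * 1000;          // a 512 MB SD card

console.log(reportedSize <= cardCapacity); // true  -> the capacity check passes
console.log(realSize <= cardCapacity);     // false -> the flash would hit ENOSPC
```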

@jviotti
Contributor

jviotti commented Aug 22, 2016

  • If using a card where CAPACITY > SIZE, the only effect is that the "speed" reading is wrong; everything else works properly (including the progress bar), and the user won't run into ENOSPC.

I see. I wonder why the speed is wrong. I can't think of a way this issue could affect the speed (unless GZ files >4GB are very slow to decompress?).

Decompressing things twice might be a bad solution, but judging by the comments, gzip simply wasn't designed for files this big (nor to return a correct size estimate for them), so I'm curious whether there's any solution other than running through the file twice. It shouldn't be too bad, especially with some UI indication such as "checking archive contents", so people know it hasn't just hung. I think handling gzip correctly is more important than taking a bit longer. Decompression that only counts bytes without storing any data seems to be pretty fast.

Yeah, could be, but it needs more thought. I'd love to push a bit more to see if we can find an alternative solution. We're putting an enormous amount of effort into reducing the time it takes to get an image onto the drive, and it sounds counter-intuitive to then be willing to spend time decompressing things twice.


@lurch
Contributor

lurch commented Sep 29, 2016

As has already been mentioned, if the uncompressed file is over 4GB, gzip reports its size modulo 4GB (because it only has a 32-bit size field): gzip would report a 4.5GB file as 0.5GB, a 9.3GB file as 1.3GB, and a 15.9GB file as 3.9GB. Therefore the only way to get the size correctly is to decompress the whole file first.

However, I wonder if you could use some kind of heuristic to 'guess' when gzip has wrapped the reported size of the uncompressed file, based on the size of the compressed file? (I'm guessing the disk images used with Etcher probably have vaguely similar compression ratios.)
E.g. if the compressed .gz file is bigger than N GB, and gzip reports the uncompressed file as being M GB, perhaps you could 'deduce' that the size of the uncompressed file is actually M+4 GB?
Obviously you'd need to do some experimentation with different disk images to work out a reliable value for N.
(And similarly, if the compressed .gz file is bigger than N*2 GB, you could deduce that the uncompressed size is actually M+8 GB.)
Note that this heuristic is still likely to fail if you have a compressed disk image with a lot of blank unpartitioned space (because blank space compresses really well).
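
A rough sketch of that heuristic (illustrative only; MIN_RATIO plays the role of 'N' above, would need the experimentation mentioned, and still mis-guesses for mostly-blank images):

```js
// Heuristic sketch: keep adding 2^32 bytes to the gzip-reported size while the
// implied compression ratio looks implausibly good for a disk image.
const FOUR_GIB = Math.pow(2, 32);
const MIN_RATIO = 1.5; // assumed minimum plausible uncompressed/compressed ratio

function estimateUncompressedSize(compressedSize, reportedSize) {
  let estimate = reportedSize;
  while (estimate < compressedSize * MIN_RATIO) {
    estimate += FOUR_GIB; // assume the 32-bit size field wrapped once more
  }
  return estimate;
}
```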

I see. I wonder why the speed is wrong.

Well, I guess if Etcher thinks the uncompressed file is only 0.5GB, but it's actually 4.5GB, perhaps the speed is calculated based on the 0.5GB figure? (4.5GB obviously takes a lot longer to write than 0.5GB would!)

P.S. Obviously where I've said '4GB' everywhere above, it's just a shorthand for '2^32 bytes'.

jviotti modified the milestone: v1.0 on Nov 28, 2016
@alexandrosm
Contributor

Revisiting this issue, I am with @lurch on this. He is probably right about the heuristic approach and about the speed explanation. @jviotti?

@jviotti
Contributor

jviotti commented Dec 2, 2016

Sounds good, let's experiment with it.

@WasabiFan
Contributor

Obviously where I've said '4GB' everywhere above, it's just a shorthand for '2^32 bytes'.

I'd suggest you watch your units more closely, especially when building the UI or the write logic. 2^32 bytes is actually 4 GiB (gibibytes), not 4 GB (gigabytes).

@lurch
Contributor

lurch commented Dec 2, 2016

I can never remember which way around they are :-/

And as Etcher is aimed at novices, I expect they might get confused if we started reporting everything in GiB instead of just GB?

@WasabiFan
Contributor

I can never remember which way around they are :-/

The simple way to remember it is to compare it to metric measurements. If it has an SI prefix like metric units do, you know it's a power of ten (as with metric).

And as Etcher is aimed at novices, I expect they might get confused if we started reporting everything in GiB instead of just GB?

As I see it, if a novice doesn't know what gibibytes are, they'll probably just assume that GiB is "Gigabytes". That's a close enough approximation for said novice's uses, I'd imagine.

jviotti modified the milestones: v1.0, Stability on Dec 7, 2016
@alexandrosm
Contributor

@jviotti did anyone end up doing anything with this? It really should be a fairly simple fix for 99% of cases

@alexandrosm
Contributor

(this is our oldest bug still open)

@lurch
Contributor

lurch commented Mar 2, 2017

Is there actually anyone using gzip to compress images over 4GB in size? I.e. does anyone have some example images that can be used for testing, or is this just an edge-case that we'll (probably) never hit in practice?

@jviotti
Contributor

jviotti commented Mar 3, 2017

I've seen some out there, although it's definitely rare. It shouldn't be hard to fix the speed + percentage issues, though. I'm happy to treat gzip sizes with a grain of salt and eventually throw ENOSPC if there is no remaining space on the drive, like we do with bzip2.

@jviotti
Contributor

jviotti commented Mar 3, 2017

@jhermsmeier Can you help me out with this one? Try flashing a .gz image directly with the Etcher CLI. The flash progress quickly reaches 99% and remains there for quite some time while there's more data coming through the stream chain. Maybe we're not calculating the size correctly somewhere?

@jviotti
Contributor

jviotti commented Mar 3, 2017

Oh, the sizes seem to be fine. The issue is that for compressed images, the stream chain looks like this:

  • Input compressed file
  • Calculate progress based on compressed size
  • Apply decompression transform
  • Write to drive

It looks like the decompression is very fast compared to the drive writing, and therefore the progress reaches 99% much sooner.

I'm not sure what the solution would be here. In some cases we only have the uncompressed size, so we should make use of it to show the progress. Maybe there is a way to make the initial readable stream wait for the drive writes before sending more data?
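
For reference, when the chain is wired together with pipe(), Node streams already propagate backpressure: once the writable's internal buffer is full, the transform and the source are paused until it drains. A minimal sketch of the chain described above, with illustrative paths (not Etcher's actual modules):

```js
// Sketch of the stream chain above (paths are placeholders).
const fs = require('fs');
const zlib = require('zlib');

const source = fs.createReadStream('/path/to/image.img.gz'); // input compressed file
const gunzip = zlib.createGunzip();                          // decompression transform
const drive = fs.createWriteStream('/dev/sdX');              // write to drive

// pipe() handles pause/resume for us: while the drive's buffer is full,
// gunzip (and therefore the source) stop receiving data.
source.pipe(gunzip).pipe(drive);

drive.on('finish', () => console.log('done'));
drive.on('error', (error) => console.error('flash failed:', error));
```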

@jviotti
Contributor

jviotti commented Mar 3, 2017

Looks like pausing/resuming the readable stream should do it.

@jviotti
Contributor

jviotti commented Mar 3, 2017

I investigated the slow speed issue in more detail, and it looks to be resolved in master. There is still a speed penalty (~2.0 MB/s) when decompressing large files, but it's nowhere near @imrehg's initial report (~0.01MB/s).

The slow speed seems to happen on larger compressed images, and also depends on the compression level. I ran various experiments with images of several sizes (from ~1 GB to ~4 GB), using compression levels 1 to 9 inclusive, and the decompression time drastically increases on larger images, which can be reproduced with the gzip tool as well.

In summary:

  • Images that don't fit on the drive will eventually cause ENOSPC, leading to a friendly message being presented to the user
  • The absurd speed readings have been fixed
  • We're hitting a UX issue where decompression is much faster than flashing (for certain images)

I think we can close this issue once the last one is fixed.

@jviotti
Contributor

jviotti commented Mar 3, 2017

@jhermsmeier Check http://linorg.usp.br/OpenELEC/OpenELEC-Generic.x86_64-6.0.3.img.gz for an image that showcases the 99% UX issue.

@lurch
Contributor

lurch commented Mar 3, 2017

the decompression time drastically increases on larger images

Isn't that expected?!? I'd expect a streaming gzip decompressor to have a constant(ish) decompression speed, so of course a larger image will take longer to decompress.
Or are you saying that larger images decompress at a slower speed (bitrate) than smaller images, in which case that sounds like it might be a resource-leak somewhere?
I'm sure @jhermsmeier, our streaming expert, will have some better ideas ;-)

In some cases we only have the uncompressed size, so we should make use of it to show the progress.

I suggested elsewhere that, since we'll only be supporting streaming images from our online catalog, the catalog could store the size of the uncompressed image, which would solve this problem (and of course that uncompressed-image-size figure should be updated automatically, so that it never gets out of sync with the actual compressed image).

@jviotti
Contributor

jviotti commented Mar 9, 2017

The current implementation calculates the percentage based on how much of the file has been decompressed. For most decompression algorithms we're using, writing to the drive is faster than decompressing, so the progress bar displays fine; however, gzip seems to be a special case because it's quite fast.

I avoided heuristics and relied on that approach given that for some compression methods (e.g. bzip2) it's impossible to get even an estimate, and we'd really be guessing in the dark.

I believe that we can pause decompression if we're getting slow on writes, and that should work fine.
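
If a usable uncompressed-size estimate is available, one way to count bytes on the decompressed side is a small pass-through Transform between the gunzip step and the drive writer; a sketch with illustrative names (the estimate itself has to come from elsewhere: ISIZE, a catalog, or a heuristic):

```js
// Sketch: count bytes *after* decompression, so progress tracks what still has to
// be written to the drive rather than how much of the .gz file has been read.
const { Transform } = require('stream');

function createProgressMeter(estimatedUncompressedSize, onProgress) {
  let bytesSeen = 0;
  return new Transform({
    transform(chunk, encoding, callback) {
      bytesSeen += chunk.length;
      onProgress(Math.min(100, (bytesSeen / estimatedUncompressedSize) * 100));
      callback(null, chunk); // pass the data through unchanged
    }
  });
}

// Usage: source.pipe(gunzip).pipe(createProgressMeter(size, render)).pipe(drive);
```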

@lurch
Contributor

lurch commented Mar 9, 2017

No such thing as threads in Node userland ;)

From what I remember reading about Electron when I first started working on Etcher, I thought it was designed around having separate co-operating processes?

if this were to run in parallel, we'd be bottlenecking the CPU

I wonder if we can assign threads/processes to have a lower priority than the main thread? Given that we're going to have lots of I/O waits waiting for the SD card to write, there should be plenty of 'spare' CPU time-slots.

or I/O if the source is slow

Well, I'd still expect even reading from a 'slow' disk to be faster than writing to an SD card?

For most decompression algorithms we're using, writing to the drive is faster than decompressing

Yikes! :-( Is that because they're implemented in pure-javascript, and JS isn't designed for manipulating binary data? How much is it slowing things down by - would it be worth trying to find native-binding equivalents? (if I'm using the right terminology)
A benchmark from 12 years ago (wow!) shows even the most highly-compressed bzip2 file decompressing at over 5MB/s. (A more recent benchmark shows bzip2 decompressing at over 20MB/s)

however, gzip seems to be a special case because it's quite fast.

Maybe that points to gzip being implemented in C rather than JS?

given that for some compression methods (e.g. bzip2) it's impossible to get even an estimate

Yeah, the background-decompression-thread approach I suggested above would also work for .bz2 files (although I guess it means the progress-indicator would have to switch part-way through from being a "rough estimate, based on how much of the input file has been decompressed", to "an exact figure, based on how much of the output file has been written to disk so far").

I think heuristic + counting bytes while writing is probably the less resource-intensive path.

I believe that we can pause decompression if we're getting slow on writes, and that should work fine.

Fair enough, looks like I've been out-voted ;-)

@alexandrosm
Contributor

alexandrosm commented Mar 9, 2017 via email

@alexandrosm
Contributor

alexandrosm commented Mar 9, 2017 via email

@lurch
Contributor

lurch commented Mar 9, 2017

Ahhhh-hhaaaa!!! I've just realised something significant :-D

As seen in #1171 (comment) the 29476.5MiB of android-ver6.0-20170112-pine64-32GB.img gets compressed to just 778MiB in android-ver6.0-20170112-pine64-32GB.img.gz. However (here's the important part) the data isn't distributed evenly throughout the .gz file - there's maybe 2 or 3 GB of actual disk data, and then 20+ GB of binary zeroes (which is why the image compresses so well). But it only takes a few blocks to store those 20+ GB of binary zeroes in the .gz file. So by the time all the 'real data' has been decompressed and written to the SD card, it's totally believable that we are now 95% of the way through reading the gzip file (as @jviotti explains above, the progress indicator is based on how much of the gzip file has been read so far).
So even if the stream-backpressure stuff is working correctly, the progress indicator will display 95% progress, even though we've actually only written maybe 3GB out of the actual 25GB that we need to write to disk. And of course if the progress bar thinks we're already 95% done, but we've still got another 20+ GB of zeroes to write, that'll lead to the "apparent write speed" dropping so dramatically.

So it's not necessarily just the "gzip being quite fast" that is causing this problem, but the fact that disk images tend to have lots of empty-space at the end, which compresses super-efficiently, and takes up only a very small fraction of the compressed file, but still takes a very long time (much longer than the "real data" in the example in #1171 ) to actually write out to disk.

Does that make sense?
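
(To see just how extreme that ratio is, here is a tiny illustrative Node snippet, not Etcher code: 100 MiB of binary zeroes gzips down to roughly 100 KB, about 1000:1.)

```js
// Illustrative only: long runs of zeroes occupy almost no space in a .gz file,
// yet still have to be written out to the card in full.
const zlib = require('zlib');

const zeroes = Buffer.alloc(100 * 1024 * 1024); // 100 MiB of binary zeroes
const compressed = zlib.gzipSync(zeroes);
console.log(compressed.length); // on the order of 100 KB
```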

@lurch
Contributor

lurch commented Mar 9, 2017

...and on the subject of background processes in Electron, I just found this - @jviotti that's a similar approach to your child-writer stuff, isn't it?

@jviotti
Contributor

jviotti commented Mar 9, 2017

Yeah, we can always create child processes and communicate through IPC, which can provide us with some sort of multi-threading.
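
A minimal sketch of that fork-plus-IPC pattern (the worker file name and message shapes are made up for illustration; this is not Etcher's actual child-writer):

```js
// Parent side of a hypothetical fork + IPC setup: the heavy decompression and
// writing would live in writer.js (not shown), keeping the UI process responsive.
const { fork } = require('child_process');

const writer = fork('./writer.js'); // hypothetical worker script
writer.on('message', (message) => {
  if (message.type === 'progress') {
    console.log(`${message.bytesWritten} bytes written`);
  }
});
writer.send({ type: 'start', image: '/path/to/image.img.gz', drive: '/dev/sdX' });
```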

@DG12

DG12 commented Apr 2, 2019

How about displaying something like a wild guess, and saying so, whenever the compressed size is bigger than the size gzip claims the output will be? Although it's possible for a compressed file to genuinely be bigger than its output, that would be much better than reporting a really bogus number, like a 5GB input reporting an 866MB output!

Here's an off-the-wall idea: ask the user if they know what the output size is supposed to be! I would expect that in most cases they do.

Interesting side note: 2 years ago lurch commented "Is there actually anyone using gzip to compress images over 4GB in size". Today, why would anyone buy an SD card smaller than 16GB!

@lucventurini

Dear all,
if anyone is still following this bug... in bioinformatics we have thousands of ASCII files that are (when compressed) well over 4GiB. This is a very common occurrence, and therefore the inability of gzip to report the correct size of the uncompressed file is a routine nuisance.

As one of many potential examples: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR277/ERR277077/

Thank you for supporting this amazing utility!

@tipuraneo

I recently also ran into this with gzip 1.9:

$ gzip -l filename.gz
         compressed        uncompressed  ratio uncompressed_name
         5627740354          2198646035 -156.0% filename

@lurch
Contributor

lurch commented Aug 5, 2019

@tipuraneo It's a limitation of the .gz file-format itself, rather than a 'bug' in any particular implementation of gzip.

@tipuraneo

I know. So you could say gzip is not the best choice for files > 4 GB? I prefer xz over gzip.
