.gz doesn't return correct file size for content above 2^32 B = 4 GB #638

Open
imrehg opened this issue Aug 19, 2016 · 42 comments

@imrehg
Contributor

imrehg commented Aug 19, 2016

  • 1.0.0-beta13
  • Linux 64bit

Splitting this out from #629: the gzip file format cannot accurately report the uncompressed size of files above 4GB (2^32 bytes); it only stores the size modulo 2^32.

On the command line people recommend something like zcat file.gz | wc -c or gzip -dc file | wc -c, which give the correct value, though that means the file gets decompressed twice. We might have to do that for gzip in the end, since images over 4GB are likely common for Etcher's use case.

In the worst case this lets an image start being burned onto a card that is too small; it also affects the progress bar.

Testing with a 4100MiB (> 4096MiB) image confirms this: the .gz version lets me select a 512MB SD card, while the same file's .xz archive does not.
As for the progress bar, the MB/s reading is affected (it shows a very low speed, e.g. 0.01MB/s), but the progress percentage is not (it displays correctly during the burn), so that part is not too bad.
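
For reference, here is a minimal Node.js sketch (not Etcher code) of where that wrapped number comes from: per RFC 1952, gzip stores the uncompressed size in a 4-byte little-endian trailer, so anything above 2^32 bytes wraps around.

```js
// Minimal sketch, not Etcher's implementation: read the ISIZE trailer of a .gz file.
// The last 4 bytes hold the uncompressed size modulo 2^32 (little-endian), so a
// 4100 MiB image is reported as only 4 MiB.
const fs = require('fs');

function gzipReportedSize(path) {
  const fd = fs.openSync(path, 'r');
  const { size } = fs.fstatSync(fd);
  const trailer = Buffer.alloc(4);
  fs.readSync(fd, trailer, 0, 4, size - 4);
  fs.closeSync(fd);
  return trailer.readUInt32LE(0); // only the true size if the image is < 2^32 bytes
}
```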

@jviotti
Contributor

jviotti commented Aug 21, 2016

So as far as I understand, the user will eventually hit ENOSPC, without any further weird behaviour, right?

I would love to be able to reliably get the uncompressed image size (or at least a close estimate); however, decompressing the whole thing twice sounds like a very bad solution.

I'll research this and see if there is a way we can do it.

@imrehg
Contributor Author

imrehg commented Aug 22, 2016

@jviotti I think your description is still mixing up the two issues we uncovered, which were split into this one and #629. The other issue (the reproducible ENOSPC described there) happens regardless of compression.

In this issue:

  • ENOSPC would happen if there's a .gz image with SIZE > 2^32 bytes and the user tries to burn it onto a card with CAPACITY < SIZE but CAPACITY > SIZE mod 2^32. That causes a problem because the initial capacity check in Etcher cannot determine the correct size (a worked example follows at the end of this comment).
  • If using a card where CAPACITY > SIZE, the only effect is that the "speed" reading is wrong; everything else works properly (including the progress bar), and the user won't run into ENOSPC.

Decompressing things twice might be a bad solution, but judging by the comments, gzip simply wasn't designed for files this big (nor to return a correct size estimate for them), so I'm curious whether there's any solution other than running through the file twice. It shouldn't be too bad, especially with some UI indication such as "checking archive contents", so people know it hasn't just hung. I think handling gzip correctly is more important than taking a bit longer. Decompression that only counts bytes without storing any data seems to be pretty fast.
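
A quick worked example of the first bullet, using the numbers from the 4100 MiB test image above (plain Node.js arithmetic, nothing Etcher-specific):

```js
// Worked example of the failure mode described above (illustrative numbers only).
const MiB = 1024 * 1024;
const realSize = 4100 * MiB;                     // true uncompressed size
const reportedSize = realSize % Math.pow(2, 32); // what the gzip trailer stores: 4 MiB
const cardCapacity = 512 * 1000 * 1000;          // a 512 MB SD card

console.log(reportedSize <= cardCapacity); // true  -> the capacity check passes
console.log(realSize <= cardCapacity);     // false -> the flash would hit ENOSPC
```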

@jviotti
Contributor

jviotti commented Aug 22, 2016

  • If using a card where CAPACITY > SIZE, the only effect is that the "speed" reading is wrong; everything else works properly (including the progress bar), and the user won't run into ENOSPC.

I see. I wonder why the speed is wrong. I can't think of a way this issue could affect the speed (unless GZ files >4GB are very slow to decompress?).

Decompressing things twice might be a bad solution, but judging by the comments, gzip simply wasn't designed for files this big (nor to return a correct size estimate for them), so I'm curious whether there's any solution other than running through the file twice. It shouldn't be too bad, especially with some UI indication such as "checking archive contents", so people know it hasn't just hung. I think handling gzip correctly is more important than taking a bit longer. Decompression that only counts bytes without storing any data seems to be pretty fast.

Yeah, could be, but it needs more thought. I'd love to push a bit more to see if we can find an alternative solution. We're putting an enormous amount of effort into reducing the time it takes to get an image onto the drive, and it sounds counter-intuitive to then be willing to spend time decompressing things twice.


@lurch
Contributor

lurch commented Sep 29, 2016

As has already been mentioned, if the uncompressed file is over 4GB, gzip reports its size modulo 4GB (because it only has a 32-bit size field): gzip would report a 4.5GB file as 0.5GB, a 9.3GB file as 1.3GB, and a 15.9GB file as 3.9GB. Therefore the only way to get the size correctly is to decompress the whole file first.

However, I wonder if you could use some kind of heuristic to 'guess' when gzip has wrapped the reported size of the uncompressed file, based on the size of the compressed file? (I'm guessing the disk images used with Etcher probably have vaguely similar compression ratios.)
E.g. if the compressed .gz file is bigger than N GB, and gzip reports the uncompressed file as being M GB, perhaps you could 'deduce' that the size of the uncompressed file is actually M+4 GB?
Obviously you'd need to do some experimentation with different disk images to work out a reliable value for N.
(And similarly, if the compressed .gz file is bigger than N*2 GB, you could deduce that the uncompressed size is actually M+8 GB.)
Note that this heuristic is still likely to fail if you have a compressed disk image with a lot of blank unpartitioned space (because blank space compresses really well).
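
A rough sketch of that heuristic (illustrative only; MIN_RATIO plays the role of 'N' above, would need the experimentation mentioned, and still mis-guesses for mostly-blank images):

```js
// Heuristic sketch: keep adding 2^32 bytes to the gzip-reported size while the
// implied compression ratio looks implausibly good for a disk image.
const FOUR_GIB = Math.pow(2, 32);
const MIN_RATIO = 1.5; // assumed minimum plausible uncompressed/compressed ratio

function estimateUncompressedSize(compressedSize, reportedSize) {
  let estimate = reportedSize;
  while (estimate < compressedSize * MIN_RATIO) {
    estimate += FOUR_GIB; // assume the 32-bit size field wrapped once more
  }
  return estimate;
}
```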

I see. I wonder why the speed is wrong.

Well, I guess if Etcher thinks the uncompressed file is only 0.5GB, but it's actually 4.5GB, perhaps the speed is calculated based on the 0.5GB figure? (4.5GB obviously takes a lot longer to write than 0.5GB would!)

P.S. Obviously where I've said '4GB' everywhere above, it's just a shorthand for '2^32 bytes'.

jviotti modified the milestone: v1.0 on Nov 28, 2016
@alexandrosm
Contributor

Revisiting this issue, I am with @lurch on this. He is probably right about the heuristic approach and about the speed explanation. @jviotti?

@jviotti
Contributor

jviotti commented Dec 2, 2016

Sounds good, let's experiment with it.

@WasabiFan
Contributor

Obviously where I've said '4GB' everywhere above, it's just a shorthand for '2^32 bytes'.

I'd suggest you watch your units more closely, especially when building the UI or the write logic. 2^32 bytes is actually 4 GiB (gibibytes), not 4 GB (gigabytes).

@lurch
Contributor

lurch commented Dec 2, 2016

I can never remember which way around they are :-/

And as Etcher is aimed at novices, I expect they might get confused if we started reporting everything in GiB instead of just GB?

@WasabiFan
Contributor

I can never remember which way around they are :-/

The simple way to remember it is to compare it to metric measurements. If it has an SI prefix like metric units do, you know it's a power of ten (as with metric).

And as Etcher is aimed at novices, I expect they might get confused if we started reporting everything in GiB instead of just GB?

As I see it, if a novice doesn't know what gibibytes are, they'll probably just assume that GiB is "Gigabytes". That's a close enough approximation for said novice's uses, I'd imagine.

jviotti modified the milestones: v1.0, Stability on Dec 7, 2016
@alexandrosm
Contributor

@jviotti did anyone end up doing anything with this? It really should be a fairly simple fix for 99% of cases

@alexandrosm
Contributor

(this is our oldest bug still open)

@lurch
Contributor

lurch commented Mar 2, 2017

Is there actually anyone using gzip to compress images over 4GB in size? I.e. does anyone have some example images that can be used for testing, or is this just an edge-case that we'll (probably) never hit in practice?

@jviotti
Contributor

jviotti commented Mar 3, 2017

I've seen some out there, although it's definitely rare. It shouldn't be hard to fix the speed + percentage issues, though. I'm happy to treat gzip sizes with a grain of salt and eventually throw ENOSPC if there is no remaining space on the drive, like we do with bzip2.

@jviotti
Contributor

jviotti commented Mar 3, 2017

@jhermsmeier Can you help me out with this one? Try flashing a .gz image directly with the Etcher CLI. The flash progress quickly reaches 99% and remains there for quite some time while there's more data coming through the stream chain. Maybe we're not calculating the size correctly somewhere?

@jviotti
Contributor

jviotti commented Mar 3, 2017

Oh, the sizes seem to be fine. The issue is that for compressed images, the stream chain looks like this:

  • Input compressed file
  • Calculate progress based on compressed size
  • Apply decompression transform
  • Write to drive

It looks like the decompression is very fast compared to the drive writing, and therefore the progress reaches 99% much sooner.

I'm not sure what the solution would be here. In some cases we only have the uncompressed size, so we should make use of it to show the progress. Maybe there is a way to make the initial readable stream wait for the drive writes before sending more data?
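
For reference, when the chain is wired together with pipe(), Node streams already propagate backpressure: once the writable's internal buffer is full, the transform and the source are paused until it drains. A minimal sketch of the chain described above, with illustrative paths (not Etcher's actual modules):

```js
// Sketch of the stream chain above (paths are placeholders).
const fs = require('fs');
const zlib = require('zlib');

const source = fs.createReadStream('/path/to/image.img.gz'); // input compressed file
const gunzip = zlib.createGunzip();                          // decompression transform
const drive = fs.createWriteStream('/dev/sdX');              // write to drive

// pipe() handles pause/resume for us: while the drive's buffer is full,
// gunzip (and therefore the source) stop receiving data.
source.pipe(gunzip).pipe(drive);

drive.on('finish', () => console.log('done'));
drive.on('error', (error) => console.error('flash failed:', error));
```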

@jviotti
Contributor

jviotti commented Mar 3, 2017

Looks like pausing/resuming the readable stream should do it.

@jviotti
Contributor

jviotti commented Mar 3, 2017

I investigated the slow speed issue in more detail, and it looks to be resolved in master. There is still a speed penalty (~2.0 MB/s) when decompressing large files, but it's nowhere near @imrehg's initial report (~0.01MB/s).

The slow speed seems to happen on larger compressed images, and also depends on the compression level. I ran various experiments with images of several sizes (from ~1 GB to ~4 GB), using compression levels 1 to 9 inclusive, and the decompression time drastically increases on larger images, which can be reproduced with the gzip tool as well.

In summary:

  • Images that don't fit on the drive will eventually cause ENOSPC, leading to a friendly message being presented to the user
  • The absurd speed readings have been fixed
  • We're hitting a UX issue where decompression is much faster than flashing (for certain images)

I think we can close this issue once the last one is fixed.

@jviotti
Contributor

jviotti commented Mar 3, 2017

@jhermsmeier Check http://linorg.usp.br/OpenELEC/OpenELEC-Generic.x86_64-6.0.3.img.gz for an image that showcases the 99% UX issue.

@lurch
Contributor

lurch commented Mar 3, 2017

the decompression time drastically increases on larger images

Isn't that expected?!? I'd expect a streaming gzip decompressor to have a constant(ish) decompression speed, so of course a larger image will take longer to decompress.
Or are you saying that larger images decompress at a slower speed (bitrate) than smaller images, in which case that sounds like it might be a resource-leak somewhere?
I'm sure @jhermsmeier, our streaming expert, will have some better ideas ;-)

In some cases we only have the uncompressed size, so we should make use of it to show the progress.

I suggested elsewhere that, since we'll only be supporting streaming images from our online catalog, the catalog could store the size of the uncompressed image, which would solve this problem (and of course that uncompressed-image-size figure should be updated automatically, so that it never gets out of sync with the actual compressed image).

@jviotti
Contributor

jviotti commented Mar 9, 2017

The current implementation calculates the percentage based on how much of the file has been decompressed. For most decompression algorithms we're using, writing to the drive is faster than decompressing, so the progress bar displays fine; however, gzip seems to be a special case because it's quite fast.

I avoided heuristics and relied on that approach given that for some compression methods (e.g. bzip2) it's impossible to get even an estimate, and we'd really be guessing in the dark.

I believe that we can pause decompression if we're getting slow on writes, and that should work fine.
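
If a usable uncompressed-size estimate is available, one way to count bytes on the decompressed side is a small pass-through Transform between the gunzip step and the drive writer; a sketch with illustrative names (the estimate itself has to come from elsewhere: ISIZE, a catalog, or a heuristic):

```js
// Sketch: count bytes *after* decompression, so progress tracks what still has to
// be written to the drive rather than how much of the .gz file has been read.
const { Transform } = require('stream');

function createProgressMeter(estimatedUncompressedSize, onProgress) {
  let bytesSeen = 0;
  return new Transform({
    transform(chunk, encoding, callback) {
      bytesSeen += chunk.length;
      onProgress(Math.min(100, (bytesSeen / estimatedUncompressedSize) * 100));
      callback(null, chunk); // pass the data through unchanged
    }
  });
}

// Usage: source.pipe(gunzip).pipe(createProgressMeter(size, render)).pipe(drive);
```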

@lurch
Contributor

lurch commented Mar 9, 2017

No such thing as threads in Node userland ;)

From what I remember reading about Electron when I first started working on Etcher, I thought it was designed around having separate co-operating processes?

if this were to run in parallel, we'd be bottlenecking the CPU

I wonder if we can assign threads/processes to have a lower priority than the main thread? Given that we're going to have lots of I/O waits waiting for the SD card to write, there should be plenty of 'spare' CPU time-slots.

or I/O if the source is slow

Well, I'd still expect even reading from a 'slow' disk to be faster than writing to an SD card?

For most decompression algorithms we're using, writing to the drive is faster than decompressing

Yikes! :-( Is that because they're implemented in pure-javascript, and JS isn't designed for manipulating binary data? How much is it slowing things down by - would it be worth trying to find native-binding equivalents? (if I'm using the right terminology)
A benchmark from 12 years ago (wow!) shows even the most highly-compressed bzip2 file decompressing at over 5MB/s. (A more recent benchmark shows bzip2 decompressing at over 20MB/s)

however, gzip seems to be a special case because it's quite fast.

Maybe that points to gzip being implemented in C rather than JS?

given that for some compression methods (e.g. bzip2) it's impossible to get even an estimate

Yeah, the background-decompression-thread approach I suggested above would also work for .bz2 files (although I guess it means the progress-indicator would have to switch part-way through from being a "rough estimate, based on how much of the input file has been decompressed", to "an exact figure, based on how much of the output file has been written to disk so far").

I think heuristic + counting bytes while writing is probably the less resource-intensive path.

I believe that we can pause decompression if we're getting slow on writes, and that should work fine.

Fair enough, looks like I've been out-voted ;-)

@alexandrosm
Contributor

alexandrosm commented Mar 9, 2017 via email

@alexandrosm
Contributor

alexandrosm commented Mar 9, 2017 via email

@lurch
Contributor

lurch commented Mar 9, 2017

Ahhhh-hhaaaa!!! I've just realised something significant :-D

As seen in #1171 (comment) the 29476.5MiB of android-ver6.0-20170112-pine64-32GB.img gets compressed to just 778MiB in android-ver6.0-20170112-pine64-32GB.img.gz. However (here's the important part) the data isn't distributed evenly throughout the .gz file - there's maybe 2 or 3 GB of actual disk data, and then 20+ GB of binary zeroes (which is why the image compresses so well). But it only takes a few blocks to store those 20+ GB of binary zeroes in the .gz file. So by the time all the 'real data' has been decompressed and written to the SD card, it's totally believable that we are now 95% of the way through reading the gzip file (as @jviotti explains above, the progress indicator is based on how much of the gzip file has been read so far).
So even if the stream-backpressure stuff is working correctly, the progress indicator will display 95% progress, even though we've actually only written maybe 3GB out of the actual 25GB that we need to write to disk. And of course if the progress bar thinks we're already 95% done, but we've still got another 20+ GB of zeroes to write, that'll lead to the "apparent write speed" dropping so dramatically.

So it's not necessarily just the "gzip being quite fast" that is causing this problem, but the fact that disk images tend to have lots of empty-space at the end, which compresses super-efficiently, and takes up only a very small fraction of the compressed file, but still takes a very long time (much longer than the "real data" in the example in #1171 ) to actually write out to disk.

Does that make sense?
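
(To see just how extreme that ratio is, here is a tiny illustrative Node snippet, not Etcher code: 100 MiB of binary zeroes gzips down to roughly 100 KB, about 1000:1.)

```js
// Illustrative only: long runs of zeroes occupy almost no space in a .gz file,
// yet still have to be written out to the card in full.
const zlib = require('zlib');

const zeroes = Buffer.alloc(100 * 1024 * 1024); // 100 MiB of binary zeroes
const compressed = zlib.gzipSync(zeroes);
console.log(compressed.length); // on the order of 100 KB
```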

@lurch
Contributor

lurch commented Mar 9, 2017

...and on the subject of background processes in Electron, I just found this - @jviotti that's a similar approach to your child-writer stuff, isn't it?

@jviotti
Contributor

jviotti commented Mar 9, 2017

Yeah, we can always create child processes and communicate through IPC, which can provide us with some sort of multi-threading.
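
A minimal sketch of that fork-plus-IPC pattern (the worker file name and message shapes are made up for illustration; this is not Etcher's actual child-writer):

```js
// Parent side of a hypothetical fork + IPC setup: the heavy decompression and
// writing would live in writer.js (not shown), keeping the UI process responsive.
const { fork } = require('child_process');

const writer = fork('./writer.js'); // hypothetical worker script
writer.on('message', (message) => {
  if (message.type === 'progress') {
    console.log(`${message.bytesWritten} bytes written`);
  }
});
writer.send({ type: 'start', image: '/path/to/image.img.gz', drive: '/dev/sdX' });
```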

@DG12

DG12 commented Apr 2, 2019

How about displaying something like a wild guess, and saying so, whenever the compressed size is bigger than the size gzip claims the output will be? Although it's possible for a compressed file to genuinely be bigger than its output, that would be much better than reporting a really bogus number, like a 5GB input reporting an 866MB output!

Here's an off-the-wall idea: ask the user if they know what the output size is supposed to be! I would expect that in most cases they do.

Interesting side note: 2 years ago lurch commented "Is there actually anyone using gzip to compress images over 4GB in size". Today, why would anyone buy an SD card smaller than 16GB!

@lucventurini

Dear all,
if anyone is still following this bug... in bioinformatics we have thousands of ASCII files that are (when compressed) well over 4GiB. This is a very common occurrence, and therefore the inability of gzip to report the correct size of the uncompressed file is a routine nuisance.

As one of many potential examples: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR277/ERR277077/

Thank you for supporting this amazing utility!

@tipuraneo

I recently also ran into this with gzip 1.9:

$ gzip -l filename.gz
         compressed        uncompressed  ratio uncompressed_name
         5627740354          2198646035 -156.0% filename

@lurch
Contributor

lurch commented Aug 5, 2019

@tipuraneo It's a limitation of the .gz file-format itself, rather than a 'bug' in any particular implementation of gzip.

@tipuraneo

I know. So you could say gzip is not the best choice for files > 4 GB? I prefer xz over gzip.
