.gz doesn't return correct file size for content above 2^32 B = 4 GB #638
Comments
So as far as I understand, the user will eventually hit an error. I would love to be able to reliably get the drive size (or at least closely estimate it); however, decompressing the whole thing twice sounds like a very bad solution. I'll research this and see if there is a way we can do it. |
@jviotti I think your description is still mixing up the two uncovered issues, which were broken out into two parts: this one and #629. In this issue: the incorrect uncompressed size reported for .gz archives above 4GB (2^32 bytes).
Decompressing things twice might be a bad solution, but judging by the comments, it's just |
I see. I wonder why the speed is wrong. I can't think of a way this
Yeah, could be, but needs more thought. I'd love to push a bit more on this. |
As has already been mentioned, if the uncompressed file is over 4GB, gzip returns its size modulo 4GB (because it only has a 32bit size field) - gzip would report a 4.5GB file as 0.5GB, a 9.3GB file as 1.3GB, and a 15.9GB file as 3.9GB. So therefore the only way to correctly get the size is to uncompress the whole file first. However I wonder if you could use some kind of heuristic to 'guess' when gzip has wrapped the size of the compressed file, based on the size of the uncompressed file? (I'm guessing the disk images used with Etcher probably have vaguely similar compression ratios).
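For illustration, here is a sketch of that guessing approach in Node (not Etcher's actual code; the file and function names are made up). The last four bytes of a gzip file are the ISIZE field: the uncompressed size modulo 2^32, stored little-endian:

```js
// A minimal sketch of reading the gzip ISIZE field and guessing how many
// times it wrapped. Assumes a plain single-member .gz file.
const fs = require('fs');

function guessUncompressedSize(gzPath) {
  const fd = fs.openSync(gzPath, 'r');
  const compressedSize = fs.fstatSync(fd).size;

  // ISIZE: last 4 bytes of the file, little-endian, uncompressed size mod 2^32.
  const isize = Buffer.alloc(4);
  fs.readSync(fd, isize, 0, 4, compressedSize - 4);
  fs.closeSync(fd);
  let guess = isize.readUInt32LE(0);

  // Heuristic: the real uncompressed size can't be smaller than the compressed
  // file, so keep adding 2^32 until it is at least that big. This still guesses
  // wrong for images that wrapped *and* compress to under 4 GiB.
  while (guess < compressedSize) {
    guess += 2 ** 32;
  }
  return guess;
}

console.log(guessUncompressedSize('disk-image.img.gz')); // illustrative path
```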
Well, I guess if Etcher thinks the uncompressed file is only 0.5GB, but it's actually 4.5GB, perhaps the speed is calculated based on the 0.5GB figure? (4.5GB obviously takes a lot longer to write than 0.5GB would!) P.S. Obviously where I've said '4GB' everywhere above, it's just a shorthand for '2^32 bytes'. |
Sounds good, let's experiment with it. |
I'd suggest that you guys watch your units more closely, especially when building the UI or the write logic. |
I can never remember which way around they are :-/ And as Etcher is aimed at novices, I expect they might get confused if we started reporting everything in GiB. |
The simple way to remember it is to compare it to metric measurements. If it has a plain SI prefix, like metric units do, you know it's a power of ten (as with metric); the 'i' prefixes (KiB, MiB, GiB) are the binary, power-of-two ones.
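For reference, the arithmetic behind the two unit systems (plain Node, purely illustrative):

```js
const GB  = 1000 ** 3;  // SI "gigabyte": 1,000,000,000 bytes
const GiB = 1024 ** 3;  // "gibibyte":    1,073,741,824 bytes

// The "4GB" shorthand used in this thread really means 2^32 bytes, i.e. 4GiB:
console.log(2 ** 32 === 4 * GiB);   // true
console.log((4 * GiB) / GB);        // 4.294967296, so about 4.29 SI gigabytes
```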
As I see it, if a novice doesn't know what gibibytes are, they'll probably just assume that GiB is "Gigabytes". That's a close enough approximation for said novice's uses, I'd imagine. |
@jviotti did anyone end up doing anything with this? It really should be a fairly simple fix for 99% of cases |
(this is our oldest bug still open) |
Is there actually anyone using gzip to compress images over 4GB in size? I.e. does anyone have some example images that can be used for testing, or is this just an edge-case that we'll (probably) never hit in practice? |
I've seen some out there, although it's definitely rare. It shouldn't be hard to fix the speed + percentage issues, though. I'm happy if we treat gzip sizes with a grain of salt and eventually throw ENOSPC if there is no remaining space in the drive, like we do with bzip2. |
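As a sketch of that "grain of salt plus ENOSPC" approach (not Etcher's actual code; the paths are illustrative), the write stream's error event is where the out-of-space condition would surface:

```js
const fs = require('fs');
const zlib = require('zlib');

const source = fs.createReadStream('disk-image.img.gz'); // illustrative path
const target = fs.createWriteStream('/dev/sdX');          // illustrative device

source.pipe(zlib.createGunzip()).pipe(target);

target.on('error', (err) => {
  if (err.code === 'ENOSPC') {
    // The drive is genuinely full: the image turned out bigger than the target,
    // no matter what size the gzip header claimed.
    console.error('Drive ran out of space while writing the image.');
  } else {
    console.error('Write failed:', err);
  }
});
```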
@jhermsmeier Can you help me out with this one? Try flashing a .gz image directly with the Etcher CLI. The flash progress quickly reaches 99%, and remains there for quite some time while there's more data coming through the stream chain. Maybe we're not calculating the size correctly somewhere? |
Oh, the sizes seem to be fine. The issue is that for compressed images, the stream chain looks like this:
It looks like the decompression is very fast compared to the drive writing, and therefore the progress reaches 99% much sooner. I'm not sure what would be the solution here. In some cases we only have the uncompressed size, so we should make use of it to show the progress. Maybe there is a way to make the initial readable stream wait for the drive writes before sending more data? |
Looks like pausing/resuming the readable stream should do it. |
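A minimal sketch of that pause/resume idea with plain Node streams (the paths are illustrative; note that readable.pipe(writable) already performs this same backpressure handling internally):

```js
const fs = require('fs');

const source = fs.createReadStream('decompressed-image.img'); // illustrative
const drive = fs.createWriteStream('/dev/sdX');                // illustrative

source.on('data', (chunk) => {
  if (!drive.write(chunk)) {
    source.pause();                               // the drive's buffer is full
    drive.once('drain', () => source.resume());   // carry on once it empties
  }
});
source.on('end', () => drive.end());
```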
I investigated the slow speed issue in more detail, and it looks to be resolved. The slow speed seems to happen on larger compressed images, and also depends on the compression level. I ran various experiments with images of several sizes (from ~1 GB to ~4 GB), using compression levels 1 to 9, inclusive, and the decompression time drastically increases on larger images. In summary:
I think we can close this issue once the last one is fixed. |
@jhermsmeier Check http://linorg.usp.br/OpenELEC/OpenELEC-Generic.x86_64-6.0.3.img.gz for an image that showcases the 99% UX issue. |
Isn't that expected?!? I'd expect a streaming gzip decompressor to have a constant(ish) decompression speed, so of course a larger image will take longer to decompress.
I suggested elsewhere that, since we'll only be supporting streaming images from our online catalog, the online catalog could store the size of the uncompressed image, which would solve this problem (and of course that uncompressed-image-size figure should be automatically updated, so that it never gets out of sync with the actual compressed image). |
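For illustration, a hypothetical catalog entry along those lines (the field names are invented, not an actual schema):

```js
// Hypothetical catalog entry: publish the real uncompressed size alongside the
// download, so the client never has to trust the gzip header.
const catalogEntry = {
  name: 'Example OS 1.0',
  url: 'https://example.com/images/example-os-1.0.img.gz',
  compressedSize: 912261120,      // size of the .gz download, in bytes
  uncompressedSize: 4831838208,   // measured once, when the image was published
};
```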
The current implementation calculates the percentage based on how much was decompressed from the file. For most decompression algorithms we're using, writing to the drive is faster than decompressing, so the progress bar displays fine; however, gzip seems to be a special case, because it's quite fast. I avoided heuristics and relied on that approach given that for some compression methods (e.g. bzip2), it's impossible to get even an estimate, and we'd be really guessing in the dark. I believe that we can pause decompression if we're getting slow on writes, and that should work fine. |
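A rough sketch of a progress meter measured upstream of the drive write, as described above (illustrative only, not Etcher's actual implementation; here progress is taken from how much of the compressed file has been consumed):

```js
// Because the counter sits before the slow device write, it can race ahead
// to ~99% long before the data has actually hit the disk.
const fs = require('fs');
const zlib = require('zlib');

const imagePath = 'disk-image.img.gz';                 // illustrative
const compressedTotal = fs.statSync(imagePath).size;
let compressedRead = 0;

const source = fs.createReadStream(imagePath);
source.on('data', (chunk) => {
  compressedRead += chunk.length;
  const percentage = (compressedRead / compressedTotal) * 100;
  process.stdout.write(`\r${percentage.toFixed(1)}%`);
});

source
  .pipe(zlib.createGunzip())
  .pipe(fs.createWriteStream('/dev/sdX'));             // illustrative target
```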
> No such thing as threads in Node userland ;)

From what I remember reading about Electron when I first started working on Etcher, I thought it was designed around having separate co-operating processes?

> if this were to run in parallel, we'd be bottlenecking the CPU

I wonder if we can assign threads/processes a lower priority than the main thread? Given that we're going to have lots of I/O waits waiting for the SD card to write, there should be plenty of 'spare' CPU time-slots.

> or I/O if the source is slow

Well, I'd still expect even reading from a 'slow' disk to be faster than writing to an SD card?

> For most decompression algorithms we're using, writing to the drive is faster than decompressing

Yikes! :-( Is that because they're implemented in pure JavaScript, and JS isn't designed for manipulating binary data? How much is it slowing things down by - would it be worth trying to find native-binding equivalents? (if I'm using the right terminology)

A benchmark <http://tukaani.org/lzma/benchmarks.html> from 12 years ago (wow!) shows even the most highly-compressed bzip2 file decompressing at over 5MB/s. (A more recent benchmark <https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/> shows bzip2 decompressing at over 20MB/s.)

> however gzip seems to be a special case, because it's quite fast.

Maybe that points to gzip being implemented in C rather than JS?

> given that for some compression methods (e.g. bzip2), it's impossible to get even an estimate

Yeah, the background-decompression-thread approach I suggested above would also work for .bz2 files (although I guess it means the progress-indicator would have to switch part-way through from being a "rough estimate, based on how much of the input file has been decompressed" to "an exact figure, based on how much of the output file has been written to disk so far").

I think heuristic + counting bytes while writing is probably the less resource-intensive path.

> I believe that we can pause decompression if we're getting slow on writes, and that should work fine.

Fair enough, looks like I've been out-voted ;-) |
Andrew -- just to clarify, I said that the heuristic approach will be *less reliable* but much much simpler to write and execute, not the other way around.
|
The threads of Electron are exactly two: one for UI/browser stuff, and one for "server"/Node stuff.
|
Ahhhh-hhaaaa!!! I've just realised something significant :-D See the 29476.5MiB figure in #1171 (comment). So it's not necessarily just "gzip being quite fast" that is causing this problem, but the fact that disk images tend to have lots of empty space at the end, which compresses super-efficiently and takes up only a very small fraction of the compressed file, but still takes a very long time (much longer than the "real data" in the example in #1171) to actually write out to disk. Does that make sense? |
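To put illustrative, made-up numbers on that: suppose a 30GiB image whose last 25GiB is zero-filled free space. Those zeros might compress down to only a few MiB, so a progress meter that tracks position in the compressed input reaches ~99% after just the first few GiB of real data have been decompressed, while the drive still has tens of GiB of (mostly zero) output left to write at SD-card speed.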
Yeah, we can always create child processes and communicate through IPC, which can provide us with some sort of multi-threading. |
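A minimal sketch of that idea using Node's built-in child_process.fork() and its IPC channel (file names and message shapes are made up for illustration):

```js
// parent.js
const { fork } = require('child_process');

const worker = fork('./decompress-worker.js');           // illustrative file name
worker.send({ image: 'disk-image.img.gz' });
worker.on('message', (msg) => {
  if (msg.type === 'progress') {
    console.log(`decompressed ${msg.bytes} bytes so far`);
  }
});

// decompress-worker.js -- runs in its own process, so the heavy decompression
// work cannot block the parent's event loop.
const fs = require('fs');
const zlib = require('zlib');

process.on('message', ({ image }) => {
  let bytes = 0;
  fs.createReadStream(image)
    .pipe(zlib.createGunzip())
    .on('data', (chunk) => {
      bytes += chunk.length;
      process.send({ type: 'progress', bytes });          // IPC back to the parent
    });
});
```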
How about displaying something like a wild guess, and saying so, if the compressed size is bigger than gzip claims the output file will be? Although it is possible for a compressed file to be truly bigger than its output, that would be much better than reporting a really bogus number, like a 5GB input reporting 866MB of output! Here's an off-the-wall idea: ask the user if they know what the output size is supposed to be! I would expect in most cases they do. Interesting side note: 2 years ago lurch commented "Is there actually anyone using gzip to compress images over 4GB in size". Today, why would anyone buy an SD card smaller than 16GB! |
Dear all, As one of many potential examples: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR277/ERR277077/ Thank you for supporting this amazing utility! |
Recently also faced this issue with gzip 1.9 |
@tipuraneo It's a limitation of the gzip file format itself (the stored uncompressed size is only a 32-bit field), so it isn't tied to any particular gzip version. |
I know. So you could say gzip is not the best choice for files > 4 GB? I prefer xz over gzip. |
Taking it out from #629: apparently the gzip file format cannot accurately return the size of files above 4GB (2^32 bytes), but instead returns the size modulo 2^32.
Looks like on the command line people recommended something like
zcat file.gz | wc -c
or
gzip -dc file | wc -c
which give the correct value - though that then decompresses the file twice (once to count, once when actually flashing). We might have to do that for gzip in the end, though, since >4GB files are likely common for Etcher's use case. As it stands, this might let images start to be burned onto cards that are too small (in the worst case), or it affects the progress bar.
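For reference, the same byte count can be done from Node without shelling out, at the cost of one extra full decompression (a sketch, not Etcher's implementation; the path is illustrative):

```js
// Stream the archive through zlib and count what comes out, without writing
// the decompressed data anywhere.
const fs = require('fs');
const zlib = require('zlib');

function countUncompressedBytes(gzPath) {
  return new Promise((resolve, reject) => {
    let bytes = 0;
    fs.createReadStream(gzPath)
      .pipe(zlib.createGunzip())
      .on('data', (chunk) => { bytes += chunk.length; })
      .on('end', () => resolve(bytes))
      .on('error', reject);
  });
}

countUncompressedBytes('disk-image.img.gz').then((size) => {
  console.log(`real uncompressed size: ${size} bytes`);
});
```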
From testing with a 4100MiB > 4096MiB image, indeed the .gz version lets you select a 512MB SD card, while the same file's .xz archive does not. For the progress bar, the MB/s reading seems to be affected (it shows a very low speed, e.g. 0.01MB/s) but the progress percentage does not (it shows correctly for the burning process), so it's not too bad.