Handle & track block write / verification failures #1074

Open
jhermsmeier opened this issue Feb 1, 2017 · 10 comments

Comments

@jhermsmeier
Contributor

  • Etcher version: any
  • Operating system and architecture: any

This is a new issue to track the suggestions made in #735 (comment) to keep and expose data on blocks which failed to be written during the flashing of an image.

@jhermsmeier
Contributor Author

> and see if it's feasible to get a percentage of failed blocks somehow (then we can figure out what to do with the data).

Keeping track of a percentage should be pretty straightforward, as we only need to count failed blocks against written blocks (basically incrementing two counters, which should be trivial with the "new" block-write-streams).

It shouldn't even be too hard to keep the blocks that failed to be written, as long as it's not a high percentage failing (or we just keep the block addresses, then even that wouldn't be much of a problem until we hit really high percentages on large drives).
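
A minimal sketch of those counters in TypeScript (the hook names and the idea of storing raw byte addresses are illustrative assumptions, not the actual block-write-streams API):

```typescript
// Tally written vs. failed blocks to derive a failure percentage, and keep
// the byte addresses of failed blocks for later reporting or retries.
class BlockFailureTracker {
  private written = 0
  private failed = 0
  readonly failedAddresses: number[] = []

  onWritten(): void {
    this.written += 1
  }

  onFailed(address: number): void {
    this.failed += 1
    this.failedAddresses.push(address) // byte offset of the failed block
  }

  get failurePercentage(): number {
    const total = this.written + this.failed
    return total === 0 ? 0 : (this.failed / total) * 100
  }
}
```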

@lurch
Contributor

lurch commented Feb 1, 2017

Just to clarify: #735 wasn't talking about block-write failures (I believe @jviotti made a change so that write failures get retried something like 10 times before giving up), but about block-verify failures (which implies that we'd need to change the checksumming process).

@jviotti
Contributor

jviotti commented Feb 1, 2017

> It shouldn't even be too hard to keep the blocks that failed to be written, as long as it's not a high percentage failing (or we just keep the block addresses, then even that wouldn't be much of a problem until we hit really high percentages on large drives).

Maybe keep a precise count up to a certain limit, and if that limit is surpassed, just keep and report a percentage (e.g. 25% failed)? If a lot of blocks failed, detailed information is not very useful anymore.
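
A sketch of that fallback (the cap value is an arbitrary illustrative choice):

```typescript
// Report exact addresses while the failure list is small, otherwise collapse
// the report to a percentage only.
const DETAIL_LIMIT = 4096 // arbitrary cap on stored addresses

function summariseFailures(
  failedAddresses: number[],
  totalBlocks: number
): { percentage: number; addresses?: number[] } {
  const percentage =
    totalBlocks === 0 ? 0 : (failedAddresses.length / totalBlocks) * 100
  return failedAddresses.length <= DETAIL_LIMIT
    ? { percentage, addresses: failedAddresses }
    : { percentage } // too many failures; per-block detail adds little
}
```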

@jviotti
Contributor

jviotti commented Feb 1, 2017

> but about block-verify failures (which implies that we'd need to change the checksumming process).

Yeah, exactly, that is the challenge of this feature.

@jhermsmeier
Contributor Author

jhermsmeier commented Feb 1, 2017

> but about block-verify failures (which implies that we'd need to change the checksumming process).

Oh, sorry – I somehow missed that. Well, if we want to track which blocks didn't check out while verifying, we will need to use entirely different hashing mechanisms like Merkle trees (used in Bittorrent, IPFS, and other filesystems) or rolling hashes, such as rabin fingerprinting (used by LBFS, the dat project, and probably a better choice for what we're thinking about here). CRC32 (which is prone to collisions), MD5, and the SHA-family (or similar) won't do us much good in this case.

I'm starting to think it could actually make sense to drop the full disk CRC / MD5 / SHA / etc checksumming entirely, and only verify the source image with those (i.e. if a file of the same basename, but a .md5 extension is present) – possibly, but not necessarily before even starting to flash the image, so we know that the source is OK.

Then we could calculate rabin fingerprints of the block-stream while writing, and verify the flashed device with those afterwards – that would give us the ability to determine which blocks exactly were corrupted, compute a percentage, etc, etc.
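
A rough sketch of that write-then-verify flow, with Node's built-in SHA-256 standing in for Rabin fingerprints purely to keep the example self-contained (the block size and the read-back step are simplified assumptions):

```typescript
// Record a digest for every block as it flows through the write pipeline,
// then compare against digests of the blocks read back from the device.
import { createHash } from 'crypto'
import { Transform, TransformCallback } from 'stream'

const BLOCK_SIZE = 262144 // 256 KB, an illustrative choice

class FingerprintStream extends Transform {
  readonly digests: Buffer[] = []

  _transform(block: Buffer, _enc: BufferEncoding, done: TransformCallback): void {
    this.digests.push(createHash('sha256').update(block).digest())
    done(null, block) // pass the block through to the device writer unchanged
  }
}

// After flashing: readBack holds the same-sized blocks read from the device.
function findCorruptBlockAddresses(expected: Buffer[], readBack: Buffer[]): number[] {
  const corrupt: number[] = []
  expected.forEach((digest, index) => {
    const actual = createHash('sha256').update(readBack[index]).digest()
    if (!digest.equals(actual)) {
      corrupt.push(index * BLOCK_SIZE) // byte address of the corrupted block
    }
  })
  return corrupt
}
```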

Following that, we'd basically have some more options:

  1. We could attempt to write only the failed blocks to the target device again, followed by a re-check of those – and if it succeeds, everything should be fine (see the sketch after this list)
  2. We give up and give the user information about what failed
  3. A combination of 1. and 2.: we give the user info about the failures, and ask whether they want to retry writing those blocks
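
As an illustration of option 1, a sketch that re-reads each failed block from the source image and rewrites it at the same offset on the target device (paths, error handling, and the follow-up re-verification are simplified assumptions, not Etcher's actual flow):

```typescript
// Rewrite only the blocks that failed, seeking to each failed byte address.
import { closeSync, openSync, readSync, writeSync } from 'fs'

function rewriteFailedBlocks(
  imagePath: string,
  devicePath: string,
  failedAddresses: number[],
  blockSize: number
): void {
  const image = openSync(imagePath, 'r')
  const device = openSync(devicePath, 'r+')
  const block = Buffer.alloc(blockSize)

  try {
    for (const address of failedAddresses) {
      readSync(image, block, 0, blockSize, address)   // re-read from the source
      writeSync(device, block, 0, blockSize, address) // rewrite on the target
      // ...re-verify this block here before declaring success
    }
  } finally {
    closeSync(image)
    closeSync(device)
  }
}
```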

@lurch
Contributor

lurch commented Feb 1, 2017

But presumably any rolling-hash or fingerprinting scheme would have to be a tradeoff between the blocksize used and the memory used to store the whole result for a potentially multi-gigabyte disk image, which might have been streamed from the internet?

Pinging @petrosagg as he might want to join in the continuation of the conversation from #735

@jhermsmeier
Contributor Author

jhermsmeier commented Feb 1, 2017

I think the memory requirements should be low enough to just keep the hashes in memory – except for the smallest block sizes (which would be terribly inefficient to write anyway):

| image | image size | block size | block count | hash length | memory required |
| --- | --- | --- | --- | --- | --- |
| Raspbian Jessie | 4371513344 B (4.07 GB) | 262144 B (256 KB) | 16676 | 32 B | 533632 B (~521 KB) |
| Raspbian Jessie | 4371513344 B (4.07 GB) | 512 B (0.5 KB) | 8538112 | 32 B | 273219584 B (~260.6 MB) |
| Random 10 GB | 10737418240 B (10 GB) | 262144 B (256 KB) | 40960 | 32 B | 1310720 B (1.25 MB) |
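
For reference, the figures above follow from block count = image size / block size and memory = block count × hash length (32 B here, e.g. a SHA-256 digest):

```typescript
// Reproduces the memory figures in the table above.
function hashMemoryRequired(imageSize: number, blockSize: number, hashLength = 32): number {
  const blockCount = Math.ceil(imageSize / blockSize)
  return blockCount * hashLength
}

hashMemoryRequired(4371513344, 262144)  // 16676 blocks * 32 B = 533632 B (~521 KB)
hashMemoryRequired(4371513344, 512)     // 8538112 blocks * 32 B = 273219584 B (~260.6 MB)
hashMemoryRequired(10737418240, 262144) // 40960 blocks * 32 B = 1310720 B (1.25 MB)
```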

jhermsmeier changed the title from "Handle & track block write failures" to "Handle & track block write / verification failures" on Feb 1, 2017
@jviotti
Contributor

jviotti commented Feb 1, 2017

So we need to support MD5 and other common checksum algorithms for the downloading phase in case of images we know about that are hosted in the cloud.

When the user attempts to stream an image with extended information, we calculate the checksum type they specify as part of the download phase and compare it with the value they've given us once the download completes.

In the meantime, etcher-image-write can calculate another type of checksum that fits this purpose (e.g. a rolling hash) and recalculate that same checksum type from the drive itself.

Another way to go would be what Tizen already provides. Their XML file contains checksums (usually sha1 or sha256) for every block range. So keeping that in mind we can calculate the checksum of every X blocks and store it as we go, then read back, recalculate, and compare (so there's no need for rolling hashes).

Of course this means that we're not doing per-block checksums (otherwise I guess it'd be wasteful, although I'd like to see some numbers), so we can't be that precise in our results. As far as I remember the block size we use in etcher-image-write is 1 MB, so maybe hashing every MB is not that bad (especially if we choose a fast algorithm).
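
A sketch of that range-level checksumming, hashing every 1 MB of the stream as it goes past (SHA-256 stands in for whatever fast algorithm would actually be chosen, and the chunk size simply matches the 1 MB block size mentioned above):

```typescript
// Accumulate a digest per fixed-size range of the write stream, regardless of
// how the incoming buffers are chunked.
import { createHash, Hash } from 'crypto'

const CHUNK_SIZE = 1024 * 1024 // 1 MB ranges

class RangeChecksummer {
  readonly digests: Buffer[] = []
  private hash: Hash = createHash('sha256')
  private bytesInChunk = 0

  update(data: Buffer): void {
    let offset = 0
    while (offset < data.length) {
      const take = Math.min(CHUNK_SIZE - this.bytesInChunk, data.length - offset)
      this.hash.update(data.subarray(offset, offset + take))
      this.bytesInChunk += take
      offset += take
      if (this.bytesInChunk === CHUNK_SIZE) {
        this.digests.push(this.hash.digest())
        this.hash = createHash('sha256')
        this.bytesInChunk = 0
      }
    }
  }

  finish(): Buffer[] {
    if (this.bytesInChunk > 0) {
      this.digests.push(this.hash.digest()) // partial final range
    }
    return this.digests
  }
}
```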

@jhermsmeier
Contributor Author

> Another way to go would be what Tizen already provides. Their XML file contains checksums (usually sha1 or sha256) for every block range. So keeping that in mind we can calculate the checksum of every X blocks and store it as we go, then read back, recalculate, and compare (so there's no need for rolling hashes).

Indeed, I just added an issue regarding that to balena-io-modules/blockmap#6

> So we need to support MD5 and other common checksum algorithms for the downloading phase in case of images we know about that are hosted in the cloud.

Yup, I was only suggesting dropping them for the verification step, when reading back from the flashed device, but still verifying the source with them, if that makes sense?

> Of course this means that we're not doing per-block checksums (otherwise I guess it'd be wasteful, although I'd like to see some numbers), so we can't be that precise in our results. As far as I remember the block size we use in etcher-image-write is 1 MB, so maybe hashing every MB is not that bad (especially if we choose a fast algorithm).

I think we could still do per-block rolling hashes when using bmaps (in addition to checking the bmap region checksums), as some mapped regions might be quite large, and only having to rewrite a few blocks is probably plenty faster than having to rewrite an entire mapped region.

@jhermsmeier
Contributor Author

I've been talking nonsense here, I realised:

> CRC32 (which is prone to collisions), MD5, and the SHA-family (or similar) won't do us much good in this case.

Looking at the above table, we can just as well hash every block with an MD5 or SHA or whatever floats our boat and keep it in memory for the block sizes we use. Don't know why my mind was going to those complicated places with this before.

jviotti modified the milestone: v2.0 on Mar 17, 2017