Handle & track block write / verification failures #1074

Open
jhermsmeier opened this issue Feb 1, 2017 · 10 comments

Comments

@jhermsmeier
Contributor

  • Etcher version: any
  • Operating system and architecture: any

This is a new issue to track the suggestions made in #735 (comment) to keep and expose data on blocks which failed to be written during the flashing of an image.

@jhermsmeier
Contributor Author

> and see if it's feasible to get a percentage of failed blocks somehow (then we can figure out what to do with the data).

Keeping track of a percentage should be pretty straightforward, as we only need to count failed blocks against written blocks (basically incrementing two counters, which should be trivial with the "new" block-write-streams).

It shouldn't even be too hard to keep the blocks that failed to be written, as long as it's not a high percentage failing (or we just keep the block addresses, then even that wouldn't be much of a problem until we hit really high percentages on large drives).
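
A minimal sketch of those counters in TypeScript (the hook names and the idea of storing raw byte addresses are illustrative assumptions, not the actual block-write-streams API):

```typescript
// Tally written vs. failed blocks to derive a failure percentage, and keep
// the byte addresses of failed blocks for later reporting or retries.
class BlockFailureTracker {
  private written = 0
  private failed = 0
  readonly failedAddresses: number[] = []

  onWritten(): void {
    this.written += 1
  }

  onFailed(address: number): void {
    this.failed += 1
    this.failedAddresses.push(address) // byte offset of the failed block
  }

  get failurePercentage(): number {
    const total = this.written + this.failed
    return total === 0 ? 0 : (this.failed / total) * 100
  }
}
```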

@lurch
Contributor

lurch commented Feb 1, 2017

Just to clarify: #735 wasn't talking about block-write failures (I believe @jviotti made a change so that write failures get retried something like 10 times before giving up), but about block-verify failures (which implies that we'd need to change the checksumming process).

@jviotti
Contributor

jviotti commented Feb 1, 2017

> It shouldn't even be too hard to keep the blocks that failed to be written, as long as it's not a high percentage failing (or we just keep the block addresses, then even that wouldn't be much of a problem until we hit really high percentages on large drives).

Maybe keep a precise count up to a certain limit, and if that limit is surpassed, just keep and report a percentage (e.g. 25% failed)? If a lot of blocks failed, detailed information is not very useful anymore.
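
A sketch of that fallback (the cap value is an arbitrary illustrative choice):

```typescript
// Report exact addresses while the failure list is small, otherwise collapse
// the report to a percentage only.
const DETAIL_LIMIT = 4096 // arbitrary cap on stored addresses

function summariseFailures(
  failedAddresses: number[],
  totalBlocks: number
): { percentage: number; addresses?: number[] } {
  const percentage =
    totalBlocks === 0 ? 0 : (failedAddresses.length / totalBlocks) * 100
  return failedAddresses.length <= DETAIL_LIMIT
    ? { percentage, addresses: failedAddresses }
    : { percentage } // too many failures; per-block detail adds little
}
```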

@jviotti
Contributor

jviotti commented Feb 1, 2017

> but about block-verify failures (which implies that we'd need to change the checksumming process).

Yeah, exactly, that is the challenge of this feature.

@jhermsmeier
Contributor Author

jhermsmeier commented Feb 1, 2017

> but about block-verify failures (which implies that we'd need to change the checksumming process).

Oh, sorry – I somehow missed that. Well, if we want to track which blocks didn't check out while verifying, we will need to use entirely different hashing mechanisms like Merkle trees (used in Bittorrent, IPFS, and other filesystems) or rolling hashes, such as rabin fingerprinting (used by LBFS, the dat project, and probably a better choice for what we're thinking about here). CRC32 (which is prone to collisions), MD5, and the SHA-family (or similar) won't do us much good in this case.

I'm starting to think it could actually make sense to drop the full disk CRC / MD5 / SHA / etc checksumming entirely, and only verify the source image with those (i.e. if a file of the same basename, but a .md5 extension is present) – possibly, but not necessarily before even starting to flash the image, so we know that the source is OK.

Then we could calculate rabin fingerprints of the block-stream while writing, and verify the flashed device with those afterwards – that would give us the ability to determine which blocks exactly were corrupted, compute a percentage, etc, etc.
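
A rough sketch of that write-then-verify flow, with Node's built-in SHA-256 standing in for Rabin fingerprints purely to keep the example self-contained (the block size and the read-back step are simplified assumptions):

```typescript
// Record a digest for every block as it flows through the write pipeline,
// then compare against digests of the blocks read back from the device.
import { createHash } from 'crypto'
import { Transform, TransformCallback } from 'stream'

const BLOCK_SIZE = 262144 // 256 KB, an illustrative choice

class FingerprintStream extends Transform {
  readonly digests: Buffer[] = []

  _transform(block: Buffer, _enc: BufferEncoding, done: TransformCallback): void {
    this.digests.push(createHash('sha256').update(block).digest())
    done(null, block) // pass the block through to the device writer unchanged
  }
}

// After flashing: readBack holds the same-sized blocks read from the device.
function findCorruptBlockAddresses(expected: Buffer[], readBack: Buffer[]): number[] {
  const corrupt: number[] = []
  expected.forEach((digest, index) => {
    const actual = createHash('sha256').update(readBack[index]).digest()
    if (!digest.equals(actual)) {
      corrupt.push(index * BLOCK_SIZE) // byte address of the corrupted block
    }
  })
  return corrupt
}
```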

Following that, we'd basically have some more options:

  1. We could attempt to write only the failed blocks to the target device again, followed by a re-check of those – and if it succeeds, everything should be fine (see the sketch after this list)
  2. We give up and give the user information about what failed
  3. A combination of 1. and 2.: we give the user info about the failures, and ask whether they want to retry writing those blocks
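
As an illustration of option 1, a sketch that re-reads each failed block from the source image and rewrites it at the same offset on the target device (paths, error handling, and the follow-up re-verification are simplified assumptions, not Etcher's actual flow):

```typescript
// Rewrite only the blocks that failed, seeking to each failed byte address.
import { closeSync, openSync, readSync, writeSync } from 'fs'

function rewriteFailedBlocks(
  imagePath: string,
  devicePath: string,
  failedAddresses: number[],
  blockSize: number
): void {
  const image = openSync(imagePath, 'r')
  const device = openSync(devicePath, 'r+')
  const block = Buffer.alloc(blockSize)

  try {
    for (const address of failedAddresses) {
      readSync(image, block, 0, blockSize, address)   // re-read from the source
      writeSync(device, block, 0, blockSize, address) // rewrite on the target
      // ...re-verify this block here before declaring success
    }
  } finally {
    closeSync(image)
    closeSync(device)
  }
}
```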

@lurch
Contributor

lurch commented Feb 1, 2017

But presumably any rolling-hash or fingerprinting scheme would have to be a tradeoff between the blocksize used and the memory used to store the whole result for a potentially multi-gigabyte disk image, which might have been streamed from the internet?

Pinging @petrosagg as he might want to join in the continuation of the conversation from #735

@jhermsmeier
Contributor Author

jhermsmeier commented Feb 1, 2017

I think the memory requirements should be low enough to just keep the hashes in memory – except for the smallest block sizes (which would be terribly inefficient to write anyway):

| image | image size | block size | block count | hash length | memory required |
| --- | --- | --- | --- | --- | --- |
| Raspbian Jessie | 4371513344 B (4.07 GB) | 262144 B (256 KB) | 16676 | 32 B | 533632 B (~521 KB) |
| Raspbian Jessie | 4371513344 B (4.07 GB) | 512 B (0.5 KB) | 8538112 | 32 B | 273219584 B (~260.6 MB) |
| Random 10 GB | 10737418240 B (10 GB) | 262144 B (256 KB) | 40960 | 32 B | 1310720 B (1.25 MB) |
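
For reference, the figures above follow from block count = image size / block size and memory = block count × hash length (32 B here, e.g. a SHA-256 digest):

```typescript
// Reproduces the memory figures in the table above.
function hashMemoryRequired(imageSize: number, blockSize: number, hashLength = 32): number {
  const blockCount = Math.ceil(imageSize / blockSize)
  return blockCount * hashLength
}

hashMemoryRequired(4371513344, 262144)  // 16676 blocks * 32 B = 533632 B (~521 KB)
hashMemoryRequired(4371513344, 512)     // 8538112 blocks * 32 B = 273219584 B (~260.6 MB)
hashMemoryRequired(10737418240, 262144) // 40960 blocks * 32 B = 1310720 B (1.25 MB)
```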

jhermsmeier changed the title from "Handle & track block write failures" to "Handle & track block write / verification failures" on Feb 1, 2017
@jviotti
Contributor

jviotti commented Feb 1, 2017

So we need to support MD5 and other common checksum algorithms for the downloading phase in case of images we know about that are hosted in the cloud.

When the user attempts to stream an image with extended information, we calculate the checksum type they specify as part of the download phase and compare it with the value they've given us once the download completes.

In the meantime, etcher-image-write can calculate another type of checksum that fits this purpose (e.g. a rolling hash) and recalculate that same checksum type from the drive itself.

Another way to go would be what Tizen already provides. Their XML file contains checksums (usually sha1 or sha256) for every block range. So keeping that in mind we can calculate the checksum of every X blocks and store it as we go, then read back, recalculate, and compare (so there's no need for rolling hashes).

Of course this means that we're not doing per-block checksums (otherwise I guess it'd be wasteful, although I'd like to see some numbers), so we can't be that precise in our results. As far as I remember the block size we use in etcher-image-write is 1 MB, so maybe hashing every MB is not that bad (especially if we choose a fast algorithm).
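
A sketch of that range-level checksumming, hashing every 1 MB of the stream as it goes past (SHA-256 stands in for whatever fast algorithm would actually be chosen, and the chunk size simply matches the 1 MB block size mentioned above):

```typescript
// Accumulate a digest per fixed-size range of the write stream, regardless of
// how the incoming buffers are chunked.
import { createHash, Hash } from 'crypto'

const CHUNK_SIZE = 1024 * 1024 // 1 MB ranges

class RangeChecksummer {
  readonly digests: Buffer[] = []
  private hash: Hash = createHash('sha256')
  private bytesInChunk = 0

  update(data: Buffer): void {
    let offset = 0
    while (offset < data.length) {
      const take = Math.min(CHUNK_SIZE - this.bytesInChunk, data.length - offset)
      this.hash.update(data.subarray(offset, offset + take))
      this.bytesInChunk += take
      offset += take
      if (this.bytesInChunk === CHUNK_SIZE) {
        this.digests.push(this.hash.digest())
        this.hash = createHash('sha256')
        this.bytesInChunk = 0
      }
    }
  }

  finish(): Buffer[] {
    if (this.bytesInChunk > 0) {
      this.digests.push(this.hash.digest()) // partial final range
    }
    return this.digests
  }
}
```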

@jhermsmeier
Contributor Author

> Another way to go would be what Tizen already provides. Their XML file contains checksums (usually sha1 or sha256) for every block range. So keeping that in mind we can calculate the checksum of every X blocks and store it as we go, then read back, recalculate, and compare (so there's no need for rolling hashes).

Indeed, I just added an issue regarding that to balena-io-modules/blockmap#6

> So we need to support MD5 and other common checksum algorithms for the downloading phase in case of images we know about that are hosted in the cloud.

Yup, I was only suggesting dropping them for the verification step, when reading back from the flashed device, but still verifying the source with them, if that makes sense?

> Of course this means that we're not doing per-block checksums (otherwise I guess it'd be wasteful, although I'd like to see some numbers), so we can't be that precise in our results. As far as I remember the block size we use in etcher-image-write is 1 MB, so maybe hashing every MB is not that bad (especially if we choose a fast algorithm).

I think we could still do per-block rolling hashes when using bmaps (in addition to checking the bmap region checksums), as some mapped regions might be quite large, and only having to rewrite a few blocks is probably plenty faster than having to rewrite an entire mapped region.

@jhermsmeier
Contributor Author

I've been talking nonsense here, I realised:

> CRC32 (which is prone to collisions), MD5, and the SHA-family (or similar) won't do us much good in this case.

Looking at the above table, we can just as well hash every block with an MD5 or SHA or whatever floats our boat and keep it in memory for the block sizes we use. Don't know why my mind was going to those complicated places with this before.

jviotti modified the milestone: v2.0 on Mar 17, 2017