[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[rdiff-backup-users] Re: more info on 25gig files

From: Donovan Baarda
Subject: [rdiff-backup-users] Re: more info on 25gig files
Date: Thu, 30 Jun 2005 16:31:41 -0700

On Thu, 2005-06-30 at 13:13, address@hidden wrote:
> On 30/06/05 22:46, Donovan Baarda wrote:
> > Actually, librsync uses 8 byte strong checksums, which combined with the
> > 4 byte rollsum, gives you 12 bytes of checksum for each block (but you
> > can only really count on 8 bytes for maliciously crafted data). This
> > makes it comfortingly negligable unless you have huge files and a
> > stupidly small blocksize.
> Sorry, I meant 8. But as for huge and stupid, how about, say, a 4GB file
> and 1K block size? Doesn't seem unreasonable if you're backing up a
> large database with many small local changes. That gives you 2^22
> blocks, hence 2^44 pairs of blocks, hence (for random data) a
> one-in-a-million chance of a corruption.

The number of pairs for n blocks is n*n/2, or 2^43, so that's a one in
2^21 chance.... ie if you have 2 million 4GB files, you are likely to
get a signature md4sum collision in one. That might not be comfortable,
but it is probably not disturbing...

Note that the probability of a corruption is not just the probability of
getting a blocksum collision in the signature. The worst part is the
rollsum walks through the file, effectively trying a block at each
byte-boundary, so on a 4G file, it tries 4G blocks and matches them
against 2^22 1K blocksums, for 2^22 * 2^32 = 2^54 comparisons. This is a
one in a thousand chance of an md4sum collision... much more scary.

In practice it is not quite that bad... as each match effectively skips
over the matching block's bytes, so it is really one block per
non-matching byte... ie the critical thing is not the file size, it's
the number of changed bytes. In the case of a large database file with a
small number of changes, you are usually in the clear.

> As for the rollsum, I don't trust it even for non-malicious data. It's
> too easy to believe it will be fooled by some natural structure in the
> data. And as for malicious settings, well, last time we considered those

It has got to be worth something... even if it only counts for 4 bits,
it still gives you 16 x the confidence.

> we saw corruption can occur with probability 1...

Yeah, it all goes out the window with malicious data... the only fix for
this is a variable seed for the signature... probably random is best.

Note that adding a variable seed allows you to correct a blocksum
collision in the signature... if you detect one, just re-generate the
signature with a different seed.

Note that detecting and correcting this would require seek()
functionality on the input. You can't differentiate a matching blocksum
from a genuine collision or identical block without comparing the actual
block data.

You may be able to leave handling of signature collisions up to the
application by simply reporting them and giving the application the
oppertunity to confirm and restart or resume the signature calculation.

> > The disturbing part is that any corruption that does occur is silently
> > ignored.
> Yes. Do you think the integrity testing should be done by librsync, or
> by every client app (with a prominent notice that they should)?

I'm not sure... I think it makes sense to put it into the signature.
Whether that is done by librsync or rdiff, I don't know. In either case,
it will require a change to the signature file.

If we are going to change the signature file, we may as well do it
right, adding both a whole-file checksum and a variable seed.

I'm now working at google, and am seriously considering making librsync
my 20% project... hopefully this means I will do some serious work on it

Donovan Baarda <address@hidden>

reply via email to

[Prev in Thread] Current Thread [Next in Thread]