[rdiff-backup-users] Re: more info on 25gig files

From: Donovan Baarda
Subject: [rdiff-backup-users] Re: more info on 25gig files
Date: Fri, 01 Jul 2005 19:12:59 -0700

On Fri, 2005-07-01 at 05:44, address@hidden wrote:
> Hi,
> On 01/07/05 02:31, Donovan Baarda wrote:
> > The number of pairs for n blocks is n*n/2, or 2^43, so that's a one in
> > 2^21 chance....
> You're thinking about colliding block signatures in a single file; I was
> thinking about syncing two files that look independently random-looking
> (e.g., compressed and/or encrypted). Then you have n blocks for the new
> file times n blocks for the old file, so no factor of two.

Ahh, you missed the worst bit... the new file is rolled through, checking
blocks at each byte boundary, so you don't have n blocks for the new
file, you have z candidate blocks where z is the size of the file in
bytes, ie 2^32, not 2^22.
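
To put rough numbers on it (my arithmetic, assuming a 4GB file, 1KB blocks and a 64-bit truncated blocksum; librsync's exact parameters may differ):

```python
from math import log2

file_size  = 2**32   # assumed 4GB file
block_size = 2**10   # assumed 1KB blocks
hash_bits  = 64      # assumed width of the truncated blocksum

n_blocks = file_size // block_size       # 2^22 blocks in the old file
pairs_block  = n_blocks * n_blocks // 2  # block-vs-block pairs: 2^43
pairs_rolled = file_size * n_blocks      # every byte offset vs every block: 2^54

p_block  = pairs_block  / 2**hash_bits   # ~2^-21
p_rolled = pairs_rolled / 2**hash_bits   # ~2^-10, ie about one in a thousand

print(f"rolled pairs: 2^{log2(pairs_rolled):.0f}, "
      f"collision odds about 1 in 2^{-log2(p_rolled):.0f}")
```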

> > ie if you have 2 million 4GB files, you are likely to
> > get a signature md4sum collision in one. That might not be comfortable,
> > but it is probably not disturbing...
> Then it seems our disturbance thresholds are different...
> To put it another way, if you have 1000 users running a daily 4GB sync
> over 3 years, it's likely that at least one of them will see a
> corruption. For increased effect, change 4GB to the subject of this thread.

Yeah... it's still pretty bad...

> > In practice it is not quite that bad... as each match effectively skips
> > over the matching block's bytes, so it is really one block per
> > non-matching byte... ie the critical thing is not the file size, it's
> > the number of changed bytes. In the case of a large database file with a
> > small number of changes, you are usually in the clear.
> That's because of the heuristic that considers first the next block in
> the old file, right? Yes, I guess this does improve things on most
> workloads.

No... not that... the fact that when you do find a match, you don't
attempt to find more matches for blocks at every byte offset within that
matched block; you skip straight past it.

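Roughly, the matching loop does this (a toy sketch, not librsync's actual code; weak_sum here is a stand-in for the real rollsum):

```python
def weak_sum(data):
    # Toy stand-in for the rolling checksum; the real rollsum differs.
    return sum(data) & 0xFFFF

def roll_match(new_data, block_size, sigs):
    # Simplified delta matching: sigs maps weak_sum -> old block index.
    # The point: a match advances by a whole block, so only bytes that
    # don't match cost a per-byte lookup.
    i, ops = 0, []
    while i + block_size <= len(new_data):
        s = weak_sum(new_data[i:i + block_size])
        if s in sigs:
            ops.append(("copy", sigs[s]))
            i += block_size        # skip the matched block entirely
        else:
            ops.append(("literal", new_data[i:i + 1]))
            i += 1                 # roll forward one byte
    if i < len(new_data):
        ops.append(("literal", new_data[i:]))
    return ops
```
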
The "favor next block" behaviour is something that would help avoid
problems when you have blocksum collisions in a signature, but it's not
worth relying on. I'm also not sure that it's actually implemented....

> >>As for the rollsum, I don't trust it even for non-malicious data. It's
> >>too easy to believe it will be fooled by some natural structure in the
> >>data. And as for malicious settings, well, last time we considered those
> > 
> > It has got to be worth something... even if it only counts for 4 bits,
> > it still gives you 16 x the confidence.
> ACK.

What? You don't think 1 in 16 million is better than 1 in 1 million?

> > You may be able to leave handling of signature collisions up to the
> > application by simply reporting them and giving the application the
> > opportunity to confirm and restart or resume the signature calculation.
> That's a strange question to ask the application... And often
> unanswerable (e.g., for unseekable input).

Yeah, but it's up to the application to decide whether it's unanswerable.

> If you don't check those collision within librsync (as above), maybe
> it's better to just keep assuming that colliding blocks are identical,
> go through with the transfer, and then check the whole-file checksum. If
> it doesn't match, tell the application to redo everything with a new
> random seed (and perhaps larger hashes, like rsync does automatically).

The point is, if you get a collision in the signature, you may be able
to detect it and recover at signature generation time. For certain
applications this might be much better than finding out months later
when you try to apply the delta.
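
Something like this sketch (seeded md5 standing in for librsync's seeded md4; the SignatureCollision exception is a hypothetical API, nothing librsync has today):

```python
import hashlib

class SignatureCollision(Exception):
    """Two different blocks produced the same truncated blocksum."""

def make_signature(blocks, seed):
    # Detect colliding blocksums while generating the signature, so the
    # application can recover now rather than when applying the delta.
    seen, sig = {}, []
    for idx, block in enumerate(blocks):
        h = hashlib.md5(seed + block).digest()[:8]   # truncated blocksum
        if h in seen and blocks[seen[h]] != block:
            raise SignatureCollision(f"blocks {seen[h]} and {idx} collide")
        seen[h] = idx
        sig.append(h)
    return sig
```

On SignatureCollision the application would restart (or resume) the signature calculation with a fresh random seed.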

> >>>The disturbing part is that any corruption that does occur is silently
> >>>ignored.
> >>
> >>Yes. Do you think the integrity testing should be done by librsync, or
> >>by every client app (with a prominent notice that they should)?
> > 
> > I'm not sure... I think it makes sense to put it into the signature.
> > Whether that is done by librsync or rdiff, I don't know. In either case,
> > it will require a change to the signature file.
> > 
> > If we are going to change the signature file, we may as well do it
> > right, adding both a whole-file checksum and a variable seed.
> Sounds good, especially the variable seed part. :-)
> Since that forces a change in librsync, maybe use the opportunity to
> change from MD4 to a less broken hash? Just a quick check on my 2GHz
> Athlon XP:
> * "rdiff signature" on large cached file:  109MB/sec
> * OpenSSL MD5, 1KB blocks:                 246MB/sec
>                8KB blocks:                 326MB/sec
> * OpenSSL SHA1, 1KB blocks:                148MB/sec
>                 8KB blocks:                174MB/sec

I'm unconvinced that a different signature is worth it. You are not
comparing apples with apples here; the rdiff signature includes rollsums
etc, and probably a crappy md4sum implementation.

All the arguments for md5 or sha1 vs md4 revolve around extra protection
against malicious attacks that craft hash collisions. In the case of
librsync, the biggest threat from malicious attacks does not revolve
around vulnerabilities in md4sum, but from the fixed seed. Changing to a
variable seed will solve these, and I don't think another hash would do
anything other than chew more CPU.

Though if someone can demonstrate that a different hash has a
significantly better distribution over random data than md4, then maybe
it would be worth considering (ie, it would avoid accidental collisions
better).

> So upgrading to MD5 is almost for free, and SHA1 won't slow things down
> noticeably either. With randomized IV, both (but especially SHA1) should
> be OK in this context.
> BTW, a strong full-file hash would have its own computational cost: both
>  the sender and the receiver would need to compute it in addition to the
> per-block hashes. In principle, for mostly-unchanged files this cost can
> be made negligible by replacing the full-file hash by a "meta-hash": as
> you transmit or receive, note the hash of the data represented by each
> packet, and compute the meta-hash as the hash of all these hashes. For
> literal data packets, the packet's hash is just the hash of its literal
> content. But for a copy-block-from-old-file packet, the sender takes the
> packet hash to be the (untruncated) hash of the *new* block, while the
> receiver takes the packet hash to be the (untruncated) hash of the
> copied *old* block -- both of these were computed anyway by the
> respective parties! One catch is that, in the librsync framework, during
> the delta stage the receiver no longer remembers the untruncated block
> hashes computed at the signature stage, and if he needs to recompute them
> then he doesn't get any benefit compared to plain full-file hashes. But
> the sender (which is already doing the heavy lifting) does get the
> meta-hash essentially for free.

Hmm. I'll have to think about this. My gut feeling is an additional
whole file hash calculated alongside the blocksum hashes would not be a
significant cost... particularly compared to the rollsum+hashtable
lookups at every byte.

Also, the "metahash" hash of all the blocksums worries me... it might
not be very reliable.
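
If I've read the proposal right, the metahash amounts to something like this (sha1 just as an example hash; the packet list is a hypothetical decoded delta stream):

```python
import hashlib

def meta_hash(packets):
    # packets: ("literal", data) or ("copy", block_hash), where block_hash
    # is the untruncated hash of the block, which each side has already
    # computed during the normal sync.
    m = hashlib.sha1()
    for kind, payload in packets:
        if kind == "literal":
            m.update(hashlib.sha1(payload).digest())
        else:  # "copy": feed in the block hash we already have
            m.update(payload)
    return m.hexdigest()
```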

Donovan Baarda <address@hidden>
