[rdiff-backup-users] Re: more info on 25gig files

From: rsync2eran
Subject: [rdiff-backup-users] Re: more info on 25gig files
Date: Mon, 04 Jul 2005 15:06:07 -0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050513 Thunderbird/1.0.2 Fedora/1.0.2-6 Mnenhy/


Donovan Baarda wrote:

> Ahh, you missed the worst bit... the new file is rolled through checking
> blocks at each byte boundary, so you don't have n blocks for the new
> file, you have z blocks where z is the size of the file in bytes, ie
> 2^32, not 2^22.

Yes, I had this in mind:

>>>In practice it is not quite that bad... as each match effectively skips
>>>over the matching block's bytes, so it is really one block per
>>>non-matching byte...

But this is not consistent with the assumption that the files are not
similar (e.g., compressed or encrypted). So for those files the
situation is not merely disturbing, but rather frightening: a
one-in-a-thousand chance of corruption for the 4GB/1K parameters.

>>>ie the critical thing is not the file size, it's
>>>the number of changed bytes. In the case of a large database file with a
>>>small number of changes, you are usually in the clear.

Yes, for "usually" being lower-bounded by the 1/2^21 chance of an
intra-file collision (for the 4GB/1K parameters).
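Both figures above can be checked with back-of-envelope arithmetic. This is a hedged sketch assuming a 64-bit strong block checksum and the thread's "4GB/1K" parameters (2^32 bytes, 2^22 blocks); the exact checksum width librsync truncates to is an assumption here:

```python
# Hypothetical parameters matching the thread's "4GB/1K" discussion.
positions = 2**32   # rolling match is attempted at every byte offset
blocks    = 2**22   # number of 1 KB signature blocks in a 4 GB file
hash_bits = 64      # assumed strong-checksum width

# Worst case (dissimilar data, e.g. compressed/encrypted, so no block
# ever matches): every byte position is checked against the block table.
p_corrupt = positions * blocks / 2**hash_bits
# 2**54 / 2**64 = 2**-10, i.e. roughly one in a thousand

# Intra-file (birthday) collision among the block checksums themselves:
p_intra = blocks * (blocks - 1) // 2 / 2**hash_bits
# about 2**-21
```

The first figure reproduces the one-in-a-thousand estimate; the second the 1/2^21 lower bound.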

>>>It has got to be worth something... even if it only counts for 4 bits,
>>>it still gives you 16 x the confidence.
> what? you don't think 1 in 16 million is better than 1 in 1 million ?

Oh, sure, I acknowledged that it's 16 times better!

>>>You may be able to leave handling of signature collisions up to the
>>>application by simply reporting them and giving the application the
>>>oppertunity to confirm and restart or resume the signature calculation.
>>That's a strange question to ask the application... And often
>>unanswerable (e.g., for unseekable input).
> Yeah, but it's up to the application to decide if it's un-answerable. 

OK, just make sure things keep working reliably if the application can't
answer.

>>If you don't check those collision within librsync (as above), maybe
>>it's better to just keep assuming that colliding blocks are identical,
>>go through with the transfer, and then check the whole-file checksum. If
>>it doesn't match, tell the application to redo everything with a new
>>random seed (and perhaps larger hashes, like rsync does automatically).
> The point is, if you get a collision in the signature, you may be able
> to detect it and recover at signature generation time. For certain
> applications this might be much better than finding out months later
> when you try to apply the delta.

Hmm, right. That's an important point. I was thinking mostly of on-line
transfers (e.g., http://tromer.org/misc/rexecsync).

>>Sounds good, especially the variable seed part. :-)

Speaking of variable seeds, the best way to implement it elegantly and
without relying on a patched MD4 (or whatever) implementation is to use
salt, i.e., by prepending the random string to the file data and running
the usual hash.
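As a minimal sketch of that salting construction (MD5 stands in here for whatever strong hash librsync actually uses; the function names are illustrative, not librsync's API):

```python
import hashlib
import os

def make_seed():
    # Fresh random seed, generated once per signature run.
    return os.urandom(16)

def salted_blocksum(seed, block):
    # Prepend the random seed to the block data and run the usual,
    # unmodified hash -- no patched MD4/MD5 implementation needed.
    h = hashlib.md5()
    h.update(seed)
    h.update(block)
    return h.digest()

seed = make_seed()
digest = salted_blocksum(seed, b"one block of file data")
```

The same effect could be had by keying the hash internally, but prepending the seed keeps the stock hash implementation untouched, which is the point made above.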

>>* "rdiff signature" on large cached file:  109MB/sec
>>* OpenSSL MD5, 1KB blocks:                 246MB/sec
>>               8KB blocks:                 326MB/sec
>>* OpenSSL SHA1, 1KB blocks:                148MB/sec
>>                8KB blocks:                174MB/sec
> You are not
> comparing apples with apples here; the rdiff signature includes rollsums
> etc, and probably a crappy md4sum implementation.

My point exactly: it follows that switching to OpenSSL's MD5 won't slow
things down compared to the current code.

> All the arguments for md5 or sha1 vs md4 revolve around extra protection
> against malicious attacks that craft hash collisions. In the case of
> librsync, the biggest threat from malicious attacks does not revolve
> around vulnerabilities in md4sum, but from the fixed seed. Changing to a
> variable seed will solve these, 

I don't think MD4 was ever evaluated in the context of variable seeds.
But it is an extremely weak hash function, so I would not be surprised
if it turns out to be susceptible even to that. If the cost in
performance is negligible and compatibility is broken anyway, why not
use something less broken?

> Though it someone can demonstrate that a different hash has a
> significantly better distribution over random data than md4, then maybe
> it would be worth considering (ie, it would avoid accidental collisions
> better).

I find this unlikely.

> Also, the "metahash" hash of all the blocksums worries me... it might
> not be very reliable.

Merkle hash trees are quite similar and well-analyzed by the
cryptographic community. In a Merkle hash tree you have a hash function
compressing a 2n-bit string to an n-bit string, and you compute the
n-bit hash of the whole file recursively: first you hash the file down
to half its size by applying the hash to each aligned 2n-bit block, then
down to a quarter of the size, and so on. It has many wonderful uses.
The only essential difference compared to our meta-hash is that the
adversary can affect the partitioning of the raw data into chunks, but
that shouldn't help him if he can't control the hash of those chunks.
But if you want to be really calm in the face of exotic cryptanalytic
attacks, when computing the hash of a chunk represented by a packet,
just include the chunk offset and chunk length in the input to the hash.
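A hedged sketch of that last suggestion, binding each chunk hash to its position so that controlling chunk boundaries buys the adversary nothing (all names are illustrative, and MD5 again stands in for the actual strong hash):

```python
import hashlib

def chunk_hash(seed, data, offset):
    # Include the seed, the chunk's offset, and its length in the
    # hashed input, ahead of the chunk data itself.
    h = hashlib.md5()
    h.update(seed)
    h.update(offset.to_bytes(8, "big"))
    h.update(len(data).to_bytes(8, "big"))
    h.update(data)
    return h.digest()

def meta_hash(seed, chunks):
    # The "metahash": a hash over all per-chunk hashes, in file order.
    h = hashlib.md5()
    h.update(seed)
    offset = 0
    for chunk in chunks:
        h.update(chunk_hash(seed, chunk, offset))
        offset += len(chunk)
    return h.digest()
```

Because offset and length go into each chunk hash, repartitioning the same raw bytes into different chunks changes every affected chunk hash, and hence the meta-hash.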

