
From: Asger Kunuk Alstrup
Subject: [Monotone-devel] RE: Support for binary files, scalability and Windows port
Date: Sun, 18 Jan 2004 14:39:43 +0100

graydon hoare wrote:
> eh, I doubt anything intrinsically *relies* on it; the hash is pretty
> much treated as an opaque function from data->identifier. you could in
> theory change it to use any sort of hashing scheme: partial, smaller,
> faster, bigger, stronger. there are many to choose from.

My question was more about what requirements you place on this function from
data to id, in the most general sense.

Does it need to be injective? I.e. should different data with reasonable
probability result in different ids?

Does it need to be surjective? I.e. given an id, should you with reasonable
probability be able to find the data it corresponds to?

It seems from your comments that you use it to identify data based on the id, so
it has to be both, i.e. effectively bijective. In other words, you cannot get
away with the function "hash(data) = random string of bits", because then you
cannot identify the data from the id. So the answer is: it has to be a hash.
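To make the two properties concrete, here is a minimal sketch of a content-addressed store (Python; the name calculate_ident is borrowed from the discussion above, everything else is hypothetical):

```python
import hashlib

store = {}  # ident -> data

def calculate_ident(data: bytes) -> str:
    # "Injective" in practice: distinct data yields distinct ids
    # with overwhelming probability.
    return hashlib.sha1(data).hexdigest()

def put(data: bytes) -> str:
    ident = calculate_ident(data)
    store[ident] = data
    return ident

def get(ident: str) -> bytes:
    # "Surjective" in the sense used above: the id lets us find the
    # data again, but only because the id was derived from the data.
    return store[ident]

ident = put(b"some file contents\n")
assert get(ident) == b"some file contents\n"
```

With "hash(data) = random string of bits" the lookup in get would still work within one store, but two repositories could never agree on the id of identical data, which is what the content hash buys.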

I had hoped that the function did not have to be surjective, so that it could be
computed in constant time rather than in time linear in the amount of data.

Looking at the source, there are 31 calculate_ident calls outside transforms.*.
Reviewing those and determining exactly what they require could help. In some
cases, it might be possible to calculate the hash over only some parts of the
files; but other than that, I guess the best approach is to avoid and optimise
the hash calculation as much as possible.
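One way to avoid repeated hash calculations, sketched here as an assumption rather than anything Monotone actually does, is to cache idents keyed by the file's stat data, so an unchanged file is never re-read:

```python
import hashlib
import os
import tempfile

# Hypothetical cache: (path, mtime, size) -> ident.
_ident_cache = {}

def calculate_ident(path: str) -> str:
    st = os.stat(path)
    key = (path, st.st_mtime_ns, st.st_size)
    if key not in _ident_cache:
        # The linear-time scan over the data happens only on a miss.
        with open(path, "rb") as f:
            _ident_cache[key] = hashlib.sha1(f.read()).hexdigest()
    return _ident_cache[key]

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"unchanged contents\n")
    path = f.name
first = calculate_ident(path)
second = calculate_ident(path)  # served from the cache, no re-read
assert first == second
os.unlink(path)
```

This does not change the function from data to id at all; it only skips recomputing it when the stat data says the file cannot have changed.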

Anyway, if you are considering changing the fundamental data structure to a tree
of hashes or something else, please consider having two different "hash"
functions, if that makes sense at all: one to be used when you need a unique
identifier for a "version" (which I would suggest should just be a timestamp
along with a random string of bits), and another when you need a short extract
that can be used to find data. That way, you could optimise the data structure
for large-file support: only use linear-time scans over the data when you
really have to.
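The two-function split proposed above could look like the following sketch (Python; both function names are hypothetical):

```python
import hashlib
import os
import time

def version_id() -> str:
    # O(1): a timestamp plus random bits, never touches file contents.
    return "%d-%s" % (time.time_ns(), os.urandom(16).hex())

def content_id(data: bytes) -> str:
    # O(len(data)): only needed when the data must be found from the id.
    return hashlib.sha1(data).hexdigest()

data = b"x" * 10_000
v1, v2 = version_id(), version_id()
assert v1 != v2  # each version gets a fresh id, even for identical data
assert content_id(data) == content_id(data)  # content id is reproducible
```

The point of the split is exactly the cost asymmetry in the comments: version_id stays constant-time no matter how large the file is, while content_id is reserved for the cases that genuinely need content lookup.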

I still think that the current hash approach has another downside: in source
control, you often need to revert a file to a previous state. This results in
the same hash for the file, although it is technically not the same version.
I would therefore prefer a random string of bits plus a timestamp to identify
versions, both to avoid these collisions and to avoid linear-time scans of the
files.
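The revert scenario can be made concrete with a small sketch (Python, illustrative only): under a pure content hash, the reverted version is indistinguishable from the original.

```python
import hashlib

def content_id(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

v1 = b"original contents\n"
v2 = b"edited contents\n"
v3 = b"original contents\n"  # revert back to the first state

assert content_id(v1) != content_id(v2)  # the edit is visible
assert content_id(v1) == content_id(v3)  # the revert collides with v1
```

A timestamp-plus-random version id would keep v1 and v3 distinct as versions, at the cost of no longer telling you that their contents happen to be identical.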

Best regards,
Asger Ottar Alstrup
