monotone-devel


From: graydon hoare
Subject: [Monotone-devel] Re: Support for binary files, scalability and Windows port
Date: Mon, 19 Jan 2004 02:27:28 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4

Asger Kunuk Alstrup wrote:

Anyway, if you are considering to change the fundamental data-structure to a
tree of hashes or something else, please consider to have two different "hash"
functions if that at all makes sense: One that is to be used when you need a
unique identifier to identify a "version" (which I would suggest should just be
a timestamp along with a random string of bits), and another when you need a
short extract that can be used to find data. That way, you could optimise the
data structure for large file support: Only use linear time scans over the data
when you really have to.

no, I cannot do this. it adds too much fragility. monotone is a distributed system with no lock-step synchronization; this means that you and I can perform actions in parallel (adding the same file, merging two trees into a third) and make assertions about those actions which will be meaningful when exchanged with a third party.

what's good about identifying things by content hash is that you and I will always construct the *same* content hash for a given object, even when we are not explicitly communicating. we might be hours, days, years away from reconciling our work, yet we chose the same identifiers. if I make some other UUID which I bind to a content hash by attribution, I need to hold the SHA1<->UUID mapping tables for all the people ever involved in my VC system: that set of mappings becomes a critical piece of indirection, and if that indirection breaks or is corrupted the system falls apart.
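
To make the contrast concrete, here is a minimal sketch in Python. It is not monotone's code: `content_id` and the "alice"/"bob" variables are hypothetical names, and SHA1 over the raw bytes is assumed as the content hash. Two parties who never communicate derive the same identifier from the same bytes, while independently generated UUIDs differ and would need a mapping table to reconcile.

```python
import hashlib
import uuid


def content_id(data: bytes) -> str:
    """Identifier derived only from the bytes themselves (SHA1, hex-encoded)."""
    return hashlib.sha1(data).hexdigest()


# Two people add the same file independently, with no communication.
alice_file = b"int main() { return 0; }\n"
bob_file = b"int main() { return 0; }\n"

alice_id = content_id(alice_file)
bob_id = content_id(bob_file)
same_content_id = alice_id == bob_id  # identical bytes give identical ids

# With random identifiers, each side invents its own name for the same data,
# so a separate id <-> content mapping has to be exchanged and kept intact.
alice_uuid = str(uuid.uuid4())
bob_uuid = str(uuid.uuid4())
same_uuid = alice_uuid == bob_uuid  # random ids will not match in practice
```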

I know this means that monotone will be limited by the speed of hashing data, and that this will hurt if you're using it for storing video. as I said, you're welcome to work out a different way to identify data based on its content (if you like, put your own favourite UUID in metadata tags inside the file, and extract them on the fly). but I'm not going to add extra code paths to manage another level of indirection for this case.
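
The cost mentioned above is easy to see in a sketch: every byte has to pass through the hash function once, so hashing time grows linearly with file size. The Python below is an illustration only, not how monotone reads files; `hash_stream` and the chunk size are assumptions made for the example.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=65536):
    """Hash a file-like object in fixed-size chunks.

    Memory use stays constant, but every byte is still read and fed to
    SHA1 exactly once, so the work is linear in the size of the data.
    """
    h = hashlib.sha1()
    bytes_read = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
        bytes_read += len(chunk)
    return h.hexdigest(), bytes_read


# Stand-in for a large binary file.
data = b"x" * 200_000
digest, n = hash_stream(io.BytesIO(data))
```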

(or, of course, being free software you are always free to make a derivative work of monotone; I'm just not committing my own time to it)

I still think that the current hash-approach has another downside: In source
control, you often need to revert a file to a previous state. This will result
in the same hash for the file, although it is technically not the same.
Therefore, I would prefer a random string of bits and a time-stamp to identify
versions, in order to avoid these collisions, and in order to avoid linear time
scans of the files.

I understand your concern here, but I cannot currently suppress my desire to keep identifiers in a (probabilistic) bijection with data. my feeling is that reverted files *are* technically the same as their previous state: the historical story about "this file reverted from this other file" is an external, attributed declaration. it is not an intrinsic aspect of the file, nor is the timestamp+random id of the file's creation.
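
A small Python sketch of this position, under the same assumptions as before (SHA1 as the content hash; `content_id`, `store`, and `history` are hypothetical names, not monotone data structures): a reverted file gets the same identifier as the state it reverts to, and the fact that a revert happened lives in a separate record about the identifiers rather than in the identifier itself.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identifier derived only from the bytes themselves (SHA1, hex-encoded)."""
    return hashlib.sha1(data).hexdigest()


v1 = b"hello\n"
v2 = b"hello, world\n"
v3 = b"hello\n"  # the file reverted to its first state

ids = [content_id(v) for v in (v1, v2, v3)]

# Storage keyed by content id holds one entry per distinct content.
store = {content_id(v): v for v in (v1, v2, v3)}

# The story of what happened is kept as external statements about ids.
history = [
    {"old": ids[0], "new": ids[1], "note": "edit"},
    {"old": ids[1], "new": ids[2], "note": "revert"},
]
```

Here `ids[0]` and `ids[2]` are equal and `store` holds two entries, while `history` still records both steps.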

of course, reasonable people may disagree.

-graydon



