


From: graydon hoare
Subject: [Monotone-devel] Re: Support for binary files, scalability and Windows port
Date: Tue, 20 Jan 2004 13:25:18 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4

Asger Kunuk Ottar Alstrup wrote:

> In order to represent this accurately, in the face of distributed use, I
> think you need to represent every single change as an edge in the graph
> somehow, and the order in which they happened. In other words, you
> effectively have to record an ordering of your back-edges or
> cancellation edges.

well, in a sense you're right. I don't want to belabor the point too much, except to point out again that the DAG is kept over *manifest* versions, not file versions. so for example if you add even 1 bit to a ChangeLog file on each revision, the ChangeLog SHA1 changes, and the manifest ID changes, and I have distinct nodes in my graph again.
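
to make the manifest point concrete, here is a minimal python sketch. it assumes a manifest is just a sorted list of "sha1  path" lines that is itself hashed; that layout, and the names file_id and manifest_id, are illustrative assumptions and not monotone's actual manifest format or code.

```python
import hashlib
from typing import Dict


def file_id(content: bytes) -> str:
    """SHA1 of a single file's content."""
    return hashlib.sha1(content).hexdigest()


def manifest_id(files: Dict[str, bytes]) -> str:
    """SHA1 over a hypothetical manifest: one 'sha1  path' line per file,
    sorted by path so the result does not depend on dict ordering."""
    lines = [f"{file_id(data)}  {path}\n" for path, data in sorted(files.items())]
    return hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()


# two trees that differ only in ChangeLog
tree_a = {"ChangeLog": b"rev 1\n", "main.c": b"int main(void) { return 0; }\n"}
tree_b = dict(tree_a, ChangeLog=b"rev 1\nrev 2\n")

# main.c is byte-identical in both trees, yet the manifest IDs differ,
# so the two trees are distinct nodes in the graph
ids_differ = manifest_id(tree_a) != manifest_id(tree_b)
```

any change to any one file, however small, is enough to give the tree as a whole a new identity.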

granted, this is a bit of a cheap hack; it's just (a) simple and (b) in the hands of the user. if they want to incorporate the date and time of the last revision -- or a UUID for that matter -- into the notion of a "version", it's as simple as making sure it shows up in an easily-merged file somewhere in the manifest.

it's a simple model, and simplicity is important to me: if monotone's model of something grows too complex, my reasoning about the model gets weak and error-prone, not to mention it becomes harder to explain the model to users. since users like to consider version control "very permanent and safe", it's important for them to understand what it's doing beneath the covers, at least in general.

> That is a good proposal, and that might work for video files. I think I
> need to give you a little background of where I am coming from.
> ...

ahh, here is the juicy part. your needs are clearly formidable, and you are willing to dedicate some effort to solving them. fair enough. let me split what I see as your requirements into 3 sections:

 - the need to mark some files as "opaque", in the sense that they are
   not necessarily scanned for common substructure with their own past
   versions or neighbours, not gzipped, not merged.

 - the need to support very large files: overcoming the 16mb limit in
   the database, and removing any cases in which files are loaded into
   memory in their entirety.

 - the need, possibly, to change the way files are identified for one
   of two reasons: hashing takes too long, and (possibly) there are
   unacceptable failure cases in history graphs built from hashes.

I can imagine handling "opaqueness" with a hook: call the hook with a pathname (or other identifier), and if it returns true, monotone always stores and sends complete versions (no xdelta or similar-block scanning) and doesn't bother gzipping. not too much effort to implement. we'd need to locate all the places we make assumptions about gzip and xdelta, and predicate them on the hook.
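
a rough python sketch of the shape this could take. the hook name is_opaque, the pattern list and encode_for_storage are all hypothetical, and the delta step is left out entirely; the only point is that one predicate decides between "store whole and raw" and the normal compressed path.

```python
import fnmatch
import zlib

# hypothetical user-supplied patterns; a real hook could use any rule it likes
OPAQUE_PATTERNS = ["*.avi", "*.mpg", "*.iso"]


def is_opaque(path: str) -> bool:
    """Hypothetical hook: return True if the file should be treated as opaque."""
    return any(fnmatch.fnmatch(path, pattern) for pattern in OPAQUE_PATTERNS)


def encode_for_storage(path: str, data: bytes, opaque_hook=is_opaque):
    """Return (encoding, payload) for one file version.

    Opaque files are stored complete and uncompressed. Everything else takes
    the usual path, represented here only by zlib compression (no deltas).
    """
    if opaque_hook(path):
        return ("full", data)
    return ("gzip", zlib.compress(data))
```

every place that currently assumes gzip or xdelta would need to ask the same question before touching the data.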

I think we're on our way to supporting large files. breaking the 16mb barrier is probably the easy part since it can be confined to the storage system. if we're going to a block-collection model for storage anyways, that would buy you 16mb of block commands. say each block command is 128 bits (16 bytes), then you can fit a million of those in an existing 16mb fragment, so if each command names a block of up to 16mb, you might be able to store files of around 16tb in size.
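
the arithmetic behind that estimate, as a small python sketch. the 16mb fragment and the 128-bit command size come from the paragraph above; the 16mb-per-block figure is an assumption, chosen because it is what makes the total come out at 16tb.

```python
MB = 2 ** 20
TB = 2 ** 40

fragment_size = 16 * MB        # existing per-item limit in the database
command_size = 128 // 8        # one block command: 128 bits = 16 bytes

# how many block commands fit in one fragment
commands_per_fragment = fragment_size // command_size

# assumption: each command refers to one stored block of up to 16mb
block_size = 16 * MB

max_file_size = commands_per_fragment * block_size
max_file_size_tb = max_file_size // TB
```

with these numbers commands_per_fragment is 2**20 (about a million) and max_file_size_tb is 16.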

removing all the places where we assume we can load a file into memory might be hard, might not be. if you can live with loading the "top" item in a file -- the up-to-16mb block-command list -- into memory all at once, we only need to change places where we "reach inside" that data, rather than all possible references. or, if even that is too expensive, we could possibly make the data object lazy, so that it keeps a small memory cache of its own sections, and loads/flushes them on demand. complex, but doable.
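
a minimal python sketch of the "lazy data object" idea, assuming sections can be fetched one at a time by index. the class name LazyData, the load_section callback and the least-recently-used eviction policy are illustrative choices, not anything that exists in monotone.

```python
from collections import OrderedDict


class LazyData:
    """Holds at most cache_size sections in memory; loads the rest on demand."""

    def __init__(self, load_section, section_count, cache_size=4):
        self._load = load_section      # callback: index -> bytes (e.g. a db read)
        self._count = section_count
        self._cache_size = cache_size
        self._cache = OrderedDict()    # index -> bytes, least recently used first

    def section(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        if index in self._cache:
            # cache hit: mark as most recently used
            self._cache.move_to_end(index)
            return self._cache[index]
        # cache miss: load, then flush the least recently used section if full
        data = self._load(index)
        self._cache[index] = data
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)
        return data

    def cached_sections(self):
        """Indices currently held in memory, least recently used first."""
        return list(self._cache)
```

callers that only ever ask for one section at a time never see more than cache_size sections resident, whatever the total file size.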

finally, the change of identifier type: again, I am wary of the indirection-table approach, so I am trying to consider alternatives. I think this could be done with a hook. the calculate_identifier() calls could be changed to depend on a hook which optionally picks some non-SHA1 way of calculating identifiers. then if you have something else in mind you can use it. it would require all the users of a given project to have that hook installed, but otherwise monotone would be completely ignorant of your chosen strategy.
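
sketched in python, the hook could look something like this. calculate_identifier here is a stand-in for the real calls, and set_identifier_hook and sampled_identifier are hypothetical names; the sampled scheme is only an example of "some non-SHA1 way" and is deliberately weak, since two files that differ only outside the sampled windows get the same identifier.

```python
import hashlib

_identifier_hook = None  # None means "use the default SHA1 of the full content"


def set_identifier_hook(hook):
    """Install (or clear, with None) a project-wide identifier hook."""
    global _identifier_hook
    _identifier_hook = hook


def calculate_identifier(data: bytes) -> str:
    """Identifier for a blob: the hook's answer if one is installed, else SHA1."""
    if _identifier_hook is not None:
        return _identifier_hook(data)
    return hashlib.sha1(data).hexdigest()


def sampled_identifier(data: bytes, window: int = 64 * 1024) -> str:
    """Hypothetical cheap scheme: hash the length plus the first and last
    window bytes, instead of reading the whole file."""
    h = hashlib.sha1()
    h.update(str(len(data)).encode("ascii"))
    h.update(data[:window])
    h.update(data[-window:])
    return "sampled:" + h.hexdigest()
```

everyone working on a given project would need to install the same hook, otherwise the same content would get different identifiers on different machines.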

would this set of changes satisfy your needs? I would be happy to accommodate these as they are mostly hidden from smaller-scale users, and can be described in "advanced use" sections of the manual. they are the sort of compromise unlikely to cause mainstream breakage or unnecessary multiplication of ideas in the easy cases.

-graydon



