[Monotone-devel] Re: Support for binary files, scalability and Windows p

monotone-devel

From:	graydon hoare
Subject:	[Monotone-devel] Re: Support for binary files, scalability and Windows port
Date:	Mon, 19 Jan 2004 03:02:36 -0500
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4

Ori Berger wrote:

If you have a persistent suffix tree (one tree for the _entire_
repository), when you want to put a new version in the database, you
just start walking the tree, and note what part of it you take from
what repository file.

yes, I somewhat got the impression this is what you were saying; it sounds very exciting, but I'm just not aware of enough of the details of how it works to, uh.. make it work. if you'd be so kind as to take me through it step by step, inserting say 3 or 4 different strings with shared substrings into a persistent database, or point to a paper on use of persistent suffix trees, I'd appreciate it. I'm a bit confused about how much the strings need to be torn up into fragments (eg. how the tree stores S1 and S2 where several sub-ranges of S1 and S2 are the same, do I need to split each into fragments and store them individually?)

What you're describing is, as you said, extended xdelta. What I was describing is doing away with ancestry as a factor in storage and identification. If breaking to blocks depends on context (e.g., ancestry), then, patches that are applied in a different order (but result in the same outcome) are likely to generate different block structures. Personally, I much prefer stateless representations that depend only on the data and not its history - YMMV.

oh, absolutely, I prefer that too. I wasn't suggesting breaking data into blocks based on ancestry; I was suggesting that the fragment list that represents a file might be calculated by searching the adler32 signatures found in blocks used by recent ancestors. that's just a heuristic for finding likely overlapping blocks: ancestors are probably "nearby" in any similarity metric you have in mind.

if you can work out how suffix trees let me seek directly to sub-structure similarities, all the better.
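
For illustration only, the adler32 heuristic above can be sketched in Python. This is not monotone code: the block size, the function names (`block_signatures`, `fragment_list`, `rebuild`) and the `("copy", offset, length)` / `("literal", bytes)` fragment shapes are all assumptions made for the example. It also recomputes adler32 at every offset instead of rolling the checksum incrementally, which a real implementation would likely avoid.

```python
import zlib

BLOCK = 16  # hypothetical block size, chosen small for the example


def block_signatures(ancestor, block=BLOCK):
    """Map the adler32 of each aligned ancestor block to its offsets."""
    sigs = {}
    for off in range(0, len(ancestor) - block + 1, block):
        sig = zlib.adler32(ancestor[off:off + block])
        sigs.setdefault(sig, []).append(off)
    return sigs


def fragment_list(new, ancestor, block=BLOCK):
    """Describe `new` as copies of ancestor blocks plus literal bytes."""
    sigs = block_signatures(ancestor, block)
    frags = []
    literal = bytearray()
    i = 0
    while i < len(new):
        window = new[i:i + block]
        match = None
        if len(window) == block:
            # adler32 is weak, so confirm a candidate by comparing bytes.
            for off in sigs.get(zlib.adler32(window), ()):
                if ancestor[off:off + block] == window:
                    match = off
                    break
        if match is None:
            literal.append(new[i])
            i += 1
            continue
        if literal:
            frags.append(("literal", bytes(literal)))
            literal = bytearray()
        frags.append(("copy", match, block))
        i += block
    if literal:
        frags.append(("literal", bytes(literal)))
    return frags


def rebuild(frags, ancestor):
    """Reassemble the byte stream that a fragment list describes."""
    out = bytearray()
    for frag in frags:
        if frag[0] == "copy":
            _, off, length = frag
            out += ancestor[off:off + length]
        else:
            out += frag[1]
    return bytes(out)
```

The ancestor only serves as a place to look for candidate blocks; the rebuilt byte stream, and therefore its SHA1, does not depend on which ancestor was searched.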

I didn't understand that one. I was suggesting (or at least, thought I was suggesting) either (a) to include SHA1 for blocks inside the same manifest, or alternatively, (b) keep another per-file manifest that says how to construct it from its SHA1 blocks, in the same way that a manifest can be used to construct an atomic revision from files.

I'm not changing the main manifest format, so I'm taking door (b) here. that's fine: in a sense files are already "manifests" of "how to construct" them: they're xdelta edit scripts which point to extents in more recent versions of the file. we'd just be flattening the historical structure of data referenced by such scripts into a uniform set of blocks.

Yep, it's not too bright; But I wasn't suggesting dropping SHA1 in any way, and I couldn't understand the relation to inodes. Could you please elaborate?

oh, I just meant that the data structure of a list of pointers to blocks (and possibly sub-lists of blocks) is the same as an inode, and it's "machinery" hidden from both a directory listing and a linear byte-stream view of a file. the filesystem hides it. I intend to hide this too. it's an implementation issue to handle large files and possibly help construct compact storage & synchronization structures.

the user's view would be unaffected: manifests map filenames to SHA1 codes, which identify linear streams of bytes.
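
A minimal Python sketch of the inode-like layering described above, under stated assumptions: the in-memory dicts `blocks` and `block_lists`, the fixed `BLOCK_SIZE`, and the names `store_file` / `load_file` are hypothetical and do not reflect monotone's actual storage. The point it shows is that the per-file block list stays hidden behind the file's SHA1, so a manifest keeps mapping filenames to SHA1 codes of linear byte streams.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


def store_file(data, blocks, block_lists):
    """Store `data` as SHA1-addressed blocks and return the file's SHA1.

    `blocks` maps block SHA1 -> bytes; `block_lists` maps file SHA1 ->
    ordered list of block SHA1s (the hidden, inode-like structure).
    """
    ids = []
    for off in range(0, len(data), BLOCK_SIZE):
        chunk = data[off:off + BLOCK_SIZE]
        block_id = sha1_hex(chunk)
        blocks[block_id] = chunk  # identical blocks are stored once
        ids.append(block_id)
    file_id = sha1_hex(data)
    block_lists[file_id] = ids
    return file_id


def load_file(file_id, blocks, block_lists):
    """Rebuild the linear byte stream and check it against its SHA1."""
    data = b"".join(blocks[b] for b in block_lists[file_id])
    if sha1_hex(data) != file_id:
        raise ValueError("reassembled data does not match file SHA1")
    return data


# The user-visible manifest is unchanged: filename -> SHA1 of the bytes.
blocks, block_lists = {}, {}
manifest_v1 = {"README": store_file(b"aaaabbbbcccc", blocks, block_lists)}
manifest_v2 = {"README": store_file(b"aaaaXXXXcccc", blocks, block_lists)}
```

Because blocks are keyed only by their content, the two versions above share the `aaaa` and `cccc` blocks regardless of the order in which they were stored.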

Sounds good. As Nathaniel hinted in a recent post, hash tree synchronization might not play well with your block-extended xdelta; It needs to be stateless to play well, I think.

no, synchronization would be over the block set or the file set. the representation of files would be left to the individual storage manager (as xdelta is now -- if I receive a file it is completely up to me how I choose to store it).

Ouch. One of the things I like so much about monotone is that it is transport agnostic. Requiring a smart online server would kill that. I also think it's an increased barrier for many people.

maybe. as I said in other mail, I'm not about to do anything which I would perceive as hobbling monotone. on the other hand, I see a lot of complaint about the awkwardness of the CGI system, and I've so far heard of nobody using the NNTP system.

the "fallback" form of synchronization is to get a list of *everything* on your synchronization host, and transfer all the missing bits. that can be done over (S)FTP; it just has really unfortunate scalability.
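
As a rough sketch of that fallback (an illustration, not monotone's implementation): both sides are modelled here as plain dicts from SHA1 id to bytes, and `fallback_sync` is a hypothetical name. On a dumb transport such as (S)FTP, the listing step would be a directory listing and each copy a file fetch.

```python
def fallback_sync(local, remote):
    """Copy every item present on `remote` but absent from `local`.

    `local` and `remote` map an id (e.g. a SHA1 hex string) to bytes.
    Returns the sorted list of ids that had to be transferred.
    """
    # Step 1: fetch the complete listing. This costs time proportional
    # to everything on the host, even when nothing is missing.
    listing = list(remote)

    # Step 2: work out which ids are missing locally.
    wanted = sorted(set(listing) - set(local))

    # Step 3: transfer only the missing items.
    for item_id in wanted:
        local[item_id] = remote[item_id]
    return wanted
```

The scalability problem is in step 1: the listing grows with the total size of the store, not with the amount of data that actually differs between the two sides.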

It should still be possible to build an NNTP/email/dumb-web gateway though, even if you do move to an interactive hashtree-sync.

or that, yes. but do I want to keep that code alive and debugged if nobody uses it? not especially. we'll see how it plays out. let me get working code first so we're not just waving our hands at each other :)


-graydon
