[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] Re: Support for binary files, scalability and Windows p

From: graydon hoare
Subject: [Monotone-devel] Re: Support for binary files, scalability and Windows port
Date: Mon, 19 Jan 2004 03:02:36 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4

Ori Berger wrote:

If you have a persistent suffix tree (one tree for the _entire_
repository), when you want to put a new version in the database, you
just start walking the tree, and note what part of it you take from
what repository file.

yes, I somewhat got the impression this is what you were saying; it sounds very exciting, but I'm just not aware of enough of the details of how it works to, uh.. make it work. if you'd be so kind as to take me through it step by step, inserting say 3 or 4 different strings with shared substrings into a persistent database, or point to a paper on use of persistent suffix trees, I'd appreciate it. I'm a bit confused about how much the strings need to be torn up into fragments (eg. how the tree stores S1 and S2 where several sub-ranges of S1 and S2 are the same, do I need to split each into fragments and store them individually?)

What you're describing is, as you said, extended xdelta. What I was describing is doing away with ancestry as a factor in storage and identification. If breaking to blocks depends on context (e.g., ancestry), then, patches that are applied in a different order (but result in the same outcome) are likely to generate different block structures. Personally, I much prefer stateless representations that depend only on the data and not its history - YMMV.

oh, absolutely, I prefer that too. I wasn't suggesting breaking data into blocks based on ancestry; I was suggesting that the fragment list that represents a file might be calculated by searching the adler32 signatures found in blocks used by recent ancestors. that's just a heuristic for finding likely overlapping blocks: ancestors are probably "nearby" in any similarity metric you have in mind.

if you can work out how suffix trees let me seek directly to sub-structure similarities, all the better.

I didn't understand that one. I was suggesting (or at least, thought I was suggesting) either (a) to include SHA1 for blocks inside the same manifest, or alternatively, (b) keep another per-file manifest that says how to construct it from its SHA1 blocks, in the same way that a manifest can be used to construct an atomic revision from files.

I'm not changing the main manifest format, so I'm taking door (b) here. that's fine: in a sense files are already "manifests" of "how to construct" them: they're xdelta edit scripts which point to extents in more recent versions of the file. we'd just be flattening the historical structure of data referenced by such scripts into a uniform set of blocks.

Yep, it's not too bright; But I wasn't suggesting dropping SHA1 in any way, and I couldn't understand the relation to inodes. Could you please elaborate?

oh, I just meant that the data structure of a list of pointers to blocks (and possibly sub-lists of blocks) is the same as an inode, and it's "machinery" hidden from both a directory listing and a linear byte-stream view of a file. the filesystem hides it. I intend to hide this too. it's an implementation issue to handle large files and possibly help construct compact storage & synchronization structures.

the user's view would be unaffected: manifests map filenames to SHA1 codes, which identify linear streams of bytes.

Sounds good. As Nathaniel hinted in a recent post, hash tree synchronization might not play well with your block-extended xdelta; It needs to be stateless to play well, I think.

no, synchronization would be over the block set or the file set. the representation of files would be left to the individual storage manager (as xdelta is now -- if I receive a file it is completely up to me how I choose to store it).

Ouch. One of the things I like so much about monotone is that it is transport agnostic. Requiring a smart online server would kill that. I also think it's an increased barrier for many people.

maybe. as I said in other mail, I'm not about to do anything which I would perceive as hobbling monotone. on the other hand, I see a lot of complaint about the awkwardness of the CGI system, and I've so far heard of nobody using the NNTP system.

the "fallback" form of synchronization is to get a list of *everything* on your synchronization host, and transfer all the missing bits. that can be done over (S)FTP; it just has really unfortunate scalability.

It should still be possible to build an NNTP/email/dumb-web gateway though, even if you do move to an interactive hashtree-sync.

or that, yes. but do I want to keep that code alive and debugged if nobody uses it? not especially. we'll see how it plays out. let me get working code first so we're not just waving our hands at each other :)


reply via email to

[Prev in Thread] Current Thread [Next in Thread]