
From: graydon hoare
Subject: Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port
Date: Sat, 17 Jan 2004 01:11:14 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4

Ori Berger wrote:

> I had been thinking about this as well.

there are some things in here I'm curious about, and some I disagree with.

> First, it's possible to use a suffix tree to locate repeating parts. It can be implemented much, much faster than rsync's algorithm.

rsync's algorithm is a 2-level thing which works when only one source is available and you only have an [adler32,md4] tuple list from the other guy; xdelta's "modification" is to drop that assumption and extend adler32-matched windows forwards and backwards by directly comparing source and destination regions. cost per byte in that phase is 1 compare. the cost per byte when scanning for matching windows is 1 add, 1 subtract, a mask, and a hashtable lookup (the expensive part). I don't really see how or why a suffix array would find anchor regions any faster. wouldn't the cost per byte be a partial (failing) suffix array lookup? can you elaborate?
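for concreteness, here is a small sketch of that scan in Python (illustrative only: `WIN`, the table layout, and the match policy are my assumptions, not monotone's code; extending matches forwards and backwards is omitted for brevity):

```python
# Sketch of the xdelta-style anchor scan: hash every fixed-size window
# of the source with Adler-32, then slide a rolling window over the
# target; on a checksum hit, verify by direct byte comparison.

MOD = 65521
WIN = 16  # window size; real implementations use larger windows

def adler(data):
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD
        b = (b + a) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, n):
    # slide an n-byte window one position: drop out_byte, append in_byte
    a2 = (a - out_byte + in_byte) % MOD
    b2 = (b - n * out_byte + a2 - 1) % MOD
    return a2, b2

def find_anchors(src, dst):
    """Yield (dst_offset, src_offset) pairs where WIN bytes match."""
    table = {}
    for i in range(len(src) - WIN + 1):
        table.setdefault(adler(src[i:i + WIN]), []).append(i)
    if len(dst) < WIN:
        return
    a, b = adler(dst[:WIN])
    for j in range(len(dst) - WIN + 1):
        for i in table.get((a, b), []):
            if src[i:i + WIN] == dst[j:j + WIN]:  # direct verification
                yield (j, i)
        if j + WIN < len(dst):
            # cost per byte: 1 add, 1 subtract, a mod, a table lookup
            a, b = roll(a, b, dst[j], dst[j + WIN], WIN)

src = b"the quick brown fox jumps over the lazy dog"
dst = b"the quick brown cat jumps over the lazy dog"
print(next(iter(find_anchors(src, dst))))  # → (0, 0)
```

building the table recomputes the checksum per source window here; a real implementation would roll on the source side as well.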

> <>. This implementation doesn't use persistent storage, but if it did, then it would take more or less zero time to locate _all_ sources for a file almost immediately.

I don't know what you mean by "all sources for a file". could you be more specific about where you mean to use a suffix array, and how it helps with a particular bit of retrieval or storage? the way I see it, a persistent suffix array makes the on-disk representation of a string considerably larger, without improving the xdelta problem (which is: find all common regions between these two strings).

> Find a way to partition files into blocks, in such a way that:
> 1. Block lengths are reasonable (definitions later)
> 2. The block boundaries are likely to be unchanged by small modifications - i.e. an insertion of 1 byte somewhere should have a small probability of changing the boundary area, a smaller probability of changing two boundaries, ... and a vanishingly small probability of changing all boundaries.
>
> Now, given a file, break it into blocks; store each block independently, with its own hash; store a manifest that says how to rebuild a file from its blocks. And that's it.
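a sketch of how such a boundary-stable partition is usually obtained, for illustration (this is content-defined chunking; the hash, window behavior, mask, and length bounds below are my choices, not anything specified in the thread):

```python
# Content-defined chunking: declare a block boundary wherever a rolling
# hash of recent bytes hits a fixed bit pattern, so a 1-byte insertion
# only disturbs the boundaries near the edit.

import hashlib

MASK = 0x3F         # cut when the low 6 bits are all set (~64B average blocks)
MIN, MAX = 16, 256  # "reasonable" block length bounds (tiny, for demonstration)

def chunk(data):
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # old bytes shift out of the hash
        length = i - start + 1
        if (length >= MIN and (h & MASK) == MASK) or length == MAX:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# deterministic pseudo-random test data
data, seed = b"", b"seed"
while len(data) < 4096:
    seed = hashlib.sha256(seed).digest()
    data += seed

data2 = data[:100] + b"X" + data[100:]  # a 1-byte insertion
a, b = chunk(data), chunk(data2)
ha = {hashlib.sha1(c).hexdigest() for c in a}
hb = {hashlib.sha1(c).hexdigest() for c in b}
print(len(ha & hb), "of", len(ha), "block hashes survive the insert")
```

the chunker resynchronizes at the first content-defined boundary after the edit, so every block downstream of that point keeps its hash.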

yes, I've been thinking something somewhat along these lines (for dealing with the file storage issue, anyways). I would make a few modifications to this scheme, though.

  - since this scheme doesn't involve coalescing or extending matching
    blocks (like xdelta does) the blocks need to be larger than you're
    thinking. a block identified by a SHA1 needs to be a few hundred
    bytes at minimum. the SHA1 itself is 20 bytes (binary) and as I
    will note momentarily I think adding extents will help, so a
    reference to a block extent will likely be 20 + 8 + 8 = 36 bytes

  - for inserts smaller than minimum block size, it is better to just
    put an "insert len [literal data]" marker in the script, rather
    than a reference to a tiny block fragment. if the sum of all
    the literals in a script is larger than block size, you can
    concatenate them and push that one block down to the block layer,
    replacing the literals with references.

  - for deletes less than the size of the block containing the delete,
    or any sort of insert in the middle of a block, your script will
    benefit from being able to denote extents within blocks, rather than
    all-or-nothing references to blocks.

  - I'm still not precisely sure how to accommodate searching for
    matching blocks or sub-block extents between a new file and "the
    entire database". it's possible your thoughts on suffix arrays
    will come in handy here; otherwise I am thinking of a simpler
    scheme:
           - when the file is "new" to the database (has no previous
             matching manifest entry) split it up into new blocks.

           - when a file is being "patched", run fine-grained xdelta
             to locate matching extents with a table containing the
             adler32s of all small windows within all blocks of the
             previous version of the file

           - after each xdelta-like operation, check to see if there are
             enough inserts to coalesce into a new block

    that technique is really nothing more than extending xdelta to work
    over block-structured files, which is (as I'll get to) advantageous
    since it permits Very Large Files (~2 ** 40 or so) and also makes
    network sync work.

    hashing Very Large Files would still be a bit of a bitch, though.
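one mitigating observation (mine, not part of the proposal): SHA1 streams, so even a Very Large File can be hashed in one pass with constant memory; the cost is the pass itself, not the footprint.

```python
# Streaming SHA1 over a file in fixed-size reads: memory use is
# constant regardless of file size.

import hashlib
import os
import tempfile

def sha1_of_file(path, bufsize=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            h.update(buf)
    return h.hexdigest()

# demo: a file a bit over 3 MiB, hashed without loading it whole
content = b"x" * (3 * (1 << 20) + 17)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(content)
    path = f.name
print(sha1_of_file(path) == hashlib.sha1(content).hexdigest())  # → True
os.unlink(path)
```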

> The blocks could be identified by their SHA1, and the manifest could either be file-specific, or the standard manifest extended to include a part specification, e.g.

hmm, no, not changing the manifest format. the file remains identified by its SHA1. if you don't like SHA1, substitute some other function which makes an identifier given an input string, but if the filesystem can maintain the notion of a "file" using inodes, so can manifests.

as far as I'm concerned, I'd keep doing what I'm doing now, reuse the block + delta storage system for *storing* manifests, too. they're data too.

> This also interoperates nicely with network stuff - to fetch a version from a depot, you fetch the manifest, and then fetch all the SHA1 blocks listed in the manifest that you don't already have. As simple as that.

hm, for the network I have a broader idea: use hash trees over the entire space of SHA1 to synchronize my collection + your collection into the union of both (on both ends). with some special accounting to manage singletons and tombstones, and a good spread factor, it's very efficient to synchronize hash trees, and I'd use the exact same scheme to sync the collection of blocks, the collection of manifests, the collection of files, the collection of keys, and the collection of certs.
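a toy illustration of that idea (not monotone's netsync wire protocol; bucketing by hex prefix stands in for the spread factor, and both collections sit in one process rather than across a network):

```python
# Hash-tree synchronization sketch: bucket item hashes by hex prefix,
# hash each bucket, and recurse only into prefixes whose bucket hashes
# differ, so two mostly-equal collections compare very little data.

import hashlib

def bucket_hash(items, prefix):
    sub = sorted(h for h in items if h.startswith(prefix))
    return hashlib.sha1("".join(sub).encode()).hexdigest()

def diff(a, b, prefix="", fanout="0123456789abcdef", depth=2):
    """Return (items a lacks, items b lacks)."""
    if bucket_hash(a, prefix) == bucket_hash(b, prefix):
        return set(), set()          # subtrees agree: prune
    if len(prefix) == depth:         # leaf: exchange the differing bucket
        sa = {h for h in a if h.startswith(prefix)}
        sb = {h for h in b if h.startswith(prefix)}
        return sb - sa, sa - sb
    miss_a, miss_b = set(), set()
    for c in fanout:
        ma, mb = diff(a, b, prefix + c, fanout, depth)
        miss_a |= ma
        miss_b |= mb
    return miss_a, miss_b

a = {hashlib.sha1(bytes([i])).hexdigest() for i in range(50)}
b = a | {hashlib.sha1(b"new item").hexdigest()}
need_a, need_b = diff(a, b)
print(need_a == {hashlib.sha1(b"new item").hexdigest()}, need_b == set())  # → True True
```

after the exchange both ends hold the union, which is the behavior described above; singletons and tombstones would need the extra accounting mentioned.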

I'm cooking up a little proof of concept for this scheme now. it'll be a few days before I have anything working, but I'm pretty excited about it.

> There's no need to have a staging queue for the depot.

indeed. removing queues and all the associated logic is something I would very much like.

> About NNTP, email, "dumb web" distribution - all you have to do is record, for each block, whether or not it was sent to a specific destination.

no, I think I'd just remove these things altogether, let people sync between databases directly as the primary mode of operation.

> * Requires more storage and/or bandwidth; a one-byte change
>   in a 100MB file would cost ~500KB with this scheme, and
>   ~100B with the current delta scheme.

nah, it's not quite so bad. a 100MB file will have something like 10 blocks of 10MB each. the script for building it will be at minimum 360 bytes long (10 extent references at 36 bytes each; though I'd prefer a human-readable script form, so more like 500 bytes). inserting 1 byte will add perhaps 10 bytes to the script, plus require transmission of a new script: 510 bytes total. deleting 1 byte will probably split 1 extent reference into 2, adding maybe 50 bytes: 550 bytes total.
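the arithmetic above assumes a script of extent references plus small literal inserts; a hypothetical encoding (names and format are mine, purely for illustration) might look like:

```python
# A file is rebuilt from a script of block-extent references
# (hash, offset, length) and small literal inserts; a delete in the
# middle of a block just splits one extent reference into two.

import hashlib

blocks = {}  # block store: sha1 hex -> bytes

def put_block(data):
    h = hashlib.sha1(data).hexdigest()
    blocks[h] = data
    return h

def apply_script(script):
    out = []
    for op in script:
        if op[0] == "extent":        # ("extent", sha1, offset, length)
            _, h, off, length = op
            out.append(blocks[h][off:off + length])
        else:                        # ("insert", literal_bytes)
            out.append(op[1])
    return b"".join(out)

old = b"A" * 100 + b"B" * 100
h = put_block(old)

# new version: delete 10 bytes mid-block and insert a small literal;
# the script splits one extent into two rather than storing new blocks
script = [
    ("extent", h, 0, 95),
    ("insert", b"hello"),
    ("extent", h, 105, 95),
]
new = apply_script(script)
print(len(new), new[90:110])  # → 195 b'AAAAAhelloBBBBBBBBBB'
```

if the literals in a script grow past block size, they can be concatenated, pushed down as a fresh block, and replaced by one more extent reference, as described above.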

for a heavily edited file, it'll slowly get worse, but maybe you could have a "defragment" routine which builds some fresh blocks (especially if a bunch of blocks appear with refcount=1; might as well toss them)

> * Latest version, unless properly cached, will take longer to
>   construct (need to pull all blocks, which will be scattered
>   in the database). And proper caching costs space...

true. this could get to be a noticeable cost. again, defragmentation is a possibility, or else just building an LRU file cache in the db. I'm not averse to that.
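a minimal sketch of such an LRU cache (assuming some `reconstruct(file_id)` that replays a block script; the names are hypothetical and the store is in-memory rather than in the db):

```python
# LRU cache of reconstructed files: bounds space by evicting the least
# recently used entry, so hot files skip block reassembly.

from collections import OrderedDict

class LRUFileCache:
    def __init__(self, reconstruct, capacity=4):
        self.reconstruct = reconstruct  # file_id -> rebuilt contents
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, file_id):
        if file_id in self.cache:
            self.cache.move_to_end(file_id)     # mark as recently used
        else:
            self.cache[file_id] = self.reconstruct(file_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[file_id]

calls = []
cache = LRUFileCache(lambda fid: calls.append(fid) or f"data:{fid}", capacity=2)
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")  # "b" evicted
cache.get("b")                                                  # rebuilt
print(calls)  # → ['a', 'b', 'c', 'b']
```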

