[Monotone-devel] Re: Support for binary files, scalability and Windows p

From:	graydon hoare
Subject:	[Monotone-devel] Re: Support for binary files, scalability and Windows port
Date:	Fri, 16 Jan 2004 18:44:30 -0500
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4

Asger Kunuk Ottar Alstrup wrote:

OK. The 16 MB limit is a showstopper for us, and probably the hashing as
well.  We are working with raw video files that are on the order of 100
MB big.

yeah. those are really not going to play well. not only hashing will be a problem; it is also the case that monotone sometimes keeps 1, 2, in rare cases 3 copies of a file in memory concurrently, in std::strings.

it is really made for source code control. if you need to use it for controlling very large individual files, it will need quite a bit of changing.

Do you have an impression of what the consequences of changing some
files to use automatically generated unique identifiers instead of
hashes, or partial hashes? In other words, how much of the code relies
on the fact that the id's are hashes of the complete file?

eh, I doubt anything intrinsically *relies* on it; the hash is pretty much treated as an opaque function from data->identifier. you could in theory change it to use any sort of hashing scheme: partial, smaller, faster, bigger, stronger. there are many to choose from.
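
To make the "opaque function from data->identifier" point concrete, here is a minimal Python sketch. It is not monotone's code: `Store`, `sha1_id` and `partial_id` are hypothetical names, and the "partial hash" shown (length plus the first and last 4096 bytes) is just one assumed scheme among many.

```python
import hashlib
from typing import Callable

# Hypothetical type: any function from file bytes to an identifier string.
IdFunc = Callable[[bytes], str]

def sha1_id(data: bytes) -> str:
    """Full-content SHA1, the scheme described in this thread."""
    return hashlib.sha1(data).hexdigest()

def partial_id(data: bytes, sample: int = 4096) -> str:
    """Assumed partial hash: length plus the first and last `sample` bytes."""
    h = hashlib.sha1()
    h.update(len(data).to_bytes(8, "big"))
    h.update(data[:sample])
    h.update(data[-sample:])
    return h.hexdigest()

class Store:
    """Toy content store keyed by whatever identifier function it is given."""

    def __init__(self, id_func: IdFunc = sha1_id):
        self.id_func = id_func
        self.blobs = {}

    def put(self, data: bytes) -> str:
        ident = self.id_func(data)
        self.blobs[ident] = data
        return ident

    def get(self, ident: str) -> bytes:
        return self.blobs[ident]
```

The store never looks inside the identifier, so swapping `sha1_id` for `partial_id` changes nothing else. The cost is in the identifier itself: two same-length files that differ only outside the sampled regions get the same `partial_id`.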

Regarding rcs_import.cc, that is too big for a 5 minute inspection, so
do you know in more detail what changes are required?

Maybe the best bet is simply to try, and see what breaks?

yup. I don't actually know for sure that it's broken; my estimate was based on how much I could imagine any hypothetical breakage costing, not a detailed analysis. it may well already parse, but if it doesn't that's not going to take too long to fix. after that you probably just need to make sure it is decoded (uudecode?) and reconstructed properly (i.e. not using the ed-line-script thing textual RCS entries use).

Regarding big-file support: I was wondering whether that could be done
without changing sqlite: A big file can be defined as concatenations of
many smaller chunks. Even if you bump the limit to 2^32, it will still
fail for some users, so maybe it's better to come up with a scheme
without such a "low" limit?
I'm not sure how that fits into the architecture, though.

hmm. this comment, and your talk about network distribution, has given me a lot to chew on. sorry it's taken so long to respond. these are issues which I agonized over quite a bit near the beginning, and am still not completely happy with. your idea is good. I will proceed to talk out loud for a minute here.

currently, files are stored, as you have probably guessed, as head versions + xdeltas. the xdeltas reach into the past. head versions are identified by SHA1, deltas are identified by (pre,post) pairs of SHA1s -- the versions on either side of the delta. xdelta is itself based somewhat on hashing; common blocks are found by indexing all the blocks in the first file under their adler32 code, and rolling an adler32 window of the same block size over the second file one byte at a time. then it writes a copy/insert delta stream. when you make a change the forwards xdelta is queued to send to the network, and the reverse xdelta is stored in the database, moving the head version to your newly committed version.
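
The block-indexing and rolling-window step described above can be sketched in Python as follows. This is an illustration of the general technique, not monotone's xdelta implementation: `make_delta`, `apply_delta`, the tuple-based op format and the 16-byte block size are all assumptions made for the example.

```python
import zlib

MOD = 65521  # Adler-32 modulus

def make_delta(old: bytes, new: bytes, block: int = 16):
    """Return a list of ("copy", offset, length) / ("insert", bytes) ops."""
    # Index every aligned block of the old file under its Adler-32 code.
    index = {}
    for off in range(0, len(old) - block + 1, block):
        index.setdefault(zlib.adler32(old[off:off + block]), []).append(off)

    ops, literal = [], bytearray()
    pos = 0
    a = b = None  # rolling state; None means "recompute from scratch"
    while pos + block <= len(new):
        if a is None:
            code = zlib.adler32(new[pos:pos + block])
            a, b = code & 0xFFFF, code >> 16
        code = (b << 16) | a
        # The weak code only nominates candidates; confirm by comparing bytes.
        match = next((o for o in index.get(code, ())
                      if old[o:o + block] == new[pos:pos + block]), None)
        if match is not None:
            if literal:
                ops.append(("insert", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match, block))
            pos += block
            a = None
        else:
            literal.append(new[pos])
            if pos + block < len(new):
                # Slide the window one byte: drop new[pos], add new[pos+block].
                out, inn = new[pos], new[pos + block]
                a = (a - out + inn) % MOD
                b = (b - block * out + a - 1) % MOD
            pos += 1
    literal.extend(new[pos:])
    if literal:
        ops.append(("insert", bytes(literal)))
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the target by replaying copy/insert ops against `old`."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, length = op
            out += old[off:off + length]
        else:
            out += op[1]
    return bytes(out)
```

In these terms the forward delta is `make_delta(old, new)` and the reverse delta is `make_delta(new, old)`; applying the reverse delta to the new head version recovers the previous one.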

this storage management system need not be the case. another feasible form would involve removing the file/delta distinction and storing all files as hash trees, where each hash code identifies an existing data fragment (up to a fixed block size) *or* a subtree.
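
A minimal Python sketch of such a hash tree, under stated assumptions: `put_tree` and `get_tree` are hypothetical names, the store is a plain dict, and the block size and fan-out are deliberately tiny so the sharing is visible. It shows the idea only and says nothing about how monotone would actually lay this out.

```python
import hashlib

BLOCK = 4    # tiny fixed block size, for illustration only
FANOUT = 2   # children per interior node, also an assumption

def put_tree(store: dict, data: bytes) -> str:
    """Store `data` as a hash tree in `store`; return the root hash."""
    # Leaves: fixed-size fragments keyed by their SHA1.
    level = []
    for off in range(0, max(len(data), 1), BLOCK):
        frag = data[off:off + BLOCK]
        h = hashlib.sha1(b"leaf:" + frag).hexdigest()
        store[h] = ("leaf", frag)
        level.append(h)
    # Interior nodes: tuples of child hashes, keyed by the SHA1 of that list.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), FANOUT):
            kids = tuple(level[i:i + FANOUT])
            h = hashlib.sha1(("node:" + ",".join(kids)).encode()).hexdigest()
            store[h] = ("node", kids)
            nxt.append(h)
        level = nxt
    return level[0]

def get_tree(store: dict, root: str) -> bytes:
    """Reassemble the bytes under `root` by walking the tree."""
    kind, payload = store[root]
    if kind == "leaf":
        return payload
    return b"".join(get_tree(store, k) for k in payload)
```

Storing two versions that differ in one block adds only the changed leaf and the interior nodes on its path to the root; every other node is shared. Any version is read back by the same tree walk, so old and new files cost roughly the same to fetch.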

this would in a sense "speed up" access to old files -- or at least make access to all files roughly equal speed -- and as you mention it would integrate nicely with a type of network distribution we don't currently have any real abstraction for: "synchronizing". syncing hash trees works pretty well. it would also easily permit astronomically huge files.

the difficulty with this sort of approach (well, everything related to the rsync algorithm really) is getting the constants right.

you have pressure from one side to make the extents you identify large: the fewer extents you write, the smaller your encoding, and the smaller your summary index of extents is (== less time to search) when you're doing a rolling adler32.

on the other hand, you have pressure from the other side to make the extents you identify small: finer grained extents means that you identify more commonality between blocks.

in theory you can speed up the adler32 (well, for whole-database indexing we would probably move to a larger adler code) by maintaining a medium-sized bloom filter of all the adler codes in your database. it's important to realize how fast xdelta needs to be able to disqualify an adler code: it checks *every byte offset* in the second file. you commit a 64k file, it will want to check (64,000 - blocksize) adler codes.
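
A Bloom filter over adler codes could look roughly like the Python sketch below. `BloomFilter` and `candidate_offsets` are hypothetical names, the filter size and hash count are arbitrary, and the bit positions are derived from SHA1 purely for convenience. For clarity the scan recomputes `zlib.adler32` at each offset instead of rolling it.

```python
import hashlib
import zlib

class BloomFilter:
    """Fixed-size bit array; answers "definitely absent" or "maybe present"."""

    def __init__(self, bits: int = 1 << 16, hashes: int = 3):
        self.bits = bits          # assumed to be a multiple of 8
        self.hashes = hashes      # at most 5 with this SHA1-based derivation
        self.array = bytearray(bits // 8)

    def _positions(self, code: int):
        # Derive `hashes` bit positions from one SHA1 of the 32-bit code.
        digest = hashlib.sha1(code.to_bytes(4, "big")).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def add(self, code: int) -> None:
        for p in self._positions(code):
            self.array[p >> 3] |= 1 << (p & 7)

    def might_contain(self, code: int) -> bool:
        return all(self.array[p >> 3] & (1 << (p & 7))
                   for p in self._positions(code))

def candidate_offsets(bloom: BloomFilter, data: bytes, block: int):
    """Offsets in `data` whose window code passes the filter."""
    return [off for off in range(len(data) - block + 1)
            if bloom.might_contain(zlib.adler32(data[off:off + block]))]
```

The filter never rejects a code that was added, so every true match survives; a small fraction of unrelated offsets will also pass and must be ruled out by the real index lookup. The value is that most of the per-byte-offset checks are disqualified with a few bit tests.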

How is monotone doing in this area? Full support requires reimplementing
rsync, but maybe monotone can reuse rsync or something?


you know xdelta is based on rsync, right? :))

yeah, I'm definitely curious to figure out a way to run monotone's networking the other way around. currently it is "replay-based". committing something queues it for transmission, depots replay transmissions, etc. this has some advantages -- you can just concatenate communication history for example -- but seemingly just as many disadvantages. it is fragile, order-sensitive, and requires a bunch of contortions to compensate for.

I'd be just as happy to scrap the entire existing networking system and move to a synchronization-based one, custom protocol or otherwise, if I could work out a really efficient abstraction. synchronization is self-stabilizing, which I'm very attracted to. maybe hash trees are it.

note that any such solution has to synchronize not only file contents but metadata collections. but maybe we can structure things in a way which supports that. for example, suppose we sort all manifests by attached date cert (or, heck, lexicographically) and cluster them into hash trees with a pleasant branching factor. then when I want to sync with you, I send you my "highest" summary hash-tree list (a few hundred manifests) and we sync from manifests to certs and files, then into the file content hash trees. it might well work.
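
One way to picture the manifest-level sync is the Python sketch below. It is an illustration under assumptions, not a protocol design: manifest ids are taken to be lowercase hex SHA1 strings, clustering is by lexicographic hex prefix (branching factor 16) so both sides build the same tree, and `summary` and `missing_from_peer` are hypothetical names. Both id sets are local here; in a real exchange only the summary hashes, and the short lists at the leaves, would cross the network.

```python
import hashlib

def summary(ids, prefix: str):
    """Hash summarizing every id that starts with `prefix`, plus the members."""
    members = sorted(i for i in ids if i.startswith(prefix))
    digest = hashlib.sha1("\n".join(members).encode()).hexdigest()
    return digest, members

def missing_from_peer(mine, theirs, prefix: str = "", leaf_size: int = 4):
    """Ids in `mine` that `theirs` lacks, comparing summaries top-down."""
    my_hash, my_members = summary(mine, prefix)
    their_hash, their_members = summary(theirs, prefix)
    if my_hash == their_hash:
        return set()  # identical subtree: nothing to send
    if len(my_members) <= leaf_size:
        # Small enough to exchange the list itself.
        return set(my_members) - set(their_members)
    out = set()
    for digit in "0123456789abcdef":  # descend into the 16 child clusters
        out |= missing_from_peer(mine, theirs, prefix + digit, leaf_size)
    return out
```

Subtrees whose summaries agree are skipped without listing their contents, which is where the savings come from when two databases are nearly identical. Running the comparison again after a partial or interrupted transfer simply finds whatever is still missing, which matches the self-stabilizing property mentioned above.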

It seems Zbynek Winkler more or less nailed that one today, so that is
good news.

not sure about that being nailed, but it's certainly making very encouraging progress. I'd really like to see a native windows port someday (both to shed cygwin, and to get a native GUI).

The next step would be to develop a TortoiseCVS/TortoiseSVN kind of
client, but that should not require any changes in monotone as such.

it'll probably require some refactoring of commands.cc, but I'm not at all averse to cleaning that up. it's a bit of a rat's nest right now.


-graydon
