[Monotone-devel] Re: database schema questions

monotone-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] Re: database schema questions

From:	graydon hoare
Subject:	[Monotone-devel] Re: database schema questions
Date:	Wed, 21 Apr 2004 10:50:39 -0400
User-agent:	Mozilla Thunderbird 0.5 (X11/20040208)

Derek Scherger wrote:

Does the files table contain the heads of various files, and thefiles_delta table contain deltas from the head backwards to eachprevious parent version? I think this is conceptually similar to how rcsstores files (reverse deltas), and the reverse of how sccs does (ordid). Presumably manifests and manifest_deltas work similarly.

yes. though the storage organization is opaque to the rest of theprogram; there is no requirement that the database store "heads" (in anancestry sense) in the file table and "non-heads" in the "file_deltas"table. it happens to do so frequently, but it could also put allversions in "files" and do no delta storage at all. deltas are just amethod for the storage system to compress things.

this is a very important and rarely-stated aspect of monotone's design:file and manifest storage is *unrelated* to history. history is adescription of what happened, in terms of certificates which point toSHA1 values. the method of constructing the data for a given SHA1 valueis *completely independent* of that history.

storage happens to be built more or less in parallel with history -- inthe current scheme -- so often shares a similar graph shape, but wecould use any other scheme we liked too. some people have suggested"fragment buckets" -- essentially running xdelta over every N-byte blockin the entire history -- and I'd be willing to do that if I could comeup with an efficient implementation of the match algorithm. likewisesuffix trees or burrows-wheeler blocks or whatever. the storage layer isdecoupled[1].

It seems that branch certs are "sticky" in that they will get added tosuccessive manifests that descend from a parent in the branch. Is thisdone purely with the branch setting in the MT/options file?

yes, or the --branch command-line option. they are normal certs, theyjust happen to be added for you automatically.

Given the previous paragraph would it make some sense to store manifestsas a table with columns like manifest_id, file_id and file_name?

yes, but that would consume more bytes than the current form. in somecases quite a few more bytes. it is important to note that we storemanifests as *gzipped* text. this means we get free statistical datacompression. look at a large manifest (say gcc: about 900k). most of thebytes on most of the lines of that manifest are identical path prefixes:

...gcc/testsuite/consistency.vlad/layout/i960-97r2-results/c-longlong-2-c-short.out++

gcc/testsuite/consistency.vlad/layout/i960-97r2-results/c-pointer-1-c-char.out++
gcc/testsuite/consistency.vlad/layout/i960-97r2-results/c-pointer-1-c-double.out++
...

that happens to compress very well; in this case it is nearly 10:1compression. even with the subsequent base64 encoding, the result isonly 130k.

now, we could possibly achieve the same (or even better) compressionresult "explicitly" at the storage level by storing directories as firstclass citizens, and putting files as leaves in directories, and chainingthrough the directory table to get to a file.. but that adds a lot ofcomplexity. I didn't want to bother.

This would allow any file to get all of the names it is known by and a listof all the manifests it is part of very easily.

...

I do see that this represents some additional complexity in that filesand manifests are then stored and treated differently. But would therebe some advantage to making file names easily available? I have theimpression that getting a file name is somewhat involved in terms ofunpacking manifests and constructing the associated maps. I'm alsowondering whether a manifest merge under this representation would bepossible as a simple sql union of two other manifests.

there is certainly an argument to be made. early versions of monotonedid represent data this way. I removed it for the following reasons:


 - there is a big space efficiency win with blobs of text (gzip)

 - merging is more difficult than SQL union in all but the most trivial
   cases, so you wind up having to construct a memory representation
   anyways. diff_patch.cc::merge3 is one of the largest and most
   complex functions in monotone (300 lines), and that's the simplified
   version. merging trees is kinda gross.

   notice the "union oriented" vision I initially had (struct
   manifest_changes, in manifest.hh) turned out only to be useable as
   a stepping stone in building the "real" representation of inter-tree
   changes, which is struct patch_set, in patch_set.hh. as for
   manifest.cc::apply_manifest_changes, it is actually dead code now.

 - my representation of "a file" at the database level is really not
   compatible with the notion people have of "a history line". storage
   delta chains can be broken, or mis-ordered, or whatever. they don't
   necessarily correspond to time or work. so suppose you want to read
   off the "history" of a file; you cannot say "look at the file_deltas
   table for its predecessor" because that might be unrelated to the
   record of history. same is true or manifest_deltas; it might not
   actually describe history. only the ancestry graph describes history.
   so you need to go:

     file->manifest->cert->manifest->file

   each step of the way. storing things in a more explicit fashion
   doesn't seem to win you much given algorithms that operate this way.

if you want to conduct an experiment and try rewriting part of theschema in terms of explicit directories and manifests-as-tuples, I won'tstop you. it's possible it would be a space win and maybe simplify somealgorithms. I found the current arrangement beneficial though.


-graydon

[1]: small fib: the CVS importer and some of the old, dead queue-walkingcode manipulate deltas directly. they do this for speed. you could fakethem on another storage system, but they'd slow down.

[Prev in Thread]

Current Thread

[Next in Thread]

[Monotone-devel] database schema questions, Derek Scherger, 2004/04/21
- [Monotone-devel] Re: database schema questions, graydon hoare <=
  - [Monotone-devel] Re: database schema questions, Derek Scherger, 2004/04/21

Prev by Date: [Monotone-devel] Re: write access to my public server (was Re: [PATCH] add xemacs ...)
Next by Date: [Monotone-devel] Re: write access to my public server (was Re: [PATCH] add xemacs ...)
Previous by thread: [Monotone-devel] database schema questions
Next by thread: [Monotone-devel] Re: database schema questions
Index(es):
- Date
- Thread