[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Monotone-devel] database compaction

From: Nathaniel Smith
Subject: Re: [Monotone-devel] database compaction
Date: Wed, 17 Oct 2007 18:16:16 -0700
User-agent: Mutt/1.5.13 (2006-08-11)

On Wed, Oct 17, 2007 at 09:35:35PM +0200, Markus Schiltknecht wrote:
> I've recently been trying to implement some of the ideas from the 
> database compaction wiki page [1].


> == Compact Heights ==
> As the wiki says, uleb128 wouldn't work too well, because compressed 
> heights couldn't be compared with memcmp(). As I've had to do a lot with 
> JPEGs recently, I've stumbled across Huffman encoding. Given a 
> hand-crafted Huffman Tree, the memcmp() property could be maintained. 
> But even though that idea fascinated me, and I've already begun to 
> compile statistics about distribution of height values, I had to realize 
> that the potential gain is very limited. As an example, my current 
> working database is 618.1 MiB. Trimming all heights to a single-byte 
> value - which is way too optimistic - results in a 612.8 MiB file. 
> Maximum savings: less than 1% :-(  Thus overly complicated things like 
> Huffman are definitely not worth the effort. I didn't write any code here.

I'm honestly not sure that we even get any benefit from using memcmp
comparison on heights.  I think the only way we use heights is to load
them into memory, and if data is moving from sqlite into mtn proper
than doing any kind of conversion on it is trivial and essentially
free anyway -- I only brought up this memcmp comparison thing in the
first place in case it let us do something clever with sqlite indices,
and AFAIK we aren't.

> == Putting heights in the revisions table ==
> I'm not sure why exactly heights haven't been stored in the revisions 
> table from day one on, but I certainly think it's a good idea. It saves 
> lookups and increases locality in case one needs the heights. And I 
> don't think it hurts much if you don't. To check confirm or reject these 
> assumptions, please refer to revision 45501e62.

The tricky part about this is the regenerate caches stuff -- revisions
are sacrosanct, because they give "ground truth", while heights may
get wiped out and regenerated.  So the db needs to be able to
represent "here are all the revisions, none of them have heights
assigned", and we need to be able to detect when the db is in this
state.  (And the logic around figuring out what state the db is in at
any given time is already kind of convoluted :-/).  If you've solved
that, though, cool, it's probably doable.

-- Nathaniel

"Of course, the entire effort is to put oneself
 Outside the ordinary range
 Of what are called statistics."
  -- Stephan Spender

reply via email to

[Prev in Thread] Current Thread [Next in Thread]