Re: [Gnu-arch-users] archive storage format comments on the size

From: Andrea Arcangeli
Subject: Re: [Gnu-arch-users] archive storage format comments on the size
Date: Tue, 30 Sep 2003 03:11:32 +0200
User-agent: Mutt/1.4.1i

On Mon, Sep 29, 2003 at 05:46:07PM -0700, Tom Lord wrote:
> Furthermore, in a general purpose changeset format, rather than one
> just for archive storage, bidirectionality and context are certainly

Bidirectionality is definitely preserved; that's not the issue.

Context, as I understand it, is only used during merges.

Oh, one idea could be to nuke the whole context and duplicated data
during superpatchset generation. Superpatchsets won't handle
in-between merges anyway.

Frankly, I couldn't care less how big a patchset is for the last 300
tar.gz ones. I care how big it is for the ones I wrote two years ago
and that I'll never look at again, but that I still want around, just
in case, to retain all the information.

That way you get not only the most efficient compression, but you also
nuke some of the unnecessary plaintext payload first, without any
disadvantage. The superpatchset is meant precisely for things that are
never used for merging, so the context is useless in a superpatchset.
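Nuking the context is essentially what a zero-context unified diff gives you (what `diff -U0` produces). A minimal sketch with Python's difflib, on made-up file contents, shows how much of a small patch is pure context:

```python
import difflib

# Two hypothetical revisions of a 100-line file differing in one line.
old = [f"line {i}\n" for i in range(100)]
new = list(old)
new[50] = "line 50, changed\n"

# Ordinary unified diff: 3 context lines on each side of the hunk.
d3 = "".join(difflib.unified_diff(old, new, "a/file", "b/file", n=3))
# Zero-context diff, as a superpatchset could store it.
d0 = "".join(difflib.unified_diff(old, new, "a/file", "b/file", n=0))

print(len(d3), len(d0))  # the zero-context diff is markedly smaller
```

For a single-line change, the six context lines are most of the hunk; across thousands of hunks in an old archive, that plaintext adds up.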

Then you can use the usual cacherev on top of the patchset. But the
first checkout of the patchset will be very efficient too (just quite
costly in terms of disk space, so it may be preferable to suggest
superpatchsets no bigger than 100M uncompressed).
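The duplicated-data point is easy to check: one compression stream over a concatenated superpatchset shares its dictionary across all the member patchsets, while per-patchset compression pays for the shared text every time. A rough sketch with zlib and synthetic patch texts (not real arch patchsets):

```python
import zlib

# 30 synthetic patchsets that share most of their text (context, headers),
# mimicking consecutive revisions of the same file.
shared = "".join(f" line {i} of the same surrounding context\n" for i in range(40))
patchsets = [f"--- a/file\n+++ b/file\n@@ -{v} +{v} @@\n{shared}+change in revision {v}\n"
             for v in range(30)]

# Compress each patchset on its own, as separate .gz files would.
individual_total = sum(len(zlib.compress(p.encode(), 9)) for p in patchsets)
# Compress the concatenation once, as a superpatchset would.
combined = len(zlib.compress("".join(patchsets).encode(), 9))

print(individual_total, combined)  # one stream wins by a wide margin
```

The 32KB zlib window covers the repeats between neighbouring patchsets, which is why the 100M uncompressed ceiling costs little: the savings come from nearby duplication, not from the far end of the stream.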

> vital.   So if the dumb-server archive format were to reject
> bidirectionality and context, that would mean that it could not
> re-use a general purpose changeset format.

I'd sure like the dumb server to keep its genericity.

> I am uncertain of the usefulness (and even the meaning) of the
> measurements you've offered.   In any event, each time I or someone

My measurements are definitely important to me. The 2-6x space
overhead in the archive that you acknowledged is definitely
significant to me.

> working with me has quantified some of these issues, over a fairly
> broad sampling of archives, the result has at least subjectively been
> that the current trade-offs are quite reasonable.   I'm not sure how
> much more can be said about that without a larger design or deployment
> context. 
> It is perhaps worth observing that when optimizing for `get', we
> already use client-side revlibs and server-side cachedrevs -- further
> trading space for time -- and now some users are working out how to
> add summary deltas: yet another space for time trade-off.   Generally,
> the consensus is that, at these scales, time is far more precious than
> space.  

If you implement the superpatchsets I proposed, you will save space,
time, and network bandwidth at the same time. The only assumption they
make is that you're not going to merge against that obsolete old data
often.

I believe the only annoyance could be if you want to search into the
patchsets with a cvsps -f equivalent, but even that will run an order
of magnitude faster by unpacking the whole superpatchset and working
only on uncompressed diffs in /dev/shm. Or you can split a
superpatchset into smaller ones.

And inside the superpatchset, since merging is forbidden, you should
definitely nuke all the contexts. They can be regenerated at near-zero
cost during the superpatchset split operation.
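Regenerating the context during a split is cheap because a zero-context diff plus the old tree is enough to rebuild the new tree, after which you re-diff with whatever context you like. A sketch of that round trip, with a minimal patch applier of my own (it only handles zero-context diffs as difflib emits them, nothing arch-specific):

```python
import difflib
import re

def apply_zero_context(old, diff_lines):
    """Apply a zero-context unified diff (difflib.unified_diff, n=0)."""
    out, pos = [], 0  # pos: next unconsumed index into old
    for line in diff_lines:
        if line.startswith("--- ") or line.startswith("+++ "):
            continue  # file headers
        if line.startswith("@@"):
            m = re.match(r"@@ -(\d+)(?:,(\d+))? \+", line)
            start = int(m.group(1))
            count = int(m.group(2)) if m.group(2) is not None else 1
            # count == 0 means "insert after old line `start`"
            anchor = start if count == 0 else start - 1
            out.extend(old[pos:anchor])  # copy untouched lines
            pos = anchor
        elif line.startswith("-"):
            pos += 1              # line deleted from old
        elif line.startswith("+"):
            out.append(line[1:])  # line added in new
    out.extend(old[pos:])
    return out

a = ["one", "two", "three", "four", "five"]
b = ["one", "TWO", "three", "four", "five", "six"]

# Store only the zero-context diff...
d0 = list(difflib.unified_diff(a, b, n=0, lineterm=""))
# ...rebuild the new tree from old tree + diff...
rebuilt = apply_zero_context(a, d0)
# ...and regenerate a full-context diff at split time, nearly for free.
d3 = list(difflib.unified_diff(a, rebuilt, n=3, lineterm=""))
```

The split operation already walks the trees, so the extra re-diff is a single linear pass per file.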

> An exception to the general rule of the concensus is revision
> libraries which are quite suitable for many projects, but clearly not
> suitable (when used as a fully-populated library) for large trees
> managed by people with less than the latest and greatest hardware.

I guess that will change after they're hardlinkable. Even if you have
the greatest hardware, you'll simply run into trouble later; you still
want hardlinks and the archive storage to be as compact as possible.
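The hardlink point is measurable: two revision-library entries for a file that didn't change between revisions can share one inode, so the second copy costs a directory entry instead of the file's data. A small POSIX-only sketch (the file names are made up, not the revlib layout):

```python
import os
import tempfile

libdir = tempfile.mkdtemp()
rev1 = os.path.join(libdir, "patch-1.file")
rev2 = os.path.join(libdir, "patch-2.file")

with open(rev1, "w") as f:
    f.write("unchanged contents\n" * 1000)

# The file didn't change between revisions: hardlink instead of copying.
os.link(rev1, rev2)

same_inode = os.stat(rev1).st_ino == os.stat(rev2).st_ino
print(same_inode, os.stat(rev1).st_nlink)  # True 2
```

Since most files are untouched between adjacent revisions, a hardlinked library's marginal cost per revision is roughly the size of the changed files only.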

> That is why you can find threads on the list talking about
> "interpolated diffs storage" for revision libraries, why the bug
> tracker has a request for options to `library-add' to make sparse
> libraries easier to manage, and why there's some discussion about
> where and what sort of additional hooks to drop in to the code to help
> manage sparse libraries.  Of those solutions, all but interpolated
> diffs are trivial changes -- and so they seem a good "fit" for an
> economic situation in which, in a few years, most of the issues will
> fade away.  Interpolated diffs are interesting because, regardless of
> storage costs, they support a very fast implementation of `annotate'.

NOTE: when I care about space considerations, I only care about the
_archive_ and the _network_ fetch, losslessly (i.e. the whole granular
list of patchsets, all of them). The only thing you have to back up is
the archive.

The caches and revlibs, or whatever is generated locally and
temporarily and stores no unique info, can grow as much as we want,
gigabytes and gigabytes. It sounds perfectly fine to grow everything
optionally _locally_ and to throw lots of RAM and disk space at it,
like revision libs (hopefully hardlinkable soon, etc.) or skiplists of
patches to reduce the number of patchsets applied to log(N) to boost
checkout performance. People will choose what best fits their project
as a function of their resources.
But the archive should be as compact as possible, because that's the
thing I don't want to delete even after I've finished working on it.
Assume I have 4G of scratch space: I'm fine using at least 50% of it
for cache, and when I change projects I can delete it all. But not the
archive, with the commits and the year-old data in the superpatchsets.

Andrea
