Re: [Gnu-arch-users] archive storage format comments on the size

From: Andrea Arcangeli
Subject: Re: [Gnu-arch-users] archive storage format comments on the size
Date: Tue, 30 Sep 2003 02:08:46 +0200
User-agent: Mutt/1.4.1i

On Mon, Sep 29, 2003 at 07:10:43PM -0400, Miles Bader wrote:
> On Tue, Sep 30, 2003 at 12:39:24AM +0200, Andrea Arcangeli wrote:
> > >    The space inefficiencies in arch are that it adds: contents of
> > >    deleted lines and files, context of diffs, an extra copy of 
> > >    the log file, and some overhead costs associated with using
> > 
> > why can't the not strictly needed stuff be removed? We know the
> > patchsets can't reject during checkout, why should we carry all this
> > overhead with us when that can be deduced at runtime?
> The question, of course, is `what's not needed?'
> Even if you're only talking about a change to a single file, a delta in a CVS
> file and an arch changeset are rather different things: the CVS delta can
> generally only be applied in the strict context in which it was generated, it
> is almost useless in any other context.  An arch changeset, OTOH, is useful
> in many contexts (and this isn't just a theoretical advantage either, many
> merging scenarios have you applying changesets in a different context from
> which they were created).  The current arch changeset format is optimized for

Actually, IMHO the biggest benefit of the arch-format changesets is the
zero-cost tags (i.e. zero-cost branch creation). That's the very reason
I have a chance to manage 300-odd branches. This could never be achieved
in other revision control systems that spread the changesets into
per-file storage. When Linus switched to bitkeeper I asked him how fast
it could create branches, just to evaluate whether I could use it to
manage my pure branches. I recall he told me branches were much faster
than in cvs, but still on the order of a dozen seconds IIRC. Here,
deleting a branch is a rm -r; it's so much more comfortable to do
branches without per-file storage.

> this sort of flexible usage, instead of for raw storage efficiency.  I like
> to think of arch as being like the traditional `trading patches' style of
> development, except with all the record-keeping taken care of for you.

Exactly, this is the very same way I see it too ;) i.e. a patch
management system more than a revision control system, and that's what
convinced me it could be the way to go.

> I suppose the `extra' info could be deduced somehow, but that obviously adds

that's my point. It won't stop being a patch management system just
because you diff with diff -u0. The storage should simply be compact
and efficient; it doesn't matter whether you can parse it with your
eyes in a text editor. Even cvs and SCCS are unreadable. That has to be
expected: if you want it compact, it likely won't be readable by humans
(I mean, not easily).

It's the concept of global patchsets that remains: the tag will still be
zero-cost no matter how we store the patchset, as long as the patchset
contains the very same information. Then we can trade some disk space
for some cpu time when merging with a diff -u0.
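To make the disk/cpu trade-off concrete, here's a minimal sketch (the
file names are made up for the demo) comparing the same one-line change
stored with full context versus context-free:

```shell
# Hypothetical demo files: one single-line change, stored two ways.
cd "$(mktemp -d)"
printf 'a\nb\nc\nd\ne\n' > old.txt
printf 'a\nb\nX\nd\ne\n' > new.txt

# diff exits 1 when the files differ, so guard with || true
diff -U2 old.txt new.txt > with-context.patch || true
diff -U0 old.txt new.txt > no-context.patch  || true

# the -u0 form carries the very same hunk, minus the surrounding
# context lines, so it is strictly smaller
wc -c with-context.patch no-context.patch
```

Both patches describe the identical change; only the stored bytes
differ, which is exactly the point about archive size.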

It's hard for me to tell how many times you would need to regenerate the
diff -u2 from a -u0; all types of merging definitely are such a case,
since the -u2 is the heuristic that says "yes, there are no rejects in
this merge". It's also hard for me to evaluate how costly it would be to
regenerate it in all possible scenarios. There may be corner cases that
prevent us from using the -u0 reasonably efficiently.
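A sketch of the regeneration step in question (hypothetical file names,
not arch's real layout): as long as the exact base tree is available,
the stored context-free patch can be applied to it and re-diffed to
recover the full-context diff the merge heuristic needs:

```shell
cd "$(mktemp -d)"
printf '1\n2\n3\n4\n5\n' > base.txt
printf '1\n2\nthree\n4\n5\n' > changed.txt

# the compact form that would live in the archive
diff -U0 base.txt changed.txt > stored.u0 || true

# apply it to a pristine copy of the base; a -u0 patch cannot
# mis-apply here because this is exactly the tree it was made against
cp base.txt rebuilt.txt
patch rebuilt.txt < stored.u0

# regenerate the context diff on demand, for merging
diff -U2 base.txt rebuilt.txt > merge.u2 || true
```

The regenerated merge.u2 carries the context lines again; the cost is
one patch application and one re-diff per changeset, paid only when
merging.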

> additional overhead.  I think that would be especially noticeable with the
> current `dumb server' network model of arch -- any excessive trawling around
> in a remote archive kills you due to the network latency; presumably a `smart
> server' could use different tradeoffs.

As said, nothing will change for checkouts with diff -u0, nor for
mirroring. If you need to merge and regenerate the diff -u2 often, you
can cache things on the client first.

> > you can sure solve problems by throwing money into the hardware, these
> > days storage is cheaper than it has ever been, but I don't
> > normally take it as a good argument while developing software
> Yeah, sometimes it drives me nuts when Tom uses that argument -- if I had
> excess disk space (I don't!) I'd rather use it to store more _source_, unless

disk space, like cpu speed, will never be enough. We want the very best
software technology, no matter how complex it is to build.

> the inefficiency buys me something.  In the case of arch changesets,

agreed: if the inefficiency buys something relevant, then it's surely worth it

> umm.... I'd say the increased flexibility is worth it (a smart server may be
> the way to go for optimizing disk space in the future, without losing
> flexibility), but e.g. in the case of .arch-ids/*.id files, I don't think I
> gain enough to offset the overhead.

100% agreed. And frankly I don't care much about the size of the working
dir ;). I only care about the size of the archive.

> > I understand you have to stat all files in the tree (I don't want to tla
> edit), but I don't see why you have to stat all the internal _patchsets_
> > metadata inside the {arch} directory. I just don't see that.
> I think it's just that for arch, {arch} is _not_ a special case for most
> operations -- it's just treated as part of the source tree when
> making/applying changesets etc.  I think this is _very_ clever, in that it
> simplifies the implementation greatly by not requiring tons of special cases
> to handle arch meta-data.  Perhaps there are optimizations that could be done
> based on knowledge of the structure of {arch}, but I think that's something
> that requires careful thought, as I don't think you want to change the
> _semantics_ of {arch} at all.

So this really sounds like something that *has* to be optimized. I
totally agree this is a cool and powerful property (treating {arch} the
same way as the real data). But there is no way at all I could ever edit
the patchset files, so those patchset files (the only ones generating
the huge N in O(N)) should be treated magically. They're the only ones
that grow into the thousands and thousands.

actually a tla edit may be ok too, for quicker checkins of big trees
like the kernel, but for normal operations I'm fine with waiting a
second before commit, as long as it doesn't go look into the patchset
files in the {arch} directory. There are tens of thousands of patchsets
in 2.5; I don't really want to wait 10 seconds.
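As a sketch of the special-casing meant here (the layout below is a
made-up miniature, not arch's real on-disk structure): a tree walk that
prunes the metadata directory never stats the patchset logs at all, so
the walk stays O(source files) rather than O(source files + patchsets):

```shell
cd "$(mktemp -d)"
mkdir -p 'tree/{arch}/patch-log'
touch tree/a.c tree/b.c 'tree/{arch}/patch-log/patch-1'

# list source files without ever descending into {arch};
# -prune stops find from stat'ing anything under the metadata dir
find tree -name '{arch}' -prune -o -type f -print
```

Only tree/a.c and tree/b.c are visited; the patch-log entries, however
many thousands there are, cost nothing.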

> > Is diff -u0 one of the ideas that will be implemented?
> Are you _really_ sure that's what you want?  It would be very dangerous
> (for the same reason that -u0 patches are dangerous in general)...

They're not dangerous at all in the checkout and mirroring operations.
Of course during merging I _need_ -u2; that's the heuristic that decides
whether something rejects or not. But I'm sure the -u2 can be
regenerated dynamically as a function of the -u0.

> I don't know, maybe there could be some sort of `archive crunch' operation
> that went through an archive and reduced the amount of context information in
> changesets, making them applicable only in the strict context of their branch
> (and of course some note should be made of this so that tla would refuse to
> do otherwise)...

I was thinking of having this by default, and of always regenerating the
-u2 during merging. We should simply evaluate how costly it is to
regenerate it in all merging scenarios.

Andrea
