gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gnu-arch-users] Re: [BUG] FEATURE PLANS: "perfect" summary deltas


From: Aaron Bentley
Subject: [Gnu-arch-users] Re: [BUG] FEATURE PLANS: "perfect" summary deltas
Date: Thu, 03 Jun 2004 09:13:52 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040309)

Tom Lord wrote:
Here's a plan for self-managing summary-deltas and cached revisions
with no changes to archive format (beyond adding "system branches")
and with simplifications to the revision builder.

I'm not sure this is the right approach.  It's nice to reuse existing
conventions and code, but in this case, I'm not sure that the problem
and solution are a good fit.

To me, one summary revision per summarized revision seems like a lot.
For typical projects, this will more than double the archive size.  I
doubt we'll see significant improvements in build speed when we avoid
merely 1 patchset.

What are the cases?  Here's what I see:

Alice: New downloader.  Someone doing their first get, without any
related revision in their library

Bob: Follows the releases of a project.  Updates to the latest revision
when it it's gone gold.

Charlie: Runs the latest devel code.  Updates frequently.  Either a
developer on the project, or some kind of fanboy.

For Alice, cacherevs are the best thing going.  If there's no nearby
cacherev, but there is a summary delta, she'll start with the cacherev,
then apply the summary-delta.

Bob gets the most bang out of summary deltas, because they allow him to
skip from one release to the next, without a hundred or so intervening
patches.  But Bob's needs are pretty predictable-- we can make pretty
confident guesses about which revisions he'll get.  So while summary
revisions do make sense for his case, we don't need anything like a 1:1
ratio.  1:100 can be fine, as long as they're the right 100 revisions.

I don't think we can serve Charlie much better than we currently do.
Since he's only 2 or 3 revisions away,  we can't save him much bandwith
or changeset application time, even if we have the exact right summaries.

Charlie occasionally needs to get an old revision for a star-merge.  The
backbuilder helps him there, since he can use a recent revision to
produce it.  But since this is an old version, Charlie's probably
already downloaded it once.  If he has it in a mirror, that's better
than a remote archive.  And if we had archive caching, we could ensure
that he never had to download the same patch twice, yet never mirror.

So it seems to me that for many purposes, it would be fine if summaries
happened only between cacherevs.  That way, Alice could get the
cacherev, and Bob could use the summary to move from the previous
release (which was also cachereved) to the new one.

Benefits of summary deltas:
I haven't done significant testing, just tla--devo--1.2.  For that
project, the delta is 334K, and the aggregate of all patches is about
493 K, so it's about 33 % smaller.  That's certainly an improvement, but
it has the cost that you can only use it for skipping from the beginning
to the end.

I'm also not entirely happy with using system branches for storing
summary deltas.  The thing I don't like about it is that it mixes
archive format issues with namespace issues.  A smart server would not
need to maintain summary revisions.  Instead, it could produce requested
deltas on the fly, perhaps keeping the most popular ones in cache.

Instead of looking for summary deltas on the client side, the protocol
up to archive.h should ask "What can you give me to produce foo from
bar?"  Dumb servers can respond with plain old revision and summary
deltas, and smart ones can say "I can give you delta-foo---bar".

Also, it's likely that sooner or later, people will start changing
history by removing summary deltas from mirrors.  That's a bigger
problem if summary deltas are part of the namespace than if they're just
part of the archive.h protocol.

Also, this approach restricts summary deltas to only summarizing the difference between revisions within one branch, and there may be cases where intra-version/intra-branch (e.g. update from tla--devo--1.2 to tla--devo--1.3)

`summarize' ensures that every revision in the summarized branch,
a corresponding revision exists in the summary branch.  This
invarient is maintained:

        % tla get proj--branch--1.0 wd

yields exactly the same tree as:

        % tla get proj--branch+summary--1.0 wd
        % tla remove-log-version --dir wd \
              proj--branch+summary--1.0

I'm also not thrilled with removing patchlogs to produce a target revision, but I can handle it.

For every summary branch there is a version variable, "summary-basis"...

I worked on that script with Greek0 for a while.  The closest thing I
saw was a script that output revision ordinals of mostly 2.  It looks
like you want it to scale by powers of 2, which is usually good, but to
reduce download size, we don't want 2s, we want 16s and 64s and 128s.
That's when we'll see the big efficiency improvements.

Ignoring the exception I've promised to the preceding rules for a
moment, let's suppose that I _have_ a REV_HAVE revision of the summarized version and that I _want_ a REV_WANT revision of the summarized version.

Let's further suppose that the corresponding REV_HAVE and REV_WANT of the _summary_ version (the "+summary" version) both have the same
"Summary-basis:" header.  Call the basis revision named there, BASIS.

Then I can construct REV_WANT as follows:

LET

   wants_simpler := 2^(floor (log_2 (relative_revision_ordinal (BASIS, 
REV_WANT))))

   to_simpler := 2^(floor (log_2 (relative_revision_ordinal (BASIS, REV_WANT))))

Aren't these the same equation?
How could wants_simpler not be equal to to_simpler?

Anyhow, given a list of deltas, I can certainly get the backbuilder to use the appropriate deltas to build the revision. There's some support for deltas already, and once I've got auto-reversing implemented, the rest should be easy.

That algorithm will build any revision from any other revision sharing
the same summary basis in at most 4 changeset applications.

Is that the point of deltas? Reducing the number of changeset applications? While that seems kinda nice, it doesn't look like a big win to me, especially if changeset composition has the benefits you envision.

(p.s. one way to speed up changeset application is to avoid snapping inode-sigs until the last changeset has been applied. This can probably be done in less than 10 lines of build-revision code)

The exception with summary branches concerns changes of summary basis:

If creating a new summary revision as described above would result
in a situation where following the construction algorithm described
above would fetch changesets from an archive whose total size exceed
the size of a cacherev of the new revision, then to create the new
summary revision, tag the summarized revision exactly and change the
basis to the newly committed revision.

For tla--devo, this looks like an extremely rare case. The cacherev for tla--devo--1.2--base-0 is 1.4 M, and but the aggregate patch size for 115 patches is 493K.

Alice would prefer a lower threshold, since Alice ideally wants a cacherev to exist for her target revision. For Bob, though, this is a good rule.

Anyhow, that's my take.  Make what you like of it.

Aaron





reply via email to

[Prev in Thread] Current Thread [Next in Thread]