
From: Jason McCarty
Subject: Re: [Gnu-arch-users] Re: situations where cached revisions are not so good
Date: Sat, 27 Sep 2003 21:27:07 -0400
User-agent: Mutt/1.5.4i

Tom Lord wrote:
>     > From: Jason McCarty <address@hidden>
>     > Well, it's many hours later and I've done a little benchmarking. I don't
>     > want to report my results until I reach a conclusion about a good course
>     > of action, but initial tests are actually very promising. Summary deltas
>     > taken every 200KiB in tla--devo--1.1 are on average half as big as the
>     > cumulative changesets they span. CPU usage is much lower than applying
>     > each revision individually, by close to two orders of magnitude (I
>     > wonder if this is a bug in tla).

Actually I forgot to include the tree-copy/untar/gzip time, to make my
test analogous to the normal process (copy a full tree and patch it).
It's more like a 5-25x speedup when I include those.

> That's cool and thanks for exploring the issue.
> There is some design work to do if these are to be implemented.
> I've little doubt that summary deltas can, in principle, significantly
> reduce network traffic and build_revision time compared to not having
> tuned local caches and mirrors.

I would still keep local mirrors in the picture; they provide the same
benefits with or without summary deltas.

As to your design questions, here's a simple proposal:
  Summary deltas will be stored in the same directory as their target
  revision (i.e. a summary of patch-32..57 would go in patch-57's
  directory). I don't make use of cachedrevs, but they could be
  integrated if desired (for example if a revision is really huge, a
  cachedrev might be better than summaries).

  Implementation of arch_build_revision(ALMOST, TARGET, DESTDIR):
  1) If ALMOST is nil:
     2) Try to find a pristine or libraryrev ALMOST which is the
        greatest lower bound of TARGET, with the same version as TARGET.
     3) Set ALMOST to TARGET's base-0.
  4) If ALMOST == TARGET, copy ALMOST to DESTDIR and return (may have to
     call arch_build_revision if it's a continuation, or connect to
     archive to grab import).
  5) Connect to TARGET's archive.
  6) Find a DELTA to TARGET whose FROM endpoint is the least upper bound
     of ALMOST (this will be either a revision or a summary delta in
     TARGET's directory). Continuations count also; let FROM be the
     revision it's a continuation of. Use a continuation from another
     version if it's all that's there.
  7) If DELTA is a continuation from a different version than TARGET:
     8) Call arch_build_revision(0, FROM, DESTDIR).
     Else call arch_build_revision(ALMOST, FROM, DESTDIR)
  9) Retrieve and apply DELTA to DESTDIR.
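The steps above can be sketched in Python under simplifying assumptions: revisions within a single version are plain integers, `deltas` maps each revision to the FROM endpoints of the deltas stored in its directory, and continuations, pristine/library lookup, and archive access are left out. `build_plan` and `deltas` are hypothetical names, not existing tla code:

```python
def build_plan(almost, target, deltas):
    """Return the list of (from, to) deltas to apply to turn revision
    `almost` into `target`.  `deltas` maps each revision to the FROM
    endpoints of the deltas stored in its directory: the plain
    changeset from the previous revision, plus any summary deltas."""
    if almost == target:
        return []  # step 4: nothing to do
    # Step 6: among deltas ending at `target`, pick the one whose FROM
    # endpoint is the least upper bound of `almost`.
    frm = min(f for f in deltas[target] if f >= almost)
    # Step 8, then step 9: build FROM first, then apply the chosen delta.
    return build_plan(almost, frm, deltas) + [(frm, target)]

# Example: plain changesets for patch-1..patch-134, plus summary deltas
# (12, 75), (75, 130), and (12, 130) stored in their targets' directories.
deltas = {n: [n - 1] for n in range(1, 135)}
deltas[75].append(12)
deltas[130].extend([75, 12])
plan = build_plan(36, 134, deltas)
```

Here the plan walks 36..75 with single changesets, jumps 75->130 via the summary, and finishes 130..134 with single changesets.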

Allowing reverse patching to build revisions could give a 2x speedup on
average, but I think you might then need to store an auxilary index of
the summaries for each version, to quickly construct the global graph.
The implementation would likely be significantly more complex.

> * can summary deltas "overlap" or must they not?

It would not cause any problems, but if the summaries aren't transitive,
my algorithm may pick a less efficient set of deltas than a human
analyzing which deltas to apply would. So if you have 36 and want 134,
the summaries {(12, 75), (75, 130), (12, 130)} will patch with
36..75->130..134, but given the summaries {(12, 75), (75, 130),
(39, 129)}, my algorithm will do 36..75->130..134 when it should do
36..39->129..134. This seems to indicate that an n-ary tree of summaries
(possibly chopping off the top few nodes) would yield the best
time/space tradeoff.
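To make that concrete, here is a hypothetical cost count (one unit per delta applied) comparing the greedy "least upper bound" choice against the true shortest chain for the two summary sets above; both functions are illustrative sketches, not tla code:

```python
import collections

def greedy_cost(start, target, summaries):
    """Deltas applied when walking back from `target`, always taking the
    delta whose FROM endpoint is the smallest revision >= start (the
    least-upper-bound rule from the algorithm above)."""
    cost, rev = 0, target
    while rev > start:
        froms = [a for (a, b) in summaries if b == rev and a >= start]
        rev = min(froms + [rev - 1])
        cost += 1
    return cost

def optimal_cost(start, target, summaries):
    """Shortest delta chain, found by BFS: from revision r you can apply
    the plain changeset to r + 1 or any summary (r, b)."""
    dist = {start: 0}
    queue = collections.deque([start])
    while queue:
        r = queue.popleft()
        if r == target:
            return dist[r]
        for n in [r + 1] + [b for (a, b) in summaries if a == r]:
            if n <= target and n not in dist:
                dist[n] = dist[r] + 1
                queue.append(n)

transitive = {(12, 75), (75, 130), (12, 130)}
nontransitive = {(12, 75), (75, 130), (39, 129)}
```

With the transitive set, greedy and optimal both apply 44 deltas; with the non-transitive set, greedy still applies 44 where 9 would suffice (36..39, then 39->129, then 129..134).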

I think that transitive deltas would probably be most effective
regardless of the algorithm.

> * is their placement on dumb servers entirely manual (user chooses the
>   endpoints), or automated (an algorithm chooses the end points).

It could be automated by an algorithm. A user might want to add or
remove some of them manually where changeset size is highly variable.

>   If an algorithm, which algorithm?   It seems to me that there are
>   several that are optimal according to various metrics.   Should
>   their placement be deterministic so that a client can know which
>   ones might be there based only the patch-level?   Or must clients
>   search for them?

Yes, there could be many possible algorithms, and I'm still
investigating which ones might be optimal. The one I tested yesterday
placed summaries at every 200KiB of gzipped revisions (about 10% of
total size); it turned out to have a bit more variability than I would
like, but it was still quite effective. Apparently it should also take
into account the number of changesets each summary spans. Placing one
every 20 revisions or so might also be fairly effective. And then, as I
mentioned above, summarize the summaries like a tree, up to a certain
depth.
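The 200KiB placement rule could be sketched like this; `place_summaries` and the `sizes` input are assumptions for illustration, not existing tla code:

```python
def place_summaries(sizes, threshold=200 * 1024):
    """Choose summary endpoints so each summary spans roughly `threshold`
    bytes of gzipped changesets.  sizes[i] is the gzipped size of
    changeset i + 1 (patch-0 being the import)."""
    endpoints, start, acc = [], 0, 0
    for i, size in enumerate(sizes, start=1):
        acc += size
        if acc >= threshold:
            endpoints.append((start, i))  # summary of patch-start..patch-i
            start, acc = i, 0
    return endpoints
```

For example, six changesets of 100KiB each yield endpoints (0, 2), (2, 4), (4, 6): one summary per 200KiB of compressed history.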

> * what strategies might a smart server want to employ to manage 
>   summary deltas?   Feedback from access patterns?  Any-one-on-demand?
>   Algorithmic placement based on the length of the ancestry chain?
>   or on the sizes of various changes and trees?

It would probably insert more summaries where revisions are accessed a
lot, and remove some where revisions are accessed rarely (although it
may actually be faster to build a summary on the spot in some cases if
someone wants a rare revision). I would suggest placing them in order to
minimize the time it takes to build revisions (dependent on size and
more strongly on chain length right now).

> * What search must clients perform to find summary deltas and to what
>   degree will the network costs of that search outweigh the benefits
>   of summary deltas?   For example, let's suppose that we have an
>   archive with:
>       patch-A         <-- most recent revision I have on-hand locally
>         ...
>         patch-N               <-- there's a summary-delta from A..N for this
>         ...
>       patch-X         <-- there's a cachedrev for this
>         ...
>         patch-Z               <-- this is the revision I want to build
>   Is it better to grab cachedrev X or summary-delta N?  Must my client
>   search from Z back to N looking for summary-deltas?  If I add a new
>   summary delta at patch-Y, and the server contains "hints" about
>   where to find summary deltas, what are the transactional issues
>   associated with keeping those hints up-to-date?  How large is the 
>   data associated with such hint-records?

The search for summaries is comparable to the search for cachedrevs; the
cost of listing them may be higher than finding cachedrevs if there are
several summaries whose target is the revision you're looking at. A
smart server could automatically return the "best" one.

If the archive is local, grab the cachedrev. Otherwise it's almost
certainly better to grab summary-delta N in this case, unless your
changesets are very large compared to total tree size (remember, one
cachedrev for tla's tree is roughly the size of 100 revisions).

No hints are necessary; summaries will be treated much like cachedrevs.

>   Aren't we going to just create _more_ ways to have "the problem"
>   with cachedrevs: that as an optimization they are sometimes
>   pessimising?

Sure, but I think the tradeoffs involved are somewhat more manageable and
flexible. Of course, flexibility could be a downside when you want
things to Just Work, but I don't think it's a huge problem here.

> * How should summary deltas be named, discovered by clients, and
>   downloaded?    Is the idea here to add new summary-delta-specific
>   protocol to the interface to archives?   New generic protocol that
>   can work for summary deltas and other things besides?
>   In designing new protocol, aside from concerns about "creeping
>   featurism" in the archive protocol, one has to worry about 
>   the new costs in server round-trips, and as well about what smart
>   servers will do and how does that come out client-side?

I think they can be very analogous to cachedrevs, with the difference
that there can be more than one of them for a given revision. I would
suggest the summaries be called something like
cat--branch--vsn--summary-A-B.tar.gz, where A and B are the revision
numbers (it would be nice if they could be only numbers, but
version{,fix}-* screws that up, so...). I don't think server roundtrips
are significantly increased, but I haven't examined that. I don't think
I'm qualified to comment on other protocol design issues.
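As a sketch of that naming scheme, assuming A and B are full patch levels (patch-N, version-N, versionfix-N, base-0) rather than bare numbers, for the version{,fix}-* reason above; the regex and helper are hypothetical:

```python
import re

# cat--branch--vsn--summary-A-B.tar.gz, with A and B as full patch levels.
SUMMARY_RE = re.compile(
    r'^(?P<pkg>.+--.+--[^-]+)--summary-'
    r'(?P<frm>(?:base|patch|version|versionfix)-\d+)-'
    r'(?P<to>(?:base|patch|version|versionfix)-\d+)\.tar\.gz$'
)

def parse_summary_name(name):
    """Split a summary file name into (package-version, from, to),
    or return None if the name isn't a summary delta."""
    m = SUMMARY_RE.match(name)
    if m is None:
        return None
    return m.group('pkg'), m.group('frm'), m.group('to')
```

For example, `tla--devo--1.1--summary-patch-12-patch-75.tar.gz` parses to `("tla--devo--1.1", "patch-12", "patch-75")`, while an ordinary cachedrev tarball is rejected.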

> Meanwhile, I think the mechanisms we currently have have the virtues
> of being simple, understandable, controllable, and tunable to
> near-optimal behavior:

For the case where you work a fair amount on a project, yes. Then just
mirroring the whole archive is fine. I think that even for a local
archive, though, summaries are a space-saving alternative to cachedrevs,
while only being a little slower. For your archive, just 8 summaries
totaling 900KiB (less than a single cachedrev) would let me build any
revision in less than 25 seconds, as opposed to as much as 90 seconds
without any summaries or cachedrevs. Doubling the density of summaries
(and summarizing the summaries) would halve this time, and use less
space than 2 cachedrevs.

For the case where you only access the archive remotely, summaries are a
bigger win, simply due to the reduced network traffic.

> Working on a big remote project?   Mirror the relevent parts of its
> archive, sans cached revisions, and do the rest locally with
> cachedrevs and rev-lib entries.    The bug queue has a feature request
> for better sparse-library support and a pending merge that fixes a bug
> in the cachedrev-ignoring-feature of the mirror command.
> Mirroring that way means that my network traffic with the server is
> reduced to a single instance of fetching each baseline changeset.
> That's about as close to optimal as you should need to get most of the
> time.

That's how you would mirror with summaries as well: only grab the
revision changesets.

> Summary deltas, I worry, are going to not really solve the problem
> (oh, it will work for some archives and access patterns, all right,
> but pessimise others).   They'll be hard to control usefully and
> slightly confusing.   They'll add plenty 'o new code to achieve that
> dubious end.

As you can see from my algorithm, the amount of extra code is not that
large. The tla code to place the summaries can be dumb (not much more
than revdelta + tar/gzip), as I think the endpoints should be calculated
by an external program.

I don't want to address your other concerns until I'm convinced myself,
but I think it's plausible that they can be addressed. Perhaps Miles
will have some comments also.

