
From: Tom Lord
Subject: Re: [Gnu-arch-users] Re: situations where cached revisions are not so good
Date: Sat, 27 Sep 2003 08:59:38 -0700 (PDT)

    > From: Jason McCarty <address@hidden>

    > Well, it's many hours later and I've done a little benchmarking. I don't
    > want to report my results until I reach a conclusion about a good course
    > of action, but initial tests are actually very promising. Summary deltas
    > taken every 200KiB in tla--devo--1.1 are on average half as big as the
    > cumulative changesets they span. CPU usage is much lower than applying
    > each revision individually, by close to two orders of magnitude (I
    > wonder if this is a bug in tla).

That's cool and thanks for exploring the issue.

There is some design work to do if these are to be implemented.

I've little doubt that summary deltas can, in principle, significantly
reduce network traffic and build_revision time compared to not having
tuned local caches and mirrors.  (It's helpful and nice to start to
see that quantified, though.)
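To make the potential savings concrete, here is a toy model in Python. All sizes are made up; the only measured input is Jason's observation that a summary delta is on average about half the size of the cumulative changesets it spans:

```python
# Toy model: bytes fetched to build revision N starting from revision A.
# Option 1: fetch every individual changeset A+1 .. N.
# Option 2: fetch one summary delta spanning A .. N.
changesets = [120, 80, 150, 60, 200]   # hypothetical changeset sizes in KiB

individual_fetch = sum(changesets)      # every changeset crosses the wire
summary_fetch = sum(changesets) // 2    # "half as big", per the benchmark

print(individual_fetch, summary_fetch)  # 610 vs. 305 KiB in this toy case
```

The CPU-side win would be similar in shape: one apply-changeset pass instead of one per revision.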

As practical matters:  

* can summary deltas "overlap" or must they not?

* is their placement on dumb servers entirely manual (the user chooses
  the endpoints), or automated (an algorithm chooses the endpoints)?

  If an algorithm, which algorithm?   It seems to me that there are
  several that are optimal according to various metrics.   Should
  their placement be deterministic so that a client can know which
  ones might be there based only on the patch-level?   Or must clients
  search for them?
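As one example of a deterministic scheme (purely hypothetical, not a proposal): place a summary delta ending at every multiple of some fixed span of patch levels. A client could then compute, from the patch level alone, which deltas might exist without any searching:

```python
def summary_delta_endpoints(patch_level, span=8):
    """Hypothetical placement rule: a summary delta ends at every
    multiple of `span` patch levels and spans back to the previous
    multiple.  Returns the candidate (start, end) pairs, most recent
    first, that a client could probe for deterministically."""
    endpoints = []
    end = (patch_level // span) * span
    while end > 0:
        endpoints.append((end - span, end))
        end -= span
    return endpoints

print(summary_delta_endpoints(27))  # [(16, 24), (8, 16), (0, 8)]
```

A smarter variant might use power-of-two spans (skip-list style) so that long histories need only O(log n) delta applications, but that multiplies the storage question.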

* what strategies might a smart server want to employ to manage 
  summary deltas?   Feedback from access patterns?  Any-one-on-demand?
  Algorithmic placement based on the length of the ancestry chain?
  or on the sizes of various changes and trees?

* What search must clients perform to find summary deltas and to what
  degree will the network costs of that search outweigh the benefits
  of summary deltas?   For example, let's suppose that we have an
  archive with:

        patch-A         <-- most recent revision I have on-hand locally
        patch-N         <-- there's a summary-delta from A..N for this
        patch-X         <-- there's a cachedrev for this
        patch-Z         <-- this is the revision I want to build

  Is it better to grab cachedrev X or summary-delta N?  Must my client
  search from Z back to N looking for summary-deltas?  If I add a new
  summary delta at patch-Y, and the server contains "hints" about
  where to find summary deltas, what are the transactional issues
  associated with keeping those hints up-to-date?  How large is the 
  data associated with such hint-records?

  Aren't we going to just create _more_ ways to have "the problem"
  with cachedrevs: that as an optimization they are sometimes
  a pessimization instead?
* How should summary deltas be named, discovered by clients, and
  downloaded?    Is the idea here to add new summary-delta-specific
  protocol to the interface to archives?   New generic protocol that
  can work for summary deltas and other things besides?

  In designing new protocol, aside from concerns about "creeping
  featurism" in the archive protocol, one has to worry about the
  new costs in server round-trips, as well as about what smart
  servers will do and how that comes out client-side.

Overall, I don't want to rule the idea out -- far from it.   I think
there might be good answers to the questions raised above but I'd like
to see them before leaping into this.

Meanwhile, I think the mechanisms we currently have have the virtues
of being simple, understandable, controllable, and tunable to
near-optimal behavior:

Working on a big remote project?   Mirror the relevant parts of its
archive, sans cached revisions, and do the rest locally with
cachedrevs and rev-lib entries.    The bug queue has a feature request
for better sparse-library support and a pending merge that fixes a bug
in the cachedrev-ignoring-feature of the mirror command.

Mirroring that way means that my network traffic with the server is
reduced to a single instance of fetching each baseline changeset.
That's about as close to optimal as you should need to get most of the
time.

Summary deltas, I worry, are not really going to solve the problem
(oh, they'll work for some archives and access patterns, all right,
but pessimise others).   They'll be hard to control usefully and
slightly confusing.   They'll add plenty o' new code to achieve that
dubious end.

