
Re: [Gnu-arch-users] arch lkml


From: Tom Lord
Subject: Re: [Gnu-arch-users] arch lkml
Date: Wed, 10 Dec 2003 14:20:55 -0800 (PST)


    Eric:

I'll reply with perhaps more than you wanted to know :-)

    >>> The one very obvious potential issue I see with arch as it
    >>> currently stands is that it does not use one of the more
    >>> sophisticated storage formats for storing deltas.  Which means
    >>> as your archive size increases the work can increase.  I think
    >>> with a different backend format cacherev would not be
    >>> necessary.  But I may be wrong.

I don't think that you are "wrong", per se -- just that you're worried
about solving imaginary/outdated problems.

The classic (so-called:) "sophisticated" formats (e.g., SCCS) come to
us from a distant time (by industry timescales).  They were designed
to solve a problem that arose in:

  (a) a non-networked environment
  (b) an environment in which disk-space was precious

and so consequently they:

  (1) [re: (a)] assume computation close to the repository for 
      every operation

  (2) [re: (b)] are aimed at balls-to-the-wall optimized compression
      of revisions rather than just a comparable "big-O" (loosely
      speaking) optimization

When those formats (or their more modern design-thinking copy-cats
like svn) are used in a more modern environment, that means:

  (i) [re: (1)] they need to introduce smart servers and thus create
      both a performance bottleneck and a single point of failure --
      obstacles that can be overcome only with ridiculous amounts of
      complexity (which of course creates new problems)

  (ii) [re: (2)] they create systems which: flop over dead upon a 
       media failure;  add absurd complexity (hence problems) to try
       to optimize "common case" access patterns;  aren't 
       easily extensible, and so forth.

Here is a better way to understand the arch format:

* separation of archive host and archive client

  We recognize that the general case is that these two hosts are
  different and that when performance becomes really interesting,
  it's going to be a matter of clients vastly outnumbering hosts.

  Therefore, we want to push as much work as possible to the client
  side.


* minimization of network bandwidth

  We recognize that the important case is of a developer who
  is active over a prolonged period of time in some project whose
  mainline archive is hosted remotely.   We also recognize that the
  internet is basically slow.

  Therefore, we want to minimize network bandwidth primarily, and
  latency secondarily, between our developer and his project.

  It is damn hard to beat compressed changesets for that purpose
  other than by increments.

  To be sure, arch's push-mirror could do better on latency issues --
  it should eventually work more like rsync and _maybe_ even ask for a
  little help from server-side computation.

  But in general, for our important-case developer, arch makes
  network performance about as close to irrelevant as it can be.
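
  To make that concrete, here's a toy sketch in Python -- nothing like
  arch's actual implementation, and the file contents are made up --
  of why compressed changesets are the right unit to send: the wire
  cost of an update is proportional to the compressed diff, not to
  the size of the tree.

      # Toy sketch: ship a compressed diff, not the whole new revision.
      import difflib, gzip

      old = ["line %d\n" % i for i in range(10000)]   # "previous revision"
      new = list(old)
      new[5000] = "line 5000, edited\n"               # one small change

      patch = "".join(difflib.unified_diff(old, new, "a/file", "b/file"))
      wire = gzip.compress(patch.encode())

      print("full revision:        %d bytes" % len("".join(new)))
      print("compressed changeset: %d bytes" % len(wire))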


* serendipitous archive-as-payloads

  So we're going to have long-lived client sites that pull updates
  from or push updates to some archive host.

  In fact, we'll have _many_ such clients for our most interesting
  hosts.

  One idea is that we should, archive-side, just cache the payloads 
  that we'll send to these clients -- so that when a client needs some
  "thing" we just find it on disk and stream it out.

  It turns out, then, that if instead of "caching" we simply "memoize"
  those payloads ("caching" can discard stuff -- "memoizing" holds on
  to it forever), the memo we create contains, in a _reasonably_
  compact form, all of the information we would have in an archive.

  So: the memo _is_ the archive.  Disregarding archive-cached
  revisions for the moment, an arch archive _is_ just a memo of the
  payloads that a client might ask for.   

  That's dirt-simple to implement, allows arch to use "dumb servers",
  and is the same "big-O" space complexity as the more sophisticated
  formats.  
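
  If it helps, here is the cache-vs-memo distinction as a few lines of
  Python.  This is a sketch of the idea only -- the names are
  hypothetical and tla's real archive layout is richer than a flat
  directory of files:

      import os

      MEMO_DIR = "archive-memo"    # servable by any dumb HTTP/FTP server

      def payload(name, build):
          """Return the bytes for `name`, memoizing on first use."""
          path = os.path.join(MEMO_DIR, name)
          if os.path.exists(path):         # already memoized: just stream it
              with open(path, "rb") as f:
                  return f.read()
          data = build()                   # e.g. produce a compressed changeset
          os.makedirs(MEMO_DIR, exist_ok=True)
          with open(path, "wb") as f:      # ...and never discard it
              f.write(data)
          return data

  A cache would be allowed to unlink entries under disk pressure;
  precisely because a memo never does, the memo directory _is_ a
  complete archive.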


* rely on client-side caches, but keep it simple, stupid

  Client-side, our interesting developer is going to have some 
  pretty interesting access patterns -- focused mainly on recent
  revisions, semi-random-access within those.

  Revision libraries provide that in, again, a dirt-simple way that
  has just about optimal performance characteristics.   The recent
  hacks for --hard-links, library paths, sparse libraries and so 
  forth just make that sweeter and sweeter:  our interesting developer
  can these days (using the head revision at least -- you'll see it in
  the next release) check out a tree just as fast as he can hard-link
  to it.   Whole tree comparisons take, in the expected case, the 
  time of `find' traversals of the two trees plus the time to diff the
  files that actually changed.

  In terms of _time_ complexity, this romps all over the old
  "sophisticated" formats and their modern descendants.

  In terms of _space_ complexity, it's awfully damn competitive with
  the "sophisticated" formats, especially when you multiply the
  difference between them by the price-per-byte of disk storage.
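
  A sketch of why that's cheap -- Python pseudocode of the idea, not
  tla's actual C code: a checkout from a revision library is one hard
  link per file, and two hard-linked trees can be compared by inode,
  so only files that actually changed ever get read:

      import filecmp, os

      def link_tree(src, dst):
          """'Check out' src into dst as fast as we can hard-link to it."""
          for root, dirs, files in os.walk(src):
              rel = os.path.relpath(root, src)
              os.makedirs(os.path.join(dst, rel), exist_ok=True)
              for name in files:
                  os.link(os.path.join(root, name),
                          os.path.join(dst, rel, name))

      def changed_files(a, b):
          """Yield files under `a` that differ from their counterpart in `b`."""
          for root, dirs, files in os.walk(a):
              rel = os.path.relpath(root, a)
              for name in files:
                  fa = os.path.join(root, name)
                  fb = os.path.join(b, rel, name)
                  if not os.path.exists(fb):
                      yield fa             # added relative to b
                  elif os.stat(fa).st_ino == os.stat(fb).st_ino:
                      continue             # same inode: provably identical
                  elif not filecmp.cmp(fa, fb, shallow=False):
                      yield fa             # really changed: worth diffing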



* cache-rev is for patzers

  Archive cached revisions pretty much don't matter to our
  interesting-case developer -- they mostly help the more casual
  users of arch archives.   They're a courtesy in that regard.
  If you're working with some project, make a local mirror, pass
  --no-cached to push-mirror, and make sure you set up your libraries 
  well.   Arch would _arguably_ be better off not even bothering with
  cached revs.

  I exaggerate slightly:  if your ongoing relationship to a project
  is that you want to keep current with the head revision but don't
  care about keeping the history of it locally (perhaps you are a
  "tester"), then cached-revs (and, later, cached-summary-deltas) are
  a win for you.   But these things really aren't of central
  importance to arch.


A big theme in the preceding is "dirt simple".  An example of what
that implies can be seen in recent discussions about "archive
signing".  There's a minor crisis going on in the free software world
because project hosts keep getting cracked and people want to be sure
that their source archives are not corrupted -- hence an interest in
signing.  Contrast recent discussions on the dev@ list for svn with
those on arch.  Here in arch-land we're ~30% done implementing signing
after just a few days -- I could (not officially, but ...) commit
to having it up and running by the end of the month.  Now read the
dev@ list for svn (and think through the performance and complexity
implications of their most rapid-route suggestion of a hook-based
answer).   Meanwhile, we're not worrying about "how to possibly do it"
over here -- but "how to do it in any of the several possible ways
while not screwing up the security concerns and while keeping a
long-term view of the architecture."
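
The dumb-server shape of the thing is roughly this -- a minimal
sketch, assuming gpg(1) and a default secret key are available, and
emphatically not the code we're landing, just the idea of detached
signatures over archive payloads:

    import subprocess

    def sign(path):
        # Writes an ASCII-armored detached signature to path + ".asc";
        # the signature is just another file a dumb server hands out.
        subprocess.run(["gpg", "--armor", "--detach-sign", path], check=True)

    def verify(path):
        # Raises CalledProcessError if the payload doesn't match.
        subprocess.run(["gpg", "--verify", path + ".asc", path], check=True)

Clients verify after fetching; no server-side hooks, no smart server
required.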


    > As I understand the literature recent work on version control has
    > used what is a variation on the gzip format for storing multiple
    > versions.  The idea is you compress the first file like normal.
    > But for the second and subsequent files you look back into your
    > archive (which you are simply appending to) and use previous text
    > for compression.  This makes both appends and random access fast
    > and in addition this happens to work for random binary files.

arch is not heavily optimized for binary files, that's true.  I've
yet to see any evidence that this matters all that much for software
development, given today's disk economics.
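
(For what it's worth, the scheme you describe is easy to try with
zlib's preset-dictionary support.  A quick sketch, with made-up
revision contents, compressing a new revision against the text of an
old one:)

    import zlib

    rev1 = b"".join(b"static int t_%d = %d;\n" % (i, i * 7)
                    for i in range(300))
    rev2 = rev1 + b"static int t_extra = 42;\n"   # a small edit

    plain = zlib.compress(rev2)              # no history used

    co = zlib.compressobj(zdict=rev1)        # rev1 as the dictionary
    delta = co.compress(rev2) + co.flush()

    do = zlib.decompressobj(zdict=rev1)      # reader needs rev1 too
    assert do.decompress(delta) == rev2

    print("%d bytes cold, %d bytes against rev1" % (len(plain), len(delta)))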

    > Thinking about the implications of this I wonder why no one has
    > built a general purpose archiver with a structure like that.  It
    > should give roughly the same compression ratio as tar | gzip while
    > still giving you the random access properties of zip.

    > So it looks like the very practical thing to do is to build a general
    > purpose, compressed, random access file archiver, which can be used
    > for backups or file archiving whichever makes most sense.  And then
    > come back and look at improving arch.

_Possibly_.  It's not obvious that it'll be worth it (nor that it
won't).  But if you are going to go that route, google around for Josh
MacDonald first.

-t




