[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gnu-arch-users] [forward] open letter to Linus

From: Tom Lord
Subject: [Gnu-arch-users] [forward] open letter to Linus
Date: Fri, 8 Apr 2005 20:44:33 -0700 (PDT)

  From: Tom Lord <address@hidden>
  To: address@hidden
  Subject: [GNU-arch-dev] open letter to Linus
  List-Archive: <>
  List-Help: <mailto:address@hidden>

I sent a letter to Linus about revision control issues and that letter
is beginning to be forwarded around a bit.  A few people from wildly
divergent places have expressed interest in it.  Therefore, I will
simplify things and make it an open letter.  Gotta love the ASCII-art,
at least, no?


Here is what I sent:

  Date: Fri, 8 Apr 2005 12:21:30 -0700 (PDT)
  From: Tom Lord <lord>
  To: address@hidden
  CC: address@hidden, address@hidden
  In-reply-to: <address@hidden> (message from Linus Torvalds on Thu, 7 Apr 2005 
08:54:27 -0700 (PDT))
  Subject: Re: speak of the devil...

You wrote:

   > what I like in an SCM is a good "fundamental idea". BK had
   > this. You could explain the basics of it's "world view" in a
   > fairly short time without ever actually talking about _what_ it
   > did. It's like UNIX - you can explain the "everything is a file"
   > notion together with the fork-based process model, and you
   > literally have a _philosophy_.


Hacker to hacker, here is some Arch philosophy, some how-best-to-use
Arch philosophy, and enough information about how Arch works that you
should be able to go off to a desert island and write your own
version, if so inclined.

All in a short (cough) note!

* Arch Atoms

  The atoms in the arch universe are *imports* and *changesets*.

  In other words, the basic things that programmers *do* from Arch's
  perspective all boil down to publishing imports and changesets and
  retrieving/using imports and changesets published by others.

  Arch --- and arguably any distributed, changeset-oriented revision
  control system --- acts like a big database in which a transcript is
  stored of all of the atoms (imports, changesets) published by Arch
  users.  Some subsystems of Arch are concerned with protecting and
  distributing this database; other subsystems implement higher-level
  queries; other subsystems build ancillary caches and memos to speed
  up access; finally, the subsystems which provide merging operations
  interpret the imports and changesets from the higher-level
  perspective of a software developer.

  This reductionist view of revision control (that everything boils
  down to these atoms) is a unifying principle.   Every part of
  Arch can be explained by reducing the explanation to these
  terms.  (Maybe this like starting to describe unix by saying
  "everything is a file".)

  From a low-level perspective, Arch is simply an application that 
  collects imports and changesets from users, publishes them, and 
  helps others retrieve those which have been published.  Imports and
  changesets are easily reified as simple files (tar bundles, patches)
  and so the low-level job of Arch -- managing these atoms -- looks a
  lot like more familiar forms of distributed, file-oriented
  databases: /NNTP/ or /DNS/, for example.   

* Atoms in a Space with a Coordinate System: Arch Names and Arch Archives

  Given that /Arch atoms/ are *imports* and *changesets*:

  /Arch space/ is a patchwork quilt of regions of space, each of
  which is an *Arch archive*.  Each region of space (archive) 
  contains a collection of Arch atoms.

  In practice, a fragment of Arch space is reified as an FTP site,
  HTTP site, or SFTP site.   Each fragment contains a set of named
  locations, each location containing an atomic import or changeset.

  The /Arch namespace/ is a standard coordinate system for Arch space.
  Each location where an import or changeset might be stored is
  assigned a *Fully Qualified Revision Name*.

  A user who publishes his own archive is defining the meaning of 
  arch names in the part of the arch namespace he has reserved.
  His physical archive contains the imports and changesets whose
  name he has assigned.

  Some example names:

    */fully qualified revision names/*

    *...for an import* address@hidden/tla--devo--1.3.2--base-0'

    *...for a changeset* address@hidden/tla--devo--1.3.2--patch-1'

  The detailed structure of these names is relatively unimportant
  here.   They comprise a five dimensional space of strings.  The
  internal structure of this five dimensional space is chosen for
  convenience: when high level revision control operations are
  specified in terms of manipulations of arch names, the definitions
  of those operations come out simple and regular.

** Arch is Relativistic

  Archives are fragments of Arch space and each contains a little map
  fragment assigning names to points in that space.   Each point 
  in space contains an import or changeset.

  Each user of arch edits his own *archive registry*: a specification
  of how to sew together a complete map of an Arch universe out of the
  individual fragments of space published by other Arch users.
  (Perhaps this is analogous to the traditional unix `/etc/hosts'

  Each user thus has a personal map of Arch space and the maps are not
  constrained to be identical.  Nevertheless, the maps of cooperating
  Arch users, while not identical, are logically consistent with one
  another.  This is one way in which Arch is "relativisitic".

  Each archive is a *write-once database*: a transcript, essentially,
  of what imports and changesets the owner of the archive has
  published there.  Imports and changesets are themselves passed
  around as ordinary files.  Logically, an archive behaves like an
  ascension-only filesystem.

  Consequently, it is cheap and easy to replicate arch archives with
  abandon and users do so regularly to optimize access.   I might
  publish an archive containing the GNU Arch mainline;  most users who
  are interested in my archive will immediately and then incrementally
  copy the HTTP site containing my archive to their local disk.

  The archive copying process is scheduled flexibly and in ad hoc
  ways: user's schedule it on their own.

  Therefore, in addition to each user having their own coordinate
  system for Arch space, each user has their own physical copies of
  many of the archives they use.   Since copying of data to these
  mirrors happens in ad hoc ways, different users may observe
  logically unrelated Arch events occur in different orders: another
  way in which Arch is relativistic.

  The relativisitic properties of Arch are again remeniscent of /NNTP/
  and /DNS/: each node has a differing view of an overall logically
  consistent universe of databases;  updates propogate throughout the
  network of nodes in asynchronos, non-deterministic ways.

* Illustration

  I publish an archive called address@hidden'.   It is my
  primary public archive.

  That archive contains imports and changesets and assigns names
  to each one.   As such, it is a fragment of Arch's space.

  The Arch command `abrowse' can draw a map of my fragment of Arch


      % tla abrowse --summary address@hidden


              tag of address@hidden/tla--myline--1.3--patch-2
              remove libawk/numbers.[ch]
              Remove "libawk/trim.[ch]"
              mostly fix bug `id-commands-confusing'
              [CORRECT patch-2] fix a compiler warning


              tag of address@hidden/tla--devo--1.3.1--patch-21
              add 'touch' command
              [CLOSE missing-status] whats-missing exit status options


  It takes a little effort to learn to read new kinds of maps and
  this document isn't designed to lead you through learning to read
  Arch maps.   Still, you can see here an outline-form map of a 4D
  subspace of the larger 5D namespace.  These kinds of maps are
  actually very useful when managing patch flow in the real world.

* From Atoms to Organisms -- Patch Flow

  Continuing the analogy: If imports and changesets are the atoms of
  the Arch universe, patch-flow circuits are the higher-order organisms
  which emerge out of the structured interactions of the atoms.

  There's a popular picture of how programmer's working on large busy
  projects trade changesets.   Here's an ascii-art version:

                              * ........... project mainline
                            /   \           (owned by the project
                           /     \            maintainer)
                         /  team   \
                       ^            ^
                      '              `
                     '                `
                    v                  v
                   *                    * .... remote supporting branches
                 /   \                /   \      (remote "lieutenants")
                /     \              /     \
               /remote \            /remote \
              / team A  \          / team B  \
              -----------          ------------

  The `*'-s mark published development lines, each maintained
  by it's own team.

  The structure is somehwat recursive/fractal: each team may break
  down similarly;  the depicted aggregate may be regarded as a single
  team in a larger graph.

  The structure may reflect ancillary software engineering practices:
  you could impose a testing or review requirement at one of the
  `*'-s, for example -- no patch flows along an arrow unless it passes
  the test/review barrier.

  The graph doesn't have to be a tree.  One might graph the situation
  among competing branches of Arch itself as this cyclic graph,
  containing some tree subgraphs:

                         \  A /
                          \  /
                            ^<..         /
                            '    `.   , branch
                            '      > {    C
                            '     .   `
                            v <..'      \
                          /  \
                        /   B  \

  To truly acknowledge the dirty truth of the matter, one must also 
  acknowledge back-door changeset-flow that Gets a Job Done but also
  violates the Platonic Ideal:

                         \  A /
                          \  /
                            ^<..         /
                            '    `.   , branch
                            '      > {    C
                            '     .   `
                             v<..'      \  ^
                           /\            ,-'
                          /  \       , -
                         branch    ,' (sneaky cherry-picking
                        /   B <--'    path -- messes up merging)

  If you were involved less with the kernel project but more with, for
  example, the production pipeline for an OEM OS release -- I'd draw
  similar-but-different graphs illustrating the patch-flow typical in
  that kind environment.  All the graphs come out the same: a forest
  of simple trees -- but with a few exceptions (non-tree-like edges),
  here and there, at interesting points on the map.

** Relating Patch Flow and Atoms

  What flows across the edges in those graphs is, in the Arch
  universe, Arch atoms: imports and changesets.  This leads
  to an insight that can turn our informal graphs into something very

* The Global History Graph

  Consider one of the simplest of all informal patch flow graphs; here
  is a picture of two programmers hacking:

                        \  A /
                         \  /
                         /  \
                       /   B  \

  If we make a transcript of the patch-flow between these two
  programmers, it might look like this:

        000: A imports tree X_a
        001: B accepts tree X_a making X_b
        002: A publishes a change against X_a making X_a1
        003: A publishes a change against X_a1 making X_a2
        004: B publishes a change against X_b making X_b1
        005: B accepts change X_a1 making X_b2
        006: B publishes a change making X_b3
        007: A merges all of B's recent changes, making X_a3

  We could construct a similar transcript for a more complicated patch
  flow graph.   Our restriction to just two programmers here only
  simplifies the examples.

  Such a transcript gives rise to a fairly obvious graph, using
  two flavors of edge to distinguish ancestry from merging:

    X_a --> X_b -->X_b1      /
     |              |        |
     v              v       /
    X_a1 *.......> X_b2    <    * (all three X_b* revisions)
     |              |       \   .
     v              v        |  .
    X_a2           X_b3      \ ,
     |                     ...'
     v        .............'
    X_a3 <...'

  That graph of the patch-flow transaction network has a natural
  isomorph in the form of a graph of Arch the atoms *imports* and


    [[X_a]] --> X_b -->X_b1      /
       |                |        |
       v                v       /
      X_a1 *.......>   X_b2    <    *
       |                |       \   .
       v                v        |  .
      X_a2             X_b3      \ ,
       |                      ,...'
       v        .............'
      X_a3 <...'

         [[X_a]] is an *import*

         all other nodes are *changesets*.
           The ORIG tree for each changeset
           is found by following the *solid* arrow
           backwards (`--->').   The other 
           changesets merged in are found by
           following the *dotted* arrows backwards

         /for example:/ `X_a3' is a changeset 
          recording the patch that turns the `X_a2'
          tree into the `X_a3' tree.

* Arch Topology Maps

  The final graph in the previous section can be viewed as a kind
  of topo-map of an Arch space.   The nodes (`X_a', `X_b2', etc.)
  are locations occupied by Arch atoms (the import `X_a', the others
  being changesets).

  There are two orthogonal kinds of gradient on these topo-maps: a
  2D space.   There is the *changeset ancestry* direction (indicated
  by solid arrows) and the *merge ancestry* direction (indicated by
  dotted arrows).

** Reifying Arch Topo-Maps

  Consider location `X_a3' in the graph above.  It is joined by two
  gradient arrows: one a changeset ancestry arrow, the other a merge
  ancestry arrow.

  The local position of `X_a3' on the map can be described in a 
  text file, using Arch names:

     revision: X_a3             # a name for this location
     ancestor: X_a2             # the changeset ancestor
     merges: X_b, X_b1, X_b2    # the merge ancestors

  When arch stores a new atom (publishes a changeset or import) it
  includes in the tree a text file that is more or less like that:
  it describes the local topology of Arch space in terms of changeset
  and merge ancestry.

  For efficiency and convenience, Arch *also* includes with each tree
  a (user-specified) collection of *other* topo-map fragments.  An
  arch client that wants to low the local lay-of-the-land around a 
  given tree can consult these text files.

** Accidental Hackable: Altering the Map

  Many key Arch operations (e.g., merging operations) consult the
  in-tree copy of the topo-map to decide what to do.

  The operations that do this are generally pretty clever, but the
  emphasis is on keeping them simple and regular -- hence predictable
  and controllable.   It is only within that constraint that
  perceived convenience becomes the absolute highest priority.

  Because the topo-map of imports and changesets used by some
  operations is stored in the tree being operated on, users can
  sometimes (quite usefully) modulate the behavior of these operations
  by altering the map.  ("Pretend this tree were over *there* instead
  of over *here* in the patch-flow graph; now perform merge *X*.").

* Smart Merging

  Consider the patch-flow diagrams earlier in this document.  They
  illustrate both tree-structured and arbitrary-digraph patch-flow.

  Changesets and imports flow along the edges of those graphs which
  implies that, within each node, merging operations take place.

  Patch-flow feedback exists within all of these graphs.  Consider one
  of the simpler ones, for example: 

                              * ........... project mainline
                            /   \           (owned by the project
                           /     \            maintainer)
                         /  team   \
                       ^            ^
                      '              `
                     '                `
                    v                  v
                   *                    * .... remote supporting branches
                 /   \                /   \      (remote "lieutenants")
                /     \              /     \
               /remote \            /remote \
              / team A  \          / team B  \
              -----------          ------------

  Team `B' sends changes to the in-house team which must merge those.
  The in-house team is generating changes of its own, many
  representing merges of work from teams `A' and `B';  team `B' will
  want to back-merge work from the in-house team.

  Merging in the face of patch-flow feedback is *hairy*.  In the fully
  general case of an arbitrarilly shaped patch-flow graph, there does
  not exist (given today's programming languages) a fully general
  solution.  Human intelligence is *required* to solve merging
  problems in the general case.   There is not enough information in
  today's program source texts to solve merging in the general case.

  Although the general-case problem of merging is hopelessly hairy,
  many special cases admit ample automation.

  Merging with feedback in a strictly tree structured patch-flow
  network is at the opposite extreme: even with today's programming
  languages and their syntaxes, nearly perfect merging can be *fully*
  automated.  (Merging still uses a text-line-based approximation but
  it is a very good approximation for most programming languages.)

  Merging in the face of feedback from cherry-picking varies in
  complexity depending on the nature of the changesets programmers
  choose to publish and cherrypick.  When programmers are able to
  concentrate on making clean, isolated changes cherrypicking too can
  be largely automated.  If programmers are undisciplined in their
  cherrypicking habits, merging devolves into a hopelessly manual

** Smart Merging Heuristics are Local Computations on Arch Topo Maps

  So, you have to trust me on this:  I can define (mathematically and
  in code) "smart" merge operations -- one for tree-shaped patch-flow;
  another for disciplined cherry-picking; a few more for other common
  pattners.  I can define these in terms operations on Arch atoms,
  referenced by Arch names.   Arch does indeed provide a rich
  collection of such merge operators.

  It turns out that, to do their work, these smart merge operators
  need to consult an Arch topo-map of the sort described above: they
  are sensitive to the graph gradients of changeset ancestry and merge

  It further turns out that these smart merge operators are mostly
  *locally* sensitive to the Arch topology:  they don't need a map of
  the whole world;  they only need maps of the immediate region around
  the trees and changes they are operating on.

  This is awefully convenient, in light of how topo maps are reified:
  Every changeset and import is accompanied by a little map fragment
  of its immediate position on the global topo-map.  Every tree has a
  user-controlled collection of topo-map fragments for adjacent
  regions.  The merge operators therefore have the maps they need at
  hand, as soon as they have the trees and changesets they are suppoed
  to operate on.

* Deploying Arch

  In general, deploying Arch means planning (or recognizing) the
  patch-flow graph to be supported; mapping that into a structure in
  Arch space; selecting all the tools needed and putting them into

  What would it mean to decide to use Arch for the kernel project?

  I think it would mean you and I have a little bit of hopefully
  pleasant work to do:

  /System Architecting/  It's basically a very subtle computing system
  archictecting job: designing a revision control infrastructure that
  will be anarchically and democratically administered and which must 
  support workflow among an already active, recently disrupted set of
  kernel hackers.

  The output of such a system architecting project is a mixture of
  installed systems -- the hub nodes of the infrastructure -- along
  with some documentation: e.g., a `HOWTO' saying how to make your own
  node if you want to prepare some kernel patches for submission;
  another `HOWTO' describing how to maintain a vendor branch if you
  are a Linux distributor; etc.

  The actually impelemented and deployed hub nodes would need to
  include, I guess, at minimum: workstation deployments for you and
  others at OSDL and other key locations;  a public server where your 
  Arch-published changes (and other formats derived from those)
  are made available;  client software custom-published for third
  parties that want to link-up to the Arch-based infrastructure;
  documentation for those third parties; and some labor dedicated to
  supporting all of that.

  Central to the design process would be understanding or imposing the
  particular patch-flow graph you want.  From that starting point, we
  could select components and prepare them for deployment at OSDL and

* implementation nutshell

  For curiosity sake:

  You get the idea that arch is an /nntp/ or /dns/-like distributed
  database of imports and changesets.

  I represent both imports and changesets as simple tar bundles of
  ordinary files.  Metadata is stored in ordinary text files using
  mindlessly simple formats.

  To produce an import, you need `tar' or something like it.

  To produce and apply a changeset, you need `diff' and `patch' but
  that work on whole trees and handle renames and so on.   I wrote
  `mkpatch' and `dopatch' which call `diff' and `patch' for low-level
  work but handle renames and so on themselves.   The way that renames
  are handled is designed to minimize the amount of "special action" 
  needed by the programmer -- you can usually just `mv' files in the 
  usual way and Arch will catch on.

  To manage an archive, i use a "dumb-fs" abstraction: you can put
  files, make directories, rename things, and delete things.  If one
  assumes that rename behaves as in Posix, a dumb-fs database for
  imports and changesets can have all the ACID properties of a good
  database.   (And, secondarilly, support for composite transactions
  in Arch -- committing to many distributed archives at once -- is on
  the way! :-)

  The "Arch topo-maps" are the key to smart merging.   I store them
  in text files, in trees -- with some redundent storage in archives
  to optimize access.   I include ample user-specified commentary
  fields on these maps so they double as "commit logs".

  Smart merging operators are too large a topic for this message but
  aren't that hard.   Tree-shaped patch-flow is a special case.  There
  are some other special cases.

  Raw Arch archives, being dumb filesystems containing tar bundles of
  imports and changesets, are optimized for *some* forms of access
  that come up but certainly not all.

  To obtain good performance, it is import to cache and memoize
  results computed from raw archive data, storing caches and memos
  persistently, on-disk, close to each client.

  This gives rise to a natural competition among arch hackers: who can 
  implement the best combination of caches and memos?

  My solution so far is brute force:  clients can maintain a memo
  of full-tree copies of each revision (each import tree and each
  changeset's TARGET tree).   Files unchanged between full copies are
  shared using hard-links.   The presense of these full trees, which
  includes the topo-map fragments they hold, speeds up merge
  operations amazingly and at a cost in disk space measurable in tens
  of dollars.

  Some other developers are currently working on fancier caches.
  There current results are mixed.

* Overall

  I hope I've conveyed that Arch is nothing more than the combination
  of a few simple, fundamental ideas.   It could be equally regarded
  as the best current option among free changeset oriented revision
  control systems /and/ a sane *framework* for future developments in
  revision control technology.


Gnu-arch-dev mailing list

reply via email to

[Prev in Thread] Current Thread [Next in Thread]