[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Gnu-arch-users] [forward] open letter to Linus
From: |
Tom Lord |
Subject: |
[Gnu-arch-users] [forward] open letter to Linus |
Date: |
Fri, 8 Apr 2005 20:44:33 -0700 (PDT) |
From: Tom Lord <address@hidden>
To: address@hidden
Subject: [GNU-arch-dev] open letter to Linus
List-Archive: <http://lists.seyza.com/pipermail/gnu-arch-dev>
List-Help: <mailto:address@hidden>
List-Subscribe:
<http://lists.seyza.com/cgi-bin/mailman/listinfo/gnu-arch-dev>,
<mailto:address@hidden>
I sent a letter to Linus about revision control issues and that letter
is beginning to be forwarded around a bit. A few people from wildly
divergent places have expressed interest in it. Therefore, I will
simplify things and make it an open letter. Gotta love the ASCII-art,
at least, no?
-t
Here is what I sent:
Date: Fri, 8 Apr 2005 12:21:30 -0700 (PDT)
From: Tom Lord <lord>
To: address@hidden
CC: address@hidden, address@hidden
In-reply-to: <address@hidden> (message from Linus Torvalds on Thu, 7 Apr 2005
08:54:27 -0700 (PDT))
Subject: Re: speak of the devil...
You wrote:
> what I like in an SCM is a good "fundamental idea". BK had
> this. You could explain the basics of it's "world view" in a
> fairly short time without ever actually talking about _what_ it
> did. It's like UNIX - you can explain the "everything is a file"
> notion together with the fork-based process model, and you
> literally have a _philosophy_.
so:
Hacker to hacker, here is some Arch philosophy, some how-best-to-use
Arch philosophy, and enough information about how Arch works that you
should be able to go off to a desert island and write your own
version, if so inclined.
All in a short (cough) note!
* Arch Atoms
The atoms in the arch universe are *imports* and *changesets*.
In other words, the basic things that programmers *do* from Arch's
perspective all boil down to publishing imports and changesets and
retrieving/using imports and changesets published by others.
Arch --- and arguably any distributed, changeset-oriented revision
control system --- acts like a big database in which a transcript is
stored of all of the atoms (imports, changesets) published by Arch
users. Some subsystems of Arch are concerned with protecting and
distributing this database; other subsystems implement higher-level
queries; other subsystems build ancillary caches and memos to speed
up access; finally, the subsystems which provide merging operations
interpret the imports and changesets from the higher-level
perspective of a software developer.
This reductionist view of revision control (that everything boils
down to these atoms) is a unifying principle. Every part of
Arch can be explained by reducing the explanation to these
terms. (Maybe this like starting to describe unix by saying
"everything is a file".)
From a low-level perspective, Arch is simply an application that
collects imports and changesets from users, publishes them, and
helps others retrieve those which have been published. Imports and
changesets are easily reified as simple files (tar bundles, patches)
and so the low-level job of Arch -- managing these atoms -- looks a
lot like more familiar forms of distributed, file-oriented
databases: /NNTP/ or /DNS/, for example.
* Atoms in a Space with a Coordinate System: Arch Names and Arch Archives
Given that /Arch atoms/ are *imports* and *changesets*:
/Arch space/ is a patchwork quilt of regions of space, each of
which is an *Arch archive*. Each region of space (archive)
contains a collection of Arch atoms.
In practice, a fragment of Arch space is reified as an FTP site,
HTTP site, or SFTP site. Each fragment contains a set of named
locations, each location containing an atomic import or changeset.
The /Arch namespace/ is a standard coordinate system for Arch space.
Each location where an import or changeset might be stored is
assigned a *Fully Qualified Revision Name*.
A user who publishes his own archive is defining the meaning of
arch names in the part of the arch namespace he has reserved.
His physical archive contains the imports and changesets whose
name he has assigned.
Some example names:
[[blockquote
*/fully qualified revision names/*
*...for an import* address@hidden/tla--devo--1.3.2--base-0'
*...for a changeset* address@hidden/tla--devo--1.3.2--patch-1'
]]
The detailed structure of these names is relatively unimportant
here. They comprise a five dimensional space of strings. The
internal structure of this five dimensional space is chosen for
convenience: when high level revision control operations are
specified in terms of manipulations of arch names, the definitions
of those operations come out simple and regular.
** Arch is Relativistic
Archives are fragments of Arch space and each contains a little map
fragment assigning names to points in that space. Each point
in space contains an import or changeset.
Each user of arch edits his own *archive registry*: a specification
of how to sew together a complete map of an Arch universe out of the
individual fragments of space published by other Arch users.
(Perhaps this is analogous to the traditional unix `/etc/hosts'
file.)
Each user thus has a personal map of Arch space and the maps are not
constrained to be identical. Nevertheless, the maps of cooperating
Arch users, while not identical, are logically consistent with one
another. This is one way in which Arch is "relativisitic".
Each archive is a *write-once database*: a transcript, essentially,
of what imports and changesets the owner of the archive has
published there. Imports and changesets are themselves passed
around as ordinary files. Logically, an archive behaves like an
ascension-only filesystem.
Consequently, it is cheap and easy to replicate arch archives with
abandon and users do so regularly to optimize access. I might
publish an archive containing the GNU Arch mainline; most users who
are interested in my archive will immediately and then incrementally
copy the HTTP site containing my archive to their local disk.
The archive copying process is scheduled flexibly and in ad hoc
ways: user's schedule it on their own.
Therefore, in addition to each user having their own coordinate
system for Arch space, each user has their own physical copies of
many of the archives they use. Since copying of data to these
mirrors happens in ad hoc ways, different users may observe
logically unrelated Arch events occur in different orders: another
way in which Arch is relativistic.
The relativisitic properties of Arch are again remeniscent of /NNTP/
and /DNS/: each node has a differing view of an overall logically
consistent universe of databases; updates propogate throughout the
network of nodes in asynchronos, non-deterministic ways.
* Illustration
I publish an archive called address@hidden'. It is my
primary public archive.
That archive contains imports and changesets and assigns names
to each one. As such, it is a fragment of Arch's space.
The Arch command `abrowse' can draw a map of my fragment of Arch
space:
[[tty
% tla abrowse --summary address@hidden
[...]
tla
tla--devo
tla--devo--1.3.1
base-0
tag of address@hidden/tla--myline--1.3--patch-2
patch-1
remove libawk/numbers.[ch]
patch-2
Remove "libawk/trim.[ch]"
patch-3
mostly fix bug `id-commands-confusing'
patch-4
[CORRECT patch-2] fix a compiler warning
[...]
tla--devo--1.3.2
base-0
tag of address@hidden/tla--devo--1.3.1--patch-21
patch-1
add 'touch' command
patch-2
[CLOSE missing-status] whats-missing exit status options
[...]
]]
It takes a little effort to learn to read new kinds of maps and
this document isn't designed to lead you through learning to read
Arch maps. Still, you can see here an outline-form map of a 4D
subspace of the larger 5D namespace. These kinds of maps are
actually very useful when managing patch flow in the real world.
* From Atoms to Organisms -- Patch Flow
Continuing the analogy: If imports and changesets are the atoms of
the Arch universe, patch-flow circuits are the higher-order organisms
which emerge out of the structured interactions of the atoms.
There's a popular picture of how programmer's working on large busy
projects trade changesets. Here's an ascii-art version:
[[tty
* ........... project mainline
/ \ (owned by the project
/ \ maintainer)
in-house
/ team \
-------------
^ ^
' `
' `
v v
* * .... remote supporting branches
/ \ / \ (remote "lieutenants")
/ \ / \
/remote \ /remote \
/ team A \ / team B \
----------- ------------
]]
The `*'-s mark published development lines, each maintained
by it's own team.
The structure is somehwat recursive/fractal: each team may break
down similarly; the depicted aggregate may be regarded as a single
team in a larger graph.
The structure may reflect ancillary software engineering practices:
you could impose a testing or review requirement at one of the
`*'-s, for example -- no patch flows along an arrow unless it passes
the test/review barrier.
The graph doesn't have to be a tree. One might graph the situation
among competing branches of Arch itself as this cyclic graph,
containing some tree subgraphs:
[[tty
\branch/
\ A /
\ /
\/
^<.. /
' `. , branch
' > { C
' . `
v <..' \
/\
/ \
branch
/ B \
]]
To truly acknowledge the dirty truth of the matter, one must also
acknowledge back-door changeset-flow that Gets a Job Done but also
violates the Platonic Ideal:
[[tty
\branch/
\ A /
\ /
\/
^<.. /
' `. , branch
' > { C
' . `
v<..' \ ^
/\ ,-'
/ \ , -
branch ,' (sneaky cherry-picking
/ B <--' path -- messes up merging)
]]
If you were involved less with the kernel project but more with, for
example, the production pipeline for an OEM OS release -- I'd draw
similar-but-different graphs illustrating the patch-flow typical in
that kind environment. All the graphs come out the same: a forest
of simple trees -- but with a few exceptions (non-tree-like edges),
here and there, at interesting points on the map.
** Relating Patch Flow and Atoms
What flows across the edges in those graphs is, in the Arch
universe, Arch atoms: imports and changesets. This leads
to an insight that can turn our informal graphs into something very
precise:
* The Global History Graph
Consider one of the simplest of all informal patch flow graphs; here
is a picture of two programmers hacking:
[[tty
\hacker/
\ A /
\ /
\/
^
.
.
v
/\
/ \
hacker
/ B \
]]
If we make a transcript of the patch-flow between these two
programmers, it might look like this:
[[tty
000: A imports tree X_a
001: B accepts tree X_a making X_b
002: A publishes a change against X_a making X_a1
003: A publishes a change against X_a1 making X_a2
004: B publishes a change against X_b making X_b1
005: B accepts change X_a1 making X_b2
006: B publishes a change making X_b3
007: A merges all of B's recent changes, making X_a3
]]
We could construct a similar transcript for a more complicated patch
flow graph. Our restriction to just two programmers here only
simplifies the examples.
Such a transcript gives rise to a fairly obvious graph, using
two flavors of edge to distinguish ancestry from merging:
[[tty
X_a --> X_b -->X_b1 /
| | |
v v /
X_a1 *.......> X_b2 < * (all three X_b* revisions)
| | \ .
v v | .
X_a2 X_b3 \ ,
| ...'
v .............'
X_a3 <...'
]]
That graph of the patch-flow transaction network has a natural
isomorph in the form of a graph of Arch the atoms *imports* and
*changesets*:
[[tty
[[X_a]] --> X_b -->X_b1 /
| | |
v v /
X_a1 *.......> X_b2 < *
| | \ .
v v | .
X_a2 X_b3 \ ,
| ,...'
v .............'
X_a3 <...'
Key:
[[X_a]] is an *import*
all other nodes are *changesets*.
The ORIG tree for each changeset
is found by following the *solid* arrow
backwards (`--->'). The other
changesets merged in are found by
following the *dotted* arrows backwards
(`*...>')
/for example:/ `X_a3' is a changeset
recording the patch that turns the `X_a2'
tree into the `X_a3' tree.
]]
* Arch Topology Maps
The final graph in the previous section can be viewed as a kind
of topo-map of an Arch space. The nodes (`X_a', `X_b2', etc.)
are locations occupied by Arch atoms (the import `X_a', the others
being changesets).
There are two orthogonal kinds of gradient on these topo-maps: a
2D space. There is the *changeset ancestry* direction (indicated
by solid arrows) and the *merge ancestry* direction (indicated by
dotted arrows).
** Reifying Arch Topo-Maps
Consider location `X_a3' in the graph above. It is joined by two
gradient arrows: one a changeset ancestry arrow, the other a merge
ancestry arrow.
The local position of `X_a3' on the map can be described in a
text file, using Arch names:
[[tty
revision: X_a3 # a name for this location
ancestor: X_a2 # the changeset ancestor
merges: X_b, X_b1, X_b2 # the merge ancestors
X_b3
]]
When arch stores a new atom (publishes a changeset or import) it
includes in the tree a text file that is more or less like that:
it describes the local topology of Arch space in terms of changeset
and merge ancestry.
For efficiency and convenience, Arch *also* includes with each tree
a (user-specified) collection of *other* topo-map fragments. An
arch client that wants to low the local lay-of-the-land around a
given tree can consult these text files.
** Accidental Hackable: Altering the Map
Many key Arch operations (e.g., merging operations) consult the
in-tree copy of the topo-map to decide what to do.
The operations that do this are generally pretty clever, but the
emphasis is on keeping them simple and regular -- hence predictable
and controllable. It is only within that constraint that
perceived convenience becomes the absolute highest priority.
Because the topo-map of imports and changesets used by some
operations is stored in the tree being operated on, users can
sometimes (quite usefully) modulate the behavior of these operations
by altering the map. ("Pretend this tree were over *there* instead
of over *here* in the patch-flow graph; now perform merge *X*.").
* Smart Merging
Consider the patch-flow diagrams earlier in this document. They
illustrate both tree-structured and arbitrary-digraph patch-flow.
Changesets and imports flow along the edges of those graphs which
implies that, within each node, merging operations take place.
Patch-flow feedback exists within all of these graphs. Consider one
of the simpler ones, for example:
[[tty
* ........... project mainline
/ \ (owned by the project
/ \ maintainer)
in-house
/ team \
-------------
^ ^
' `
' `
v v
* * .... remote supporting branches
/ \ / \ (remote "lieutenants")
/ \ / \
/remote \ /remote \
/ team A \ / team B \
----------- ------------
]]
Team `B' sends changes to the in-house team which must merge those.
The in-house team is generating changes of its own, many
representing merges of work from teams `A' and `B'; team `B' will
want to back-merge work from the in-house team.
Merging in the face of patch-flow feedback is *hairy*. In the fully
general case of an arbitrarilly shaped patch-flow graph, there does
not exist (given today's programming languages) a fully general
solution. Human intelligence is *required* to solve merging
problems in the general case. There is not enough information in
today's program source texts to solve merging in the general case.
Although the general-case problem of merging is hopelessly hairy,
many special cases admit ample automation.
Merging with feedback in a strictly tree structured patch-flow
network is at the opposite extreme: even with today's programming
languages and their syntaxes, nearly perfect merging can be *fully*
automated. (Merging still uses a text-line-based approximation but
it is a very good approximation for most programming languages.)
Merging in the face of feedback from cherry-picking varies in
complexity depending on the nature of the changesets programmers
choose to publish and cherrypick. When programmers are able to
concentrate on making clean, isolated changes cherrypicking too can
be largely automated. If programmers are undisciplined in their
cherrypicking habits, merging devolves into a hopelessly manual
process.
** Smart Merging Heuristics are Local Computations on Arch Topo Maps
So, you have to trust me on this: I can define (mathematically and
in code) "smart" merge operations -- one for tree-shaped patch-flow;
another for disciplined cherry-picking; a few more for other common
pattners. I can define these in terms operations on Arch atoms,
referenced by Arch names. Arch does indeed provide a rich
collection of such merge operators.
It turns out that, to do their work, these smart merge operators
need to consult an Arch topo-map of the sort described above: they
are sensitive to the graph gradients of changeset ancestry and merge
ancestry.
It further turns out that these smart merge operators are mostly
*locally* sensitive to the Arch topology: they don't need a map of
the whole world; they only need maps of the immediate region around
the trees and changes they are operating on.
This is awefully convenient, in light of how topo maps are reified:
Every changeset and import is accompanied by a little map fragment
of its immediate position on the global topo-map. Every tree has a
user-controlled collection of topo-map fragments for adjacent
regions. The merge operators therefore have the maps they need at
hand, as soon as they have the trees and changesets they are suppoed
to operate on.
* Deploying Arch
In general, deploying Arch means planning (or recognizing) the
patch-flow graph to be supported; mapping that into a structure in
Arch space; selecting all the tools needed and putting them into
action.
What would it mean to decide to use Arch for the kernel project?
I think it would mean you and I have a little bit of hopefully
pleasant work to do:
/System Architecting/ It's basically a very subtle computing system
archictecting job: designing a revision control infrastructure that
will be anarchically and democratically administered and which must
support workflow among an already active, recently disrupted set of
kernel hackers.
The output of such a system architecting project is a mixture of
installed systems -- the hub nodes of the infrastructure -- along
with some documentation: e.g., a `HOWTO' saying how to make your own
node if you want to prepare some kernel patches for submission;
another `HOWTO' describing how to maintain a vendor branch if you
are a Linux distributor; etc.
The actually impelemented and deployed hub nodes would need to
include, I guess, at minimum: workstation deployments for you and
others at OSDL and other key locations; a public server where your
Arch-published changes (and other formats derived from those)
are made available; client software custom-published for third
parties that want to link-up to the Arch-based infrastructure;
documentation for those third parties; and some labor dedicated to
supporting all of that.
Central to the design process would be understanding or imposing the
particular patch-flow graph you want. From that starting point, we
could select components and prepare them for deployment at OSDL and
elsewhere.
* implementation nutshell
For curiosity sake:
You get the idea that arch is an /nntp/ or /dns/-like distributed
database of imports and changesets.
I represent both imports and changesets as simple tar bundles of
ordinary files. Metadata is stored in ordinary text files using
mindlessly simple formats.
To produce an import, you need `tar' or something like it.
To produce and apply a changeset, you need `diff' and `patch' but
that work on whole trees and handle renames and so on. I wrote
`mkpatch' and `dopatch' which call `diff' and `patch' for low-level
work but handle renames and so on themselves. The way that renames
are handled is designed to minimize the amount of "special action"
needed by the programmer -- you can usually just `mv' files in the
usual way and Arch will catch on.
To manage an archive, i use a "dumb-fs" abstraction: you can put
files, make directories, rename things, and delete things. If one
assumes that rename behaves as in Posix, a dumb-fs database for
imports and changesets can have all the ACID properties of a good
database. (And, secondarilly, support for composite transactions
in Arch -- committing to many distributed archives at once -- is on
the way! :-)
The "Arch topo-maps" are the key to smart merging. I store them
in text files, in trees -- with some redundent storage in archives
to optimize access. I include ample user-specified commentary
fields on these maps so they double as "commit logs".
Smart merging operators are too large a topic for this message but
aren't that hard. Tree-shaped patch-flow is a special case. There
are some other special cases.
Raw Arch archives, being dumb filesystems containing tar bundles of
imports and changesets, are optimized for *some* forms of access
that come up but certainly not all.
To obtain good performance, it is import to cache and memoize
results computed from raw archive data, storing caches and memos
persistently, on-disk, close to each client.
This gives rise to a natural competition among arch hackers: who can
implement the best combination of caches and memos?
My solution so far is brute force: clients can maintain a memo
of full-tree copies of each revision (each import tree and each
changeset's TARGET tree). Files unchanged between full copies are
shared using hard-links. The presense of these full trees, which
includes the topo-map fragments they hold, speeds up merge
operations amazingly and at a cost in disk space measurable in tens
of dollars.
Some other developers are currently working on fancier caches.
There current results are mixed.
* Overall
I hope I've conveyed that Arch is nothing more than the combination
of a few simple, fundamental ideas. It could be equally regarded
as the best current option among free changeset oriented revision
control systems /and/ a sane *framework* for future developments in
revision control technology.
-t
_______________________________________________
Gnu-arch-dev mailing list
address@hidden
http://lists.seyza.com/cgi-bin/mailman/listinfo/gnu-arch-dev
- [Gnu-arch-users] [forward] open letter to Linus,
Tom Lord <=