Re: [Monotone-devel] arch web pages
23 Aug 2003 13:05:10 -0400
Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
Tom Tromey <address@hidden> writes:
> I looked at the arch web pages a little. There are a couple
> interesting things there.
> This is a requirements list for a future gcc revision control system.
> It is probably pretty close to gcc consensus (if that exists).
ok. I can respond to these point-by-point, though I probably ought to
put up a page with this as a canned answer too:
- data integrity guarantees: in design, monotone is better than CVS
in this regard (transactional, distributed). in practice it is young
so will have some unexpected bugs. we're moving to self-hosting soon,
which will likely help shake things out.
- portability: monotone is probably not as portable as CVS, since it
is written in C++ and uses some modern features, but should be close
to "as portable" as g++ and boost, which covers a fair number of
modern platforms.
- end-to-end checksumming: monotone uses strong hashes for identifying
everything; you can't get much more checksummy than monotone.
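as a toy sketch (my own illustration, not monotone's actual code): in a
content-addressed system a file version's identity *is* the hash of its
contents, so any corruption in storage or transit changes the identity
and gets noticed:

```python
import hashlib

def file_id(data: bytes) -> str:
    """Identity of a file version = hex SHA1 of its raw contents."""
    return hashlib.sha1(data).hexdigest()

original = b"int main() { return 0; }\n"
ident = file_id(original)

# a single changed byte yields a different identity, so tampering is
# caught the moment anyone re-checks the hash
tampered = b"int main() { return 1; }\n"
assert file_id(tampered) != ident
```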
- anonymous read-only access by UID with only read-only privs: doable.
"repositories" don't really exist much -- it's a distributed system
after all -- but both news servers and depots can be set up about as
read-only as you can imagine. even if someone does "write" where
they're not allowed, it does no damage to data integrity (which is
at the ends of the network, in RSA cert validation).
- remote write operations use strong crypto: there is no such thing as
a "remote write operation" in monotone, generally. but everything
that touches your database -- even local operations -- uses strong
crypto. integrity is as strong as SHA1-derived RSA signatures;
authority is distributed and client-evaluated.
- data cannot be modified by unprivileged users without using the VC
system: well, it's a file, so you can twiddle its bits. but bit
twiddling will very likely be noticed by the endless hashing and
signature checking.
- must be at least as fast as CVS: depends on the operation. I'm
within an order of magnitude of local RCS when reconstructing file
versions; remote file access doesn't happen in monotone so there's
no other comparable algorithm to benchmark against. I expect to
close the RCS gap a bit more but it certainly "feels" pretty snappy,
since nearly every operation is local.
- efficient network protocol: all the networky stuff does
transmissions of size proportional to the deltas. the NNTP
transmission system is currently lockstep rather than pipelined, but
the HTTP depot stuff is effectively pipelined (one request + long
streaming send, each way)
- efficient tags and branches: a tag or branch-making command involves
adding a single fixed-size cert to your database. it's nearly
instant. transmitting it to another machine means transmitting a
few hundred bytes (cryptographic data after all).
- efficient delta storage: delta storage is currently done with
bring-to-front, so the retrieval time mirrors your access patterns
(say if you're working on a branch, those branch tips will move to
the front of the delta store, at a cost of possibly-redundant copies
of similar heads). but the storage system is totally decoupled from
all other metadata about ancestry or versions, so you can play with
the storage algorithm to suit your needs. so long as it can produce
a version with the right SHA1, it doesn't matter how.
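to make that last point concrete, here's a sketch (hypothetical delta
format, nothing like monotone's actual storage layer): the storage side
can use whatever delta scheme it wants, because correctness is judged
only by whether the reconstructed version hashes to the expected SHA1:

```python
import hashlib

def reconstruct(base: bytes, deltas, expected_sha1: str) -> bytes:
    """Apply a chain of (offset, length, replacement) edits, then verify
    the result against the version's known SHA1."""
    data = base
    for offset, length, replacement in deltas:
        data = data[:offset] + replacement + data[offset + length:]
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        raise ValueError("storage produced the wrong version")
    return data

base = b"hello world\n"
target = b"hello monotone\n"
deltas = [(6, 5, b"monotone")]  # replace "world" with "monotone"
result = reconstruct(base, deltas, hashlib.sha1(target).hexdigest())
```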
- efficient method of extracting a logical change "after the fact":
yes. build any 2 manifests, take their setwise difference, fetch all
the deltas between files which changed in the manifest
difference. this is the standard way of computing every delta in
monotone.
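a minimal sketch of that set-difference step, assuming (for
illustration only) that a manifest is just a map from file path to the
SHA1 of that file's contents:

```python
def manifest_diff(old: dict, new: dict):
    """Return (added, removed, changed) paths between two manifests,
    where each manifest maps path -> content hash."""
    added   = {p for p in new if p not in old}
    removed = {p for p in old if p not in new}
    changed = {p for p in old.keys() & new.keys() if old[p] != new[p]}
    return added, removed, changed

m1 = {"src/main.c": "aa11", "README": "bb22"}
m2 = {"src/main.c": "cc33", "NEWS": "dd44"}
added, removed, changed = manifest_diff(m1, m2)
# added == {"NEWS"}, removed == {"README"}, changed == {"src/main.c"}
```

fetching the deltas for the "changed" set is then all the work there is.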
- atomic application of logical change, one changelog msg: yes.
- atomic backout: not yet, but if I add it, sure. all it involves is
deleting the newly-committed manifest cert describing the state as a
descendant of its parent. I haven't written a 'backout' command yet
but there's no design reason not to have one. note, though, that
such a thing wouldn't back out a change from *other people's
databases*, if you've transmitted the change already, since the
system is distributed. I'm considering adding a "nullify" cert to
indicate "dumb mistake" nodes you wish to backout.
- renames: yes, though in an interesting sense. files don't have
permanent "inode"-like identities that last past their current
version. identification of files is done either by pathname
identification, or SHA1 identification, or explicit certs tying one
file to another. renaming is only really relevant when exploring
history to see what committers were intending; mechanical operations
like checkouts or updates don't really care whether the new file
version is a "renaming" or "creation", so long as it has the right
SHA1. when (or "if") monotone or the user fails to notice or
register a change as a rename, it just safely degrades to a
delete+add pair, and full file data is transmitted rather than a
delta, which isn't deadly.
- when merging branch A->B, remember last mergepoint and start from
there. yes, definitely, this is always how "monotone merge" works.
- single-delta merge: this is also called cherrypicking. I don't have
a "great" way of doing this now, but it's not outside the range of
things it's relatively easy to implement. in the worst case you can
diff the two tree states and pipe that to patch. it's not currently
as easy as saying "add patch 33, remove patch 45" though.
- perform conflict resolution by formation of microbranches:
yes. monotone makes no distinction between forks, conflicts and
branches, save that branches have *names* and are supposed to stay
forked, whereas forks and conflicts are intended to eventually merge
back together.
- should allow different users to generate patches vs. apply them, and
still smoothly function when the author updates: yes. using SHA1
values means a file is identified by contents; doesn't matter where
the contents came from. in fact, the model in monotone is even
stronger: the unprivileged person generates the patch, and the
"privileged" person (read: important, trusted) just generates a
cert which rubber-stamps the patch. then it is automatically applied
by anyone who trusts that rubber stamp. no "double-committing" stuff.
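a toy model of that client-evaluated trust (names are made up, and
real monotone uses RSA signatures rather than a bare key id): a cert
just asserts "signer S says manifest M is good", and each client
decides locally which signers to believe, so no double-commit happens:

```python
from typing import NamedTuple

class Cert(NamedTuple):
    manifest_sha1: str
    key_id: str  # who certified it (an RSA key in real monotone)

def approved(manifest_sha1: str, certs, trusted_keys) -> bool:
    """A manifest counts as approved if any key we trust certified it."""
    return any(c.manifest_sha1 == manifest_sha1 and c.key_id in trusted_keys
               for c in certs)

certs = [Cert("abc123", "contributor@example"),
         Cert("abc123", "maintainer@example")]
assert approved("abc123", certs, {"maintainer@example"})
assert not approved("abc123", certs, {"someone-else@example"})
```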
- efficient on-disk representation: I'm benchmarking against the GCC
repository. currently I am more space-efficient than CVS in delta
storage, but not as space-efficient overall -- more like 4 times the
size. however, most of that is relations between very verbose and
incompressible cryptographic metadata; I think I can make it much
smaller.
in any case, a major advantage monotone has in this area is that you
can work off of arbitrary subsets of a database without getting it
upset -- say "trim all but the last 2 years, and delete everything
related to ada or fortran" -- and carry that around with you on
disk. or say "aggregate pre-2.95 versions into large, sparse,
per-release deltas", and work off that. versions are just SHA1
codes, and the database is relational; there is no need for
continuity or completeness in each database.
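a sketch of why subsetting is harmless (my own illustration): since
every version is keyed purely by its SHA1, a database is just a
partial map from hash to data, and any subset of it is still
internally valid -- trimmed entries simply read as "not stored here":

```python
import hashlib

def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

full = {sha1(v): v for v in (b"v1", b"v2", b"v3")}
# carry around only a recent subset:
subset = {h: v for h, v in full.items() if v != b"v1"}
assert subset[sha1(b"v2")] == b"v2"  # still resolvable
assert sha1(b"v1") not in subset     # trimmed, not corrupted
```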
- generating ChangeLog entries: doable with a lua hook. not presently
there, but easy enough to add.
LINUX KERNEL CRITERIA (excluding stuff already mentioned):
- advanced merge conflict tool: monotone does a merge3 from the least
common ancestor (last mergepoint) and can drop you into an external
merge tool (via lua hook) of your choice if that fails. it knows how
to invoke ediff and xxdiff. I haven't written my own GUI for this.
- remote branch repositories: not clear what this means, but monotone
is fully distributed and branches can be made by anyone, at any time,
on any machine, connected or disconnected.
- per-file checkin comments: if you like, yes. you can attach changelog
certs to file versions, manifest versions, or both.
- storage of select inode metadata: you can extend the cert vocabulary
with anything you like (links, pipes, devices, owners, ...) but
you'll have to add hooks to interpret these values since they have
system-specific meaning. the default file metadata vocabulary is a
"portable interpretation" of merely file pathnames and their contents.
- "dontcommit file.c" to mark a private change: .. uh .. doable, but
it's completely a UI issue. I haven't added that. is it really needed?
- disconnected / distributed repositories: yes.
- ability to exchange changesets by email: yes, by any transport.
- patch splitting: eh.. perhaps. not obvious what to consider a
splittable entity. if you can select in-between version codes, I can
certainly split edges on those. if not, it's not clear where to put
the boundary. if you *do* split a patch / changeset though, the
system will automatically identify the endpoints of the "one big
patch" with the endpoints of "lots of little bitty patches", since
they have the same SHA1 either way you arrive there. so splitting
or aggregating doesn't break other people's work.
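a tiny demonstration of that convergence (edits and contents here are
made up): applying one big patch, or the same edits as two little
patches, lands on the same content and therefore the same SHA1
identity:

```python
import hashlib

def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

base = b"line1\nline2\nline3\n"

# one big change:
big = base.replace(b"line1", b"LINE1").replace(b"line3", b"LINE3")

# same change split into two small patches:
step1 = base.replace(b"line1", b"LINE1")
step2 = step1.replace(b"line3", b"LINE3")

assert sha1(big) == sha1(step2)  # same endpoint either way
```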
- archival of directories: no. I don't archive empty directories. I
archive pathnames of files. directories are implied by file
pathnames. I didn't feel like it was terribly worth writing code to
manage the coming and going of empty directories -- do you really
want your version to be considered different from mine just because
it has an extra *empty* directory? -- but in theory it can be added
with no fuss, just don't see a strong reason to care.
- a magical bk usage story about smooth and easy pushing and pulling
and exchanging stuff with linus: not sure. I haven't used it with
linus yet :) the theory is that this sort of scenario will work, but
who knows about the practice? let's try.
> I've talked to Graydon a bit about merging. I suppose these different
> things in arch -- star-merge, replay -- are just different ways of
> deciding how to apply patches when merging. I think that could all be
> done, in theory.
suppose we have P=parent, US=our working copy, OTHER=some other
change. "arch update" applies diff(P,US) to OTHER and writes the
result into US. "arch replay" applies diff(P,OTHER) to US. as near as
I can tell "star-merge" does a 3-way merge using the fact that US and
OTHER share P as a parent.
monotone always does a 3-way merge when it can find a parent,
regardless of branch boundaries or anything, else it does a 2-way
merge. 3-way merge is the generalization (and correction) of both
"replay" and "update" described above: it means taking X=diff(P,OTHER)
and Y=diff(P,US), adjusting all the coordinates of edits in Y so that
they are made in terms of the coordinates after X, and applying the
adjusted Y to OTHER. I don't know why Tom Lord has chosen to implement update
operations using weaker merge operators when there's a known parent;
replay is strictly *more* likely to fail than a merge3, since it's
attempting to apply patches to blocks of data which may be in new
places. maybe with unidiff context matching and a sloppy "patch"
program (hashing lines, accepting fuzz-factors) it will often work,
but why? you have the parent; you ought to use it.
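a deliberately simplified line-based merge3 sketch (not monotone's
actual algorithm -- this assumes all three versions have the same
number of lines, so it skips the coordinate-adjustment work that real
merges need for insertions and deletions). it shows the core idea:
prefer whichever side changed a line, and flag a conflict only when
both sides changed it differently:

```python
def merge3(parent, ours, theirs):
    """Naive per-line 3-way merge over equal-length line lists."""
    merged, conflicts = [], []
    for i, (p, a, b) in enumerate(zip(parent, ours, theirs)):
        if a == b:        # both agree (or neither changed it)
            merged.append(a)
        elif a == p:      # only theirs changed this line
            merged.append(b)
        elif b == p:      # only ours changed this line
            merged.append(a)
        else:             # both changed it differently: conflict
            merged.append(a)
            conflicts.append(i)
    return merged, conflicts

parent = ["a", "b", "c"]
ours   = ["a", "B", "c"]   # we edited line 2
theirs = ["a", "b", "C"]   # they edited line 3
merged, conflicts = merge3(parent, ours, theirs)
# merged == ["a", "B", "C"], conflicts == []
```

note how knowing the parent is what lets both edits land cleanly; a
2-way replay has to guess.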
anyways, if you develop a magical better merge operator, I don't think
it'll be hard to wedge it into monotone. I already have a lua hook for
handling a failed merge3. I will likely add one for overriding the
initial attempt at merge3 and supplying your own, if you've a
preference (eg. a ChangeLog merger or something). since you're always
doing merge work on your local database, you should feel reasonably
comfortable tinkering with this stuff; it won't disrupt other users
if you play with custom merge operators on your own.
> Does monotone handle file permissions and symlinks well? Those are
> actually useful to handle.
no, it doesn't. by default I didn't want to add an interpretation of
these, as they don't strike me as either (a) terribly important or (b)
terribly portable (they may have variable semantics on different
platforms). maybe it's not a hard thing to add -- either some new
certs or a change to the manifest format -- but my aim is to err on
the side of simplicity at this stage. same reason I'm not handling
empty directories at the moment. I don't see it as "in general
demand". feel free to add this if you think it's a big feature.