[Gnu-arch-users] rev. ctl and file systems
Fri, 11 Apr 2008 13:48:44 -0700
So, the general question is "should revision control be
built-in at the storage level; should it be part of the file system".
Here is a more complete description of my current best guess
as to a good answer. This answer is already present in
the design (and partially, in the implementation) of
Flower contains patent pending technology, I am obligated
to remind people, at this juncture. (The source code at
the link uses a license that the FSF describes as a free
software license and that the OSI has approved as an open source
license. It also helps to protect the value of any patents I
As I said, "yes," loosely speaking, I see the next logical step
as being to build revision control deep into the storage level.
First "why" and then "how":
"Why" is more than just a matter of convenience. It's a
matter of necessity. The reasons are the rising
importance of portable, personal computing and the rising
importance of distributed collaboration.
When we carry around computers, and especially when we
sometimes "work off-line" but then "sync-up", in effect we
are (by hand) simulating a distributed file system.
When two of us do that at the same time, remote from
one another but working on copies of the "same files" that
later we want to sync up, we are (by hand) simulating
a revision control operation.
Even without "off-line" operation: if two of us edit the same
documents stored on the web, at the same time (as in Wikis) then,
again, we need revision control functionality.
Because these kinds of activities are more and more the common
case (at least for personal communication or collaborative content),
we are reaching a situation in which we really "want" distributed,
decentralized revision control for pretty much *all* of our
There is a group of researchers who, for many years, have worked
on what I would call "conventional" approaches to (perhaps global-scale)
distributed file systems. An early cite might be, for example, the
Andrew file system (aka AFS). An interesting observation is that as they
have wrestled with the problem of scaling upwards to global scale, and
coping with networks that sometimes "partition" during a netsplit
or that simply can be very slow -- they too have (long since) discovered
that distributed, decentralized revision control is the only way to go.
So, the need for this at the storage level is indicated from top to
bottom: from what users need all the way down to what implementations
require in order to work at all.
That's enough "why" for now.
Let's talk "how".
Laurent, your message set out to nicely explore a design space
and try to map out the game tree there. That's a good approach.
But.... let's take one step back here.
Most of what you are talking about is ways to put revctl functionality
into a unix-like file system. We could imagine some future
"Linux ext5 file system" that has these new capabilities.
I doubt that that is the right approach, though the reasons are a
little unfamiliar to many:
First, memory (RAM) is inexpensive. Second, network bandwidth
tends to go up, but network latency can only go so low.
Let's look at latency. Rough distance from the San Francisco Bay
Area, where I live, to Boston, where the GNU project lives, is
Ignoring all medium and switching costs, the best possible latency
between me and Boston exceeds 14 milliseconds. In reality, it
will always be much worse than that. In contrast, the latency
cost of a system call on a local PC is something we typically measure
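
A quick sketch of that lower bound, as plain arithmetic. The distance used
here is an assumed round figure for a Bay Area to Boston great-circle path,
not a number taken from this message, and the function name is made up for
illustration. It only divides distance by the speed of light in a vacuum, so
any real link will do worse.

```python
# Lower bound on one-way network latency from distance alone.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum; fiber and copper are slower


def min_one_way_latency_ms(distance_km: float) -> float:
    """Best-case one-way latency in ms, ignoring medium and switching costs."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000.0


# ASSUMPTION: roughly 4,300 km, San Francisco Bay Area to Boston.
ASSUMED_DISTANCE_KM = 4_300.0

one_way_ms = min_one_way_latency_ms(ASSUMED_DISTANCE_KM)
round_trip_ms = 2 * one_way_ms
print(f"one-way >= {one_way_ms:.1f} ms, round trip >= {round_trip_ms:.1f} ms")
```

Under that assumed distance the one-way bound comes out a little over 14 ms,
and a request/reply pair costs twice that before any real work is done.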
Why does latency matter to the design of a unix-like file system?
Because the traditional unix API for files encourages random access,
short reads and writes, etc. It is on the basis of those properties that,
for example, MySQL or Berkeley DB can reasonably run *atop* a
unix file system rather than having to go to a "raw disk". The unix
API makes it possible and natural; the low (local) latency makes it
practical (enough). But raise that latency by an order of magnitude
or more and, suddenly, the API no longer makes practical sense.
Why does memory matter? Because that gives us an alternative: we
can do more work locally in RAM (even nvram or otherwise locally
persistent store) and, instead of using a unix-like API, just try to
read and write large chunks, infrequently.
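
To make the trade-off concrete, here is a rough cost model comparing many
short, dependent reads against one large transfer. Every number in it (payload
size, read size, bandwidth, round-trip time) is an assumption picked for
illustration, and it assumes each small read waits for the previous one to
finish, with no pipelining or read-ahead.

```python
def total_time_ms(round_trips: int, latency_ms: float,
                  payload_bytes: int, bandwidth_bytes_per_ms: float) -> float:
    """Time to move payload_bytes when it takes `round_trips` request/replies."""
    return round_trips * latency_ms + payload_bytes / bandwidth_bytes_per_ms


# All values below are assumed, for illustration only.
PAYLOAD = 1_000_000   # bytes to fetch in total
SMALL_READ = 4_096    # bytes per unix-style read() call
BANDWIDTH = 1_250.0   # bytes per ms, i.e. about 10 Mbit/s
RTT_MS = 30.0         # round-trip time on a long-haul link

small_reads = -(-PAYLOAD // SMALL_READ)  # ceiling division
chatty_ms = total_time_ms(small_reads, RTT_MS, PAYLOAD, BANDWIDTH)
bulk_ms = total_time_ms(1, RTT_MS, PAYLOAD, BANDWIDTH)

print(f"{small_reads} small reads: {chatty_ms:.0f} ms")
print(f"1 bulk read: {bulk_ms:.0f} ms")
```

The transfer term is the same in both cases; only the number of round trips
changes, which is why the chatty access pattern is the one that suffers as
latency grows.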
Another practical insight here is that Unix's meta-data standards
and transactional capabilities are anemic for today's needs; its
indexing capabilities are all but non-existent.
So, when it comes to "how" my inclination is to re-think what we
mean by "file".
The W3C's emerging architecture gives us a very natural answer:
a "file" is (roughly speaking) the kind of thing you GET in an HTTP
reply or PUT in an HTTP request. Simplifying only a little, we can
say that a "file" is, therefore:
1. An envelope, containing arbitrary XML meta-data.
2. A primary payload, containing arbitrary XML data.
3. One or more (possibly multi-media) "attachments" -- additional
My concept of a multi-forked, multi-media file here is one that
might remind you a bit of, for example, the original Macintosh
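
As a data structure, the three-part "file" described above might be sketched
like this. The class and field names are hypothetical, chosen only for this
example; they are not taken from Flower or from any W3C specification.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Attachment:
    """One additional fork: opaque bytes plus a media type (hypothetical)."""
    media_type: str
    data: bytes


@dataclass
class MultiForkFile:
    """Hypothetical model of a "file" as envelope + payload + attachments."""
    envelope: ET.Element                  # arbitrary XML meta-data
    payload: ET.Element                   # arbitrary XML data
    attachments: List[Attachment] = field(default_factory=list)


envelope = ET.fromstring("<meta><author>example</author></meta>")
payload = ET.fromstring("<doc><p>Hello</p></doc>")
f = MultiForkFile(envelope, payload, [Attachment("image/png", b"\x89PNG")])

print(f.envelope.find("author").text, len(f.attachments))
```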
Given that insight, "how":
Well, to make a long story short, my thought (in Flower) is to
leverage database technology like Berkeley DB / DBXML for
storage, transactions, and indexing --- and then to build revision
control into the API for accessing that new kind of store.
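
To give a feel for what "revision control in the storage API" could mean, here
is a toy sketch. It is not Flower's API and it does not use Berkeley DB or
DBXML: it swaps in Python's standard sqlite3 module as a stand-in transactional
store, and the class and method names are invented for this example. It shows
only one idea, that every write names the revision it was based on and a stale
base is rejected. It does nothing about distribution or merging.

```python
import sqlite3


class RevisionedStore:
    """Toy store that keeps every revision of every named item (hypothetical)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS revisions ("
            " name TEXT NOT NULL, rev INTEGER NOT NULL, body BLOB NOT NULL,"
            " PRIMARY KEY (name, rev))"
        )

    def head(self, name: str) -> int:
        """Latest revision number for name, or 0 if it has never been written."""
        row = self.db.execute(
            "SELECT MAX(rev) FROM revisions WHERE name = ?", (name,)
        ).fetchone()
        return row[0] or 0

    def put(self, name: str, body: bytes, base_rev: int) -> int:
        """Store body as the successor of base_rev; refuse if base_rev is stale."""
        with self.db:  # one transaction: commit on success, roll back on error
            head = self.head(name)
            if head != base_rev:
                raise ValueError(f"conflict: head is {head}, not {base_rev}")
            self.db.execute(
                "INSERT INTO revisions VALUES (?, ?, ?)", (name, head + 1, body)
            )
        return head + 1

    def get(self, name: str, rev: int) -> bytes:
        """Return the body stored at exactly this revision."""
        row = self.db.execute(
            "SELECT body FROM revisions WHERE name = ? AND rev = ?", (name, rev)
        ).fetchone()
        if row is None:
            raise KeyError((name, rev))
        return row[0]


store = RevisionedStore()
r1 = store.put("notes", b"first draft", base_rev=0)
r2 = store.put("notes", b"second draft", base_rev=r1)
try:
    store.put("notes", b"edit made from an old copy", base_rev=r1)
    conflict = False
except ValueError:
    conflict = True  # the writer must merge against r2 and retry
```

Old revisions stay readable after newer ones are written, so a writer who
hits the conflict still has the common ancestor available to merge against.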
That's one reason I hesitate before investing more in Arch 2.0:
I wonder if we can't render traditional unix file systems "obsolete
legacy" within 5-10 years.
Make some sense?