Re: [Gnu-arch-users] the state of the union

From: Tom Lord
Subject: Re: [Gnu-arch-users] the state of the union
Date: Wed, 18 Aug 2004 15:23:00 -0700 (PDT)

    > From: Greg Hudson <address@hidden>

    > [Disclaimer: Invading Subversion developer here.]

[Also, one of the svn project's chief database thinkers and doers
 for the past couple of years, as I understand it.]

    >> In the early days, when we collectively didn't know much, there was
    >> an actual debate to be had about how to implement file orientation:
    >> Does it require a fancy delta-compressed version-transactioned
    >> filesystem (Subversion)?  or can you get by with low-tech
    >> brute-force techniques combining compressed archives with
    >> client-side caches (Arch).  The debate is over.  We know the answer.
    >> The answer is that the Subversion approach can deliver lower
    >> command-line latency for some operations but that Arch delivers
    >> lower administration costs, better scalability, and higher
    >> throughput.

    > I think you may be prematurely closing the debate.  Please review
    > <> and, if your interest runs that
    > deep,
    > <>.

I am aware of but haven't yet explored in depth your fsfs work.  I
really do want to, actually, just haven't taken the time quite yet.

Arch users generally, not just me, should have a look at it,
particularly while considering the question of whether or not it is
the right system or the right set of ideas to use for what we've been
calling "delta-compressed revlibs".  Alas, it is probably a bit tricky
to follow ghudson's notes without a little bit of familiarity with the
"table designs" implicit in the svn fs API and explicit in the BDB-FS
back end.

I don't think that you're perpetuating the debate I claim is closed: I
think you are helping confirm that the debate is closed.   

In this thread, when I was talking about the trade-offs in
per-file-optimized storage for revision control, I was comparing svn's
BDB-FS to arch's combination of dumb-fs archives and client-side
caches and memos.   I was comparing the relatively "low tech" arch's
virtues of simplicity, maintainability, good speed in most situations,
and very low admin costs to the higher-tech bdb back end.   I was
comparing them in a broad engineering sense, considering not just how
they work but what it has taken to implement them and what it is like
to administer them.

That you feel there is a need for FSFS is, I think, a symptom of the
problems with the BDB approach.   You point out yourself, for example,
on that first web page, various ways in which FSFS simplifies
administration compared to BDB.

Just because the BDB approach has all of these problems doesn't mean
that the only choice would be for "everyone go work on arch".  Yes, I
think it *is* a bug in the community to keep the various free revctl
projects so separate, this late in the game, when it is clear from
many examples that they are all headed to roughly the same place ---
but that isn't the debate I was talking about in the quote above.
The quote above is just one reason why that question about cooperation
is not, in this case, an idle or immaterial theoretical concern but is
a real practical question.

    > The salient point is that svn's back end can, like arch's, run over a
    > networked filesystem.  Although there is no application-level support
    > in Subversion for treating HTTP/FTP/scp services as dumb file
    > transport and running an FSFS repository over them, that seems like an
    > implementation detail.  (To date, there hasn't been any demand for
    > such a feature.)  So a back end of svn's design can, at least in
    > principle, be as easy to administer as arch purportedly is and scale
    > as well as arch purportedly can.

As long as you are basing your transactions on simple uses of `rename'
and avoiding local filesystem things like `flock' then it is trivially
true that you can operate on a dumb fs over just about any fs-ish
transport that hasn't been implemented *too* incorrectly.
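
To make the `rename' point concrete, here is a minimal sketch of a
commit that uses nothing but mkdir, write, and rename.  It is not
arch's actual code: the function name, the ",,staging-" prefix, and
the one-directory-per-revision layout are assumptions made up for
illustration, and it assumes the transport's rename behaves like POSIX
rename (renaming a directory onto an existing non-empty directory
fails).

```python
import os
import shutil
import tempfile


def commit_revision(archive_dir, revision_name, files):
    """Publish a revision directory using only mkdir, write, and rename.

    `files` maps file names to bytes.  All names here are hypothetical.
    """
    if not files:
        # A non-empty directory is what makes a competing rename fail.
        raise ValueError("a revision must contain at least one file")

    final_path = os.path.join(archive_dir, revision_name)
    # Stage inside the archive so the rename stays within one filesystem.
    staging = tempfile.mkdtemp(prefix=",,staging-", dir=archive_dir)
    try:
        for name, data in files.items():
            with open(os.path.join(staging, name), "wb") as f:
                f.write(data)
        # The rename is the commit point: readers see either no revision
        # or the complete one.  If the revision already exists (and is
        # non-empty), POSIX rename fails and this commit loses cleanly.
        os.rename(staging, final_path)
    except OSError:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final_path
```

No `flock' is involved: two writers racing for the same revision name
are arbitrated by the rename alone, and the loser gets an error.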

(And again, that you are building such a back end confirms, rather
than refutes, what I'm saying about debates being over.)

One thing I noticed while skimming the FSFS design document
("structure") is that some of the files in your back-end are
indefinitely mutable (the one that caught my eye was something about
"revision properties", I believe).

Mutable files like that complicate replication, backups, and integrity
checking, at least.

One virtue of arch's approach is that the core archive is, in essence,
a (partially ordered) transaction journal and nothing more.   Each
commit-like operation bundles up the parameters of its transaction,
stores that bundle in the archive --- and that's it, the commit is done.

While pessimal (speed wise) for many (though, surprisingly, not all)
kinds of access patterns, the txn-journal approach dovetails nicely
with the idea of constructing ex-repository ancillary databases to
optimize for the underserved access patterns.  Core archives, being
only incrementally appended journals, are easy and efficient to
replicate, archive, monitor for tampering or accidental corruption,
etc.  Client-side caches and memos are a flexible solution that scales
arbitrarily with the number of clients.  Our txn journal entries are
in the ballpark of optimally small: a little bit of bandwidth out of a
dumb-fs arch archive can serve a large number of clients.
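
As a rough illustration of why an append-only journal is cheap to
replicate, here is a sketch of a mirror step that copies only the
entries the mirror does not yet have.  The function name, the
",,incoming-" staging prefix, and the one-directory-per-entry layout
are hypothetical, not arch's real archive format or mirroring code.

```python
import os
import shutil


def mirror_new_entries(source, mirror):
    """Copy journal entries present in `source` but absent from `mirror`.

    Entries are assumed to be immutable directories, so an entry that is
    already in the mirror never needs to be looked at again.
    """
    os.makedirs(mirror, exist_ok=True)
    have = set(os.listdir(mirror))
    copied = []
    for entry in sorted(os.listdir(source)):
        # Skip what we already have and any in-progress staging junk.
        if entry in have or entry.startswith(",,"):
            continue
        staging = os.path.join(mirror, ",,incoming-" + entry)
        # Clear leftovers from an interrupted earlier run.
        shutil.rmtree(staging, ignore_errors=True)
        shutil.copytree(os.path.join(source, entry), staging)
        # Publish the copied entry in one step.
        os.rename(staging, os.path.join(mirror, entry))
        copied.append(entry)
    return copied
```

The cost of each run is proportional to what was committed since the
last run, not to the size of the archive.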

There's one reason why I think it might be helpful for some archers to
dig in and thoroughly grok your FSFS: we might be able to use some of
those ideas for "client-side caches and memos".

    > (In reality, Subversion's user base is generally happy to run a
    > server; the safety benefits of not allowing committers to corrupt the
    > repository generally outweigh the security benefits of reusing an
    > existing server code base.  But that's neither here nor there.)

I don't see any safety benefits for most users -- just the opposite.

Many svn users will be their own admins -- the committer is very much
invited to hork the database.

Dedicated admins running a BDB-FS server can have an easy time of
mismanaging BDB logs and finding themselves up the creek when recovery
is needed.

Worst of all, though, how are your users supposed to _recognize_ when
a Subversion archive has been corrupted?  I mean sure, if commands
actually start failing then it's pretty obvious but, otherwise, it's
just a kind of Orwellian situation where you can't be too sure that
history hasn't been mucked with.  (Another nice virtue of txn-logs as
core archive: they're static and so they can be signed once and
verified indefinitely.)
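
A sketch of what "recorded once, verified at any later time" can look
like for static journal entries.  This uses plain SHA-256 digests as a
stand-in for real signatures (in practice one would sign the recorded
digests with something like GPG); the function names and the
one-directory-per-entry layout are assumptions for illustration, not
arch's actual mechanism.

```python
import hashlib
import os


def digest_entry(entry_dir):
    """Return a SHA-256 digest over the names and contents of an entry."""
    h = hashlib.sha256()
    for root, dirs, files in os.walk(entry_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                data = f.read()
            rel = os.path.relpath(path, entry_dir)
            # Include name and length so file boundaries are unambiguous.
            h.update(rel.encode("utf-8") + b"\0")
            h.update(str(len(data)).encode("ascii") + b"\0")
            h.update(data)
    return h.hexdigest()


def record_digests(archive_dir):
    """Digest every entry once, e.g. right after it is committed."""
    return {entry: digest_entry(os.path.join(archive_dir, entry))
            for entry in sorted(os.listdir(archive_dir))}


def find_changed(archive_dir, recorded):
    """Return recorded entries that are missing or no longer match."""
    changed = []
    for entry, digest in sorted(recorded.items()):
        path = os.path.join(archive_dir, entry)
        if not os.path.isdir(path) or digest_entry(path) != digest:
            changed.append(entry)
    return changed
```

Because entries never change after commit, any mismatch means
tampering or corruption; a mutable file offers no such fixed point to
check against.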

Of course, security benefits aren't the only reason we like using
venerable dumb-fs servers for arch.  They also tend to be tiny,
simple, and lightweight.  That way we don't wind up overcommitted to
some huge code base we wind up not liking.

