Re: [Monotone-devel] RFE: Hard Barrier Between Branches in Netsync


From: Nathaniel Smith
Subject: Re: [Monotone-devel] RFE: Hard Barrier Between Branches in Netsync
Date: Sat, 3 Mar 2007 00:42:20 -0800
User-agent: Mutt/1.5.13 (2006-08-11)

On Sat, Mar 03, 2007 at 01:02:34AM +0100, Ulf Ochsenfahrt wrote:
> Nathaniel Smith wrote:
> >On Fri, Mar 02, 2007 at 03:37:50PM +0100, Ulf Ochsenfahrt wrote:
> *snip*
> >Note that the way netsync is currently set up, every new revision is
> >first sent without any branch info, and then the branch info is sent
> >for that revision.  So effectively every branch cert you send looks
> >like you are trying to steal permission to look at a pre-existing
> >revision.
> >
> >I'm not sure how to solve this -- I suppose each connection would
> >have to track in memory the complete set of revisions that read access
> >was granted to, and also all revisions that have been sent down this
> >connection?
> 
> Only if netsync keeps that ordering. If netsync orders revisions and 
> branch certs such that the relevant info is closer to each other, then 
> the receiving side can already write a bunch of stuff to the database. 
> That might make netsync more complicated though. I presume that the 
> revisions are already ordered such that the oldest stuff comes first?

Yes, the idea is that the connection can drop at any time, and
whatever we've received-to-date will leave us in a valid state.
(There have been questions at some point whether this is the best
possible design -- freer ordering of stuff being sent lets you
optimize your access pattern more -- but that's how it works now.)
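
To illustrate the ordering property described above, here is a minimal
sketch in Python. It is not monotone's code: the `parents` mapping and
both function names are hypothetical, and it only shows what
"ancestors first" buys you, namely that any prefix of the stream is
closed under parents.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def ancestors_first_order(parents):
    """Order revision ids so every revision comes after all of its parents.

    `parents` maps a revision id to an iterable of its parent ids
    (a hypothetical in-memory stand-in for the history graph).
    """
    # TopologicalSorter expects {node: predecessors} and yields
    # predecessors before the nodes that depend on them.
    return list(TopologicalSorter(parents).static_order())


def is_closed_under_parents(received, parents):
    """True if every parent of a received revision was also received."""
    have = set(received)
    return all(p in have for rev in received for p in parents.get(rev, ()))


if __name__ == "__main__":
    history = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
    order = ancestors_first_order(history)
    # Simulate the connection dropping after each item in turn.
    for n in range(len(order) + 1):
        assert is_closed_under_parents(order[:n], history)
```

A freer ordering would make some of those prefix checks fail, which is
the trade-off mentioned in the parenthetical above.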

We're certainly allowed to rearrange netsync, though.  We know that
what we have now needs to be mostly rewritten sooner or later
anyway...

> >I don't know if there are other similar security problems -- netsync
> >is complicated.  But I guess if you want this feature, you should
> >make sure you know that you have thought of everything :-).  Everyone
> >would be perfectly happy to see more capable access controls for
> >netsync, it's just not clear how to actually _get_ such a thing
> >without redesigning whole chunks.
> 
> Conceptually, I think there is an easy way to think about this: let 
> netsync simply not 'see' revisions that the other side doesn't have read 
> access to at the db abstraction layer, and block branch certificates for 
> branches that the other side doesn't have write access to. The only 
> problem is that you could end up with revisions without a branch cert.

There isn't really any such abstraction layer ATM, but nod.
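
A rough sketch of what such a per-connection view might look like,
combining the filtering idea quoted above with the earlier suggestion
of tracking which revisions were sent down the connection. Everything
here is hypothetical (the `ConnectionView` class, its methods, and the
`branches_of` mapping); none of it exists in monotone.

```python
class ConnectionView:
    """Hypothetical per-connection filter over revisions and branch certs."""

    def __init__(self, readable_branches, writable_branches, branches_of):
        self.readable = set(readable_branches)
        self.writable = set(writable_branches)
        # Assumed lookup: revision id -> set of branch names in the local db.
        self.branches_of = branches_of
        # Revisions the peer itself sent during this connection.
        self.sent_by_peer = set()

    def can_see(self, rev):
        """The peer may see a revision it sent, or one on a readable branch."""
        if rev in self.sent_by_peer:
            return True
        return bool(self.branches_of.get(rev, set()) & self.readable)

    def receive_revision(self, rev):
        """Record a revision arriving from the peer (before its branch cert)."""
        self.sent_by_peer.add(rev)

    def accept_branch_cert(self, rev, branch):
        """Accept a cert only for a writable branch on a visible revision."""
        if branch not in self.writable:
            return False
        return self.can_see(rev)
```

Note that a rejected cert in this sketch leaves the already-received
revision in place with no branch cert, which is the leftover problem
mentioned in the quoted text.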

> >BTW, the sourceforge example is a red herring for other reasons -- all
> >our code assumes that network operations are generally database-wide,
> >so their efficiency tends to be O(whole database), not O(subsets of
> >database involved in this particular sync).
> 
> I can't argue with you here, but I was under the impression that the 
> merkle trie is only built for the branch pattern that is to be synced.

It is, but many algorithms in monotone, for instance, slurp the entire
history graph into memory (rather than loading it piecemeal as
necessary and taking an IO latency hit for each node one at a time).
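
The two access patterns can be contrasted with a small sketch using
Python's sqlite3 module. The `ancestry` table and the function names
are made up for illustration and are not monotone's actual schema or
code. The eager version issues one query whose cost grows with the
whole table; the lazy version issues one query per visited node.

```python
import sqlite3


def make_db(edges):
    """Build an in-memory database holding (parent, child) edges."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ancestry (parent TEXT, child TEXT)")
    conn.executemany("INSERT INTO ancestry VALUES (?, ?)", edges)
    return conn


def ancestors_eager(conn, rev):
    """Load every edge in one query, then walk the graph in memory."""
    parents = {}
    for parent, child in conn.execute("SELECT parent, child FROM ancestry"):
        parents.setdefault(child, []).append(parent)
    seen, todo = set(), [rev]
    while todo:
        for p in parents.get(todo.pop(), ()):
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen


def ancestors_lazy(conn, rev):
    """Walk the graph with one database query per visited node."""
    seen, todo = set(), [rev]
    while todo:
        rows = conn.execute(
            "SELECT parent FROM ancestry WHERE child = ?", (todo.pop(),)
        )
        for (p,) in rows:
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen
```

Both functions return the same set of ancestors; they differ only in
how much of the database they touch and how many round trips they make.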

> > This is not particularly
> >fixable -- trying to sieve out 100 megabytes of relevant data from a
> >multi-gigabyte (or multi-terabyte, for sourceforge) database is never
> >going to be fast; you need some kind of lower level data partitioning.
> 
> Google performs fairly well for multi-???byte queries, although I don't 
> know if that is a valid comparison.

Google also has the entire WWW cached in RAM -- so no, the comparison
is not entirely valid :-).  More importantly, they are not doing
queries against a giant, flat, general-purpose RDBMS.  That would be
totally impossible.  Instead, they have a storage layout tuned to match
the algorithms they use, etc.  Splitting projects that are synced
separately into separate arenas (files), then only supporting general
purpose querying within those arenas, is exactly the sort of thing
Google does (in spirit, not in detail).

> On the other hand, if netsync is extended to handle multi-db syncs on a 
> single connection, wouldn't that solve that very same problem?
> With the downside that identical data can't be shared among dbs.

Note that you've just scaled up the size of the project you propose by
another factor of ten -- monotone's code is currently in no way set up
to let you access multiple databases within a single run.  (Again, not
saying that this means what you propose is impossible, just trying to
give you the information you need to make realistic assessments about
what to do...)

> >So in any kind of large hosting situation, you would certainly be
> >using some sort of vhost support and multiple databases anyway.  Also,
> >sourceforge probably wants a security model that is simple to analyze,
> >which netsync-based security is unlikely to be...
> 
> Is there any documentation on the database layout/netsync protocol?

Database layout: schema.sql and database.cc :-)
  (it's pretty straightforward, though, schema.sql should give you the
   idea)

Netsync protocol: there are some giant comments in netsync.cc that
  AFAIK are mostly up to date.  That and the code.

(To be honest, code is generally the clearest language for writing
such documentation anyway.)

-- Nathaniel

-- 
So let us espouse a less contested notion of truth and falsehood, even
if it is philosophically debatable (if we listen to philosophers, we
must debate everything, and there would be no end to the discussion).
  -- Serendipities, Umberto Eco



