Re: [Monotone-devel] partial pull #2 - gaps instead of a single horizon


From: Nathaniel Smith
Subject: Re: [Monotone-devel] partial pull #2 - gaps instead of a single horizon
Date: Tue, 29 May 2007 23:49:32 -0700
User-agent: Mutt/1.5.13 (2006-08-11)

On Tue, May 29, 2007 at 02:42:03PM +0200, Markus Schiltknecht wrote:
> Nathaniel Smith wrote:
> >I assumed that in the normal case, one would do a partial pull to
> >fetch the last chunk of history, but then only append to that going
> >forward -- i.e., partial pull databases still grow over time at
> >exactly the same rate as other databases, they just start out smaller.
> >That seems like what most users would want, anyway.
> 
> Yes, that's probably sufficient for most people. But how do you explain 
> to them that partial pull works only once? That they have to throw away 
> what they currently have, just to fetch the newest, bleeding edge revision?

...I don't think we're communicating, because I have no idea what
you're talking about :-).  Obviously I am not being clear either, so
let me lay out my understanding again...

In my world, the reason we need partial pull is that the total history
size of a project grows without bound.  Therefore, for very large and
old projects (Linux kernel, *BSD, Mozilla, gcc, glibc, maybe a few
others), the full history database may be many times larger than a
checkout.  It is unreasonable to expect new developers to, before
writing their first patch, download several gigabytes of data.
However, even for such projects, the actual rate of new history being
added is not *too* high; the problem comes from the long history.
And, if I am following such a project, it is not unreasonable for me
to download, each week, whatever happened that week.  So incremental
updates are bounded and not a problem; only the initial pull size
grows without bound.

So my imagined use case is that a new developer says
  mtn clone netsync://mtn.project.org --restrict-last 1000
which fetches only revisions up to depth 1000 from all heads, and sets
the horizon to be whatever revisions have depth exactly 1000.
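
A rough sketch of that horizon computation, assuming "depth" means the
shortest distance, in parent edges, from any head. The function name
horizon_at_depth and the dict-of-parents graph are illustrative only,
not monotone code, and --restrict-last is the hypothetical option from
the example above.

```python
from collections import deque


def horizon_at_depth(parents, heads, depth):
    """Return (kept, horizon) for a partial pull of the given depth.

    parents maps a revision id to a list of its parent ids (assumed
    representation).  kept is every revision within `depth` parent
    edges of some head; horizon is the subset at exactly `depth`.
    """
    dist = {h: 0 for h in heads}
    queue = deque(dist)
    while queue:
        rev = queue.popleft()
        if dist[rev] == depth:
            continue  # horizon revision: do not walk to its parents
        for parent in parents.get(rev, ()):
            if parent not in dist:
                # breadth-first order makes this the shortest distance
                dist[parent] = dist[rev] + 1
                queue.append(parent)
    horizon = {rev for rev, d in dist.items() if d == depth}
    return set(dist), horizon


if __name__ == "__main__":
    # Linear history r0 <- r1 <- r2 <- r3 <- r4, with r4 the only head.
    chain = {"r4": ["r3"], "r3": ["r2"], "r2": ["r1"], "r1": ["r0"], "r0": []}
    kept, horizon = horizon_at_depth(chain, ["r4"], 2)
    print(sorted(kept), sorted(horizon))
```

If the history is shorter than the requested depth, the horizon comes
out empty and the result is just an ordinary full pull.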

Later, they want to pull as normal, so they just do
  mtn pull
and this fetches all new revisions since their initial pull.  The
horizon does not move.

They may wish to at some point do
  mtn pull --restrict-last 2000
to fetch more history.  This asks the server what the new horizon
should be, moves the horizon there, and fetches intermediate stuff.
(It also effectively forces a full regenerate_rosters.)

> >Not that this is really relevant to the question of gaps, because
> >a rolling history window is still a contiguous history window.  I
> >might not be understanding what you mean.
> 
> I'm thinking of the result of a partial pull as a repository having a 
> gap between the root and the horizon. In that sense, such a repository 
> is not a single contiguous history window, because it also has a root. 
> You would have to _replace_ the root with the sentinel revision ids to 
> get a repository to be a contiguous history window.

By root do you mean the first commit (which in a partial pull we don't
have), or the magic root revision [] (which doesn't actually exist)?
How is this discontiguous?

(By contiguous I mean in particular the property that if we have
revision A and revision B, and A is an ancestor of B, then we have all
revisions that are _between_ A and B.  Put another way, the contents
of a database should always be a convex set.  Convex sets turn out to
be a totally awesome concept -- see the new uncommon ancestors code
for another example...)
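
The convexity property can be checked mechanically. The following is a
minimal sketch, assuming the full ancestry graph is available as a
dict mapping each revision to its parents; is_convex and reachable are
made-up names for illustration, not monotone functions. A revision
lies "between" two held revisions exactly when it is an ancestor of
one and a descendant of another, so the set is convex when no such
revision is missing.

```python
def reachable(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen = set()
    stack = list(start)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_convex(parents, have):
    """True if `have` contains every revision between any two members."""
    have = set(have)
    # Invert the parent map so we can also walk towards descendants.
    children = {}
    for rev, rev_parents in parents.items():
        for parent in rev_parents:
            children.setdefault(parent, []).append(rev)
    ancestors = reachable(have, parents)
    descendants = reachable(have, children)
    # Anything that is both below and above a held revision must be held.
    missing_between = (ancestors & descendants) - have
    return not missing_between


if __name__ == "__main__":
    chain = {"r4": ["r3"], "r3": ["r2"], "r2": ["r1"], "r1": ["r0"], "r0": []}
    print(is_convex(chain, {"r2", "r3", "r4"}))  # contiguous window
    print(is_convex(chain, {"r0", "r1", "r4"}))  # window with a gap
```

A single-horizon partial pull always passes this check; a database
with a gap in the middle of its history does not.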

In my version, sentinels basically become roots.
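
One way to picture "sentinels become roots" is a history walk that
stops at horizon revisions the same way it stops at the root today.
This is only a sketch under assumptions: walk_history and the
parents dict are invented for illustration, and the horizon argument
stands in for whatever the database would report as its horizon.

```python
def walk_history(parents, start, horizon=frozenset()):
    """Walk ancestry from `start`, treating horizon revisions as roots.

    Returns (visited, hit_horizon): every revision reached, and the
    horizon revisions where the walk fell off the end of known history.
    """
    visited = {start}
    hit_horizon = set()
    stack = [start]
    while stack:
        rev = stack.pop()
        if rev in horizon:
            # Known id, but its parents are not in this database.
            hit_horizon.add(rev)
            continue
        for parent in parents.get(rev, ()):
            if parent not in visited:
                visited.add(parent)
                stack.append(parent)
    return visited, hit_horizon


if __name__ == "__main__":
    chain = {"r4": ["r3"], "r3": ["r2"], "r2": ["r1"], "r1": ["r0"], "r0": []}
    visited, hit = walk_history(chain, "r4", horizon={"r2"})
    print(sorted(visited), sorted(hit))
```

With an empty horizon the walk behaves as it would on a full database;
a non-empty hit_horizon is the hook where a command such as log could
print an extra message without changing the traversal itself.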

> >Anyway, from reading this thread so far, I'm not at all convinced
> >that gaps are useful, or anything other than heinously complex to
> >implement and document.  Ordinary single-horizon partial pull is
> >conceptually straightforward, because all of our code _already_ knows
> >how to deal with the horizon -- right now the "horizon" is hard-coded
> >to always be "the root of the graph", but you can handle a lot of
> >cases by just going through and tweaking all the loops that stop when
> >they see the magic [] revision, to instead stop when they see a
> >revision in the magic db.get_horizon() list.  I have no similar
> >intuition about how to implement gaps; it seems like they'd need a
> >whole pile of new machinery everywhere.
> 
> I disagree and claim the opposite: gaps, represented as ordinary 
> revisions, are easier to implement, as all the code already knows how to 
> deal with them. ;-)

Only if they _are_ ordinary revisions.  Unfortunately, history
representations are not a place where picking a representation at
random and hoping turns out to work very often :-(.

Arbitrary-number-of-parent revisions are one thing; they at least make
sense.  How are you planning to create synthetic revisions for
arbitrary numbers of revisions on the "bottom" end of the gap?  Note
that to express the right lifecycle and mark semantics between these
revisions, you may need to postulate arbitrarily many extra
intermediate synthetic revisions in the middle... I think someone gave
an example downthread?

> There is an important difference between the root and a sentinel: a 
> sentinel has a valid revision id and it can be replaced with real 
> revision data, while the root can not. That's why we should print 
> something for the user, when we hit a sentinel, but not when hitting the 
> root.

Yeah -- I agree things like log, annotate, etc. should have some
special handling for when they fall off the end of known history.  I
was just hoping that special handling could be added incrementally,
and would mostly involve printing an extra message or something, not
needing to alter the actual algorithms.

> My reasoning was: if we are going to implement sentinels for gaps 
> between a horizon and the root, why not do it right and add support for 
> gaps of any range? It's only marginally more work, while being a much 
> more general solution.

Generality is only good if it is for a purpose.  I'd still like to see
some use case for why I would want to have history from 1990-1992,
2000-2001, and 2005-present together in a database.

> >As for their utility, the most compelling use case seems to be that
> >they could be used for the "obliterate" case ("the judge said we can't
> >distribute this code anymore").  Are gaps really the right thing,
> >though?  I don't have a great intuition here, but it seems to me that
> >in the cases where you do have to delete such code, it's often stuff
> >that was around for a while undetected, but restricted to a single
> >part of the codebase -- so you don't want to delete your entire
> >history between versions 1.1 and 1.3, you want to delete just that one
> >file across that range.  And gaps won't help with that anyway.
> 
> Agreed.
> 
> How about hot servers? Or multiple partial pull? Or later reduction of 
> your local repository size?
> 
> Most people, me included, do not understand why they absolutely must 
> have the complete history of their repositories on all of their working 
> machines. Gaps give us the freedom to choose how much data we want to 
> carry - not only for an initial pull, but at any point in time.
> 
> >Am I missing something?
> 
> Uhm.. probably not, but some code can't help... I've just checked in 
> what I currently have. Have a look at n.v.m.gaps, it includes a test 
> doing a successful checkout, log and a log --diffs on a partial 
> repository, providing meaningful output for the missing revision.

...but code is a strong argument, I certainly don't have any :-).

-- Nathaniel

-- 
Eternity is very long, especially towards the end.
  -- Woody Allen



