From: Nathaniel Smith
Subject: cvs branch reconstruction (was Re: [Monotone-devel] Re: big repositories inconveniences (partial pull?))
Date: Thu, 24 Aug 2006 19:20:47 -0700
User-agent: Mutt/1.5.12-2006-07-14

On Thu, Aug 24, 2006 at 03:24:18PM +0200, Markus Schiltknecht wrote:
> Nathaniel Smith wrote:
> >My memory of the discussion before is not that it was rejected for not
> >being like cvs2svn.  Just, if you're making up your own algorithm,
> >we'd like to see a description and justification of it so we have a
> >chance to apply some of the collective brain power here to making sure
> >it makes sense.  Because, well, I doubt _anyone_ is smart enough to
> >invent a complete and correct CVS reconstruction algorithm without
> >some help noticing where they forgot nasty edge cases :-).  (Certainly
> >I'm not.)
> 
> Maybe. However, I don't feel like making up my own algorithm for that. I 
> just thought maybe this change I did could already be sufficient. But I 
> know now it's not. So I will try to do something closer to what cvs2svn 
> does.

Okay.

> >And, to make the process a little easier, cvs2svn is a very good place
> >to look, because they've done a lot of that work to find all the
> >approaches that _don't_ work already, so hopefully we could piggyback
> >on that.
> 
> ..yeah, I have already included their design-notes.txt into the 
> repository (uh... is that license compatible at all?) and added my own 
> comments about how I did it for mtn cvs_import.

Hmm... it looks like SVN has an advertising-like clause in its license
(WTF?), plus a "you may not use the word 'tigris' in your project
name" requirement that goes way beyond trademark law, so no, it's
probably not GPL-compatible.

This isn't a huge problem, because it's not like we're going to
compile design-notes.txt and link it with the rest of our code anyway
:-).  (Though don't copy/paste from it into source code comments.)  It
would probably still be better not to redistribute it in the monotone
distribution -- perhaps by the time this lands on mainline, we could
just provide a link, along with our own description of what we do?
(We probably want one of those anyway, that isn't spending all its
time talking about what formats intermediate data files are written
out in, and how the 'sort' program is called...)

Of course, if it's useful to mark up a copy on the branch, feel free
-- there's definitely no need to write up a whole document on how
things work, before we have even figured out how things work :-).

> >I am a bit curious about this sql table for tracking them, though;
> >it doesn't make a whole lot of sense to me at first glance.  There's
> >some question about storing it on disk in the first place --
> 
> We need to store some information on disk to help speed up later 
> resyncs.

Ah, I see, this is for incremental re-imports -- I missed that part.
You may want to decide to either get incremental imports working
first, or get branch reconstruction working first, and start out
concentrating on just one of them.

cvs2svn doesn't write things to disk in order to support incremental
re-imports.  (IIUC, it doesn't support them at all.)  It writes them
to disk so, if an import is interrupted (like by a power failure or
something), you can restart it.  That's what I assumed you were trying
to achieve by writing this thing to disk.  (And this is the goal that
doesn't seem very important to me, at least at this stage.)

> I'm not sure if it's this RCS version <-> file_id mapping which 
> helps most. Of course as it's a separate table (as is) a resync could 
> only happen on the database which also did the very first import.

I'm not sure either.  Again, Christof is probably the one to talk to;
my understanding is that he has a scheme for storing this information
in monotone certs, and though this is really ugly, no-one's managed to
come up with a better scheme after many months of trying.

> >everything else cvs_import does is in-memory, which might not be
> >ideal, but it hasn't seemed to cause any problems yet, and fixing it
> >will take more than moving one single data structure onto disk.  
> 
> Why not? What more does it take? Do you want to have such information 
> netsynced to other repositories?

If the goal is to be able to incrementally re-import, netsyncing with
other repositories would definitely be handy :-).  This is one of the
reasons that Christof's cvssync works the way it does.

However, I just meant that _if_ you wanted to be able to resume a
failed cvs_import (and apparently you don't), you would have to move
the existing data structures we use to disk, not just this one new
one.

> >More
> >than that, though, it seems unlikely that a file_id<->rcs number
> >mapping is what you're actually looking for?
> 
> Like I said, I don't know.
> 
> >Recall that a file_id simply identifies a bitstring -- it does not
> >correspond uniquely to any particular "file" in any particular
> >revision.  In fact, a given revision may contain many files, that all
> >have the same file_id (because they happen to have the same content).
> 
> Aha. And from a file_id you cannot get the filename, then? So this 
> should better be called 'stream_id'?

Yeah.  The name is definitely a bit confusing.  (It's _slightly_ more
meaningful than something like 'stream_id', because it does
specifically state that the bitstring in question is being used as file
data, rather than, say, manifest data -- maybe the really correct name
would be id_for_a_stream_that_is_used_as_a_file, or something like
that.)
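
To make the point about file_ids concrete, here is a minimal sketch in
Python.  It is not monotone's code; it only assumes that a file_id is
the SHA-1 of the file's contents, and the paths and contents are made
up for illustration.

```python
import hashlib


def file_id(content: bytes) -> str:
    """Assumed here: a file_id is just the SHA-1 of the file's bytes."""
    return hashlib.sha1(content).hexdigest()


# A hypothetical revision: path -> content.  Two paths share content.
revision = {
    "COPYING": b"license text\n",
    "doc/COPYING": b"license text\n",
    "README": b"read me\n",
}

# Invert the mapping: file_id -> every path carrying that content.
paths_by_id = {}
for path, content in revision.items():
    paths_by_id.setdefault(file_id(content), []).append(path)

for fid, paths in paths_by_id.items():
    print(fid[:8], sorted(paths))
```

Three paths collapse to two ids, so going from a file_id back to "the"
filename is not well defined.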

> >Similarly, a rcs number is not useful on its own; every rcs file has
> >some revision numbered 1.1, for instance... unless we're somehow
> >mashing the rcs filename and the rcs version number together into a
> >single string in this table, I don't see how it can be useful?
> 
> Yeah, we probably need the filename, too.
> 
> Like I said, it's just my 'scratch pad' thing. And it 'works' - at least 
> so far as it does write out the (to some extent useless) RCS -> file_id 
> mapping.

Sure.  I don't mean to stop your playing around -- I generally go
through all sorts of horrible, stupid designs in order to discover one
that's actually usable :-).  I just realized that there are some
traps here (like what a 'file_id' actually means), so I figured I
might be able to save you a bit of time by pointing them out now,
instead of letting you discover them on your own :-).
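
As a rough illustration of the rcs number trap, here is a sketch in
Python.  The table layout, names, and file contents are hypothetical,
not what cvs_import does; it only shows why the key probably needs the
RCS filename as well as the revision number.

```python
import hashlib


def file_id(content: bytes) -> str:
    """Assumed here: a file_id is just the SHA-1 of the file's bytes."""
    return hashlib.sha1(content).hexdigest()


# Hypothetical import records: (rcs filename, rcs revision, content).
records = [
    ("src/foo.c,v", "1.1", b"int foo;\n"),
    ("src/bar.c,v", "1.1", b"int bar;\n"),
]

by_revision = {}   # keyed on the rcs number alone
by_file_rev = {}   # keyed on (rcs filename, rcs number)

for rcs_path, rcs_rev, content in records:
    fid = file_id(content)
    by_revision[rcs_rev] = fid             # the second 1.1 overwrites the first
    by_file_rev[(rcs_path, rcs_rev)] = fid

print(len(by_revision), len(by_file_rev))
```

Keyed on the revision number alone, the two 1.1 entries collide and
only one survives; keyed on the pair, both are kept.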

-- Nathaniel

-- 
In mathematics, it's not enough to read the words
you have to hear the music



