Re: Goals for repo conversion day

emacs-devel

From:	Eric S. Raymond
Subject:	Re: Goals for repo conversion day
Date:	Sun, 26 Jan 2014 19:33:12 -0500
User-agent:	Mutt/1.5.21 (2010-09-15)
Eli Zaretskii <address@hidden>:
>                                       Why this doesn't worry you,
> and why you still refuse to accept that maybe, just maybe, this is a
> lot of effort for a relatively small gain, is beyond me.  If this is
> in any way indicative of the other problematic issues of the
> conversion, then "Houston, we have a problem", indeed.

What I refuse to accept is doing a job that is below my standards of
quality, if I'm going to do it at all.  You cannot argue me out of
that by telling me it's too much work, because I simply don't accept
that as a valid reason to settle for slipshod results.  Instead, I
upgrade my tools.

> I found that at least these ones are missing:
> 
>   lisp/ChangeLog.15 references 103083
>   lisp/ChangeLog.16 references 103471 and 107149
>   src/ChangeLog.12 references 104015 and 103913

Thank you for finding these.  This is a useful bug report.

To illustrate my methods, I fixed this by adding those revnos to the 
ChangeLog section of the map file I enclosed in my last mail (it is
the file FOSSILS in my conversion directory).  Then I ran a Python
script called 'decorate.py' that patched in the corresponding action
stamps.  The point is that I didn't have to do the lookup by hand; the
fixup took less time to do than to describe.

The map that decorate.py uses is in turn generated by a second script,
bzrlog2map, that filters the output of bzr log --levels 0 into an
association between revnos and action stamps.  Here are the first
few lines:

116082  2014-01-20T16:55:address@hidden
116081  2014-01-20T16:47:address@hidden
116080  2014-01-20T08:52:address@hidden
116079  2014-01-20T08:45:address@hidden
116078  2014-01-20T08:15:address@hidden
116077  2014-01-20T07:56:address@hidden
116076  2014-01-20T01:21:address@hidden
116075  2014-01-20T00:54:address@hidden
116074  2014-01-19T16:59:address@hidden
116073  2014-01-19T15:42:address@hidden
116072  2014-01-19T13:28:address@hidden
115426.2.11     2014-01-19T13:27:address@hidden
115426.2.10     2014-01-19T13:26:address@hidden
115426.2.9      2014-01-19T12:42:address@hidden
115426.2.8      2014-01-18T00:24:address@hidden
115426.2.7      2014-01-18T00:24:address@hidden

This is MAP in my conversion directory; I rebuild it occasionally
to be sure new revs are included.
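
For readers who want to see the shape of such a filter, here is a rough
Python sketch. It is not the actual bzrlog2map script. It assumes the
"revno:", "committer:" and "timestamp:" field lines that bzr log prints
(as in the quoted log entry further down this message), and it assumes an
action stamp has the form UTC-timestamp!committer-email. The function
names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Field lines as they appear in `bzr log` output. Merged revisions are
# indented, so leading whitespace is allowed.
FIELD = re.compile(r"^\s*(revno|committer|timestamp):\s*(.*)$")
EMAIL = re.compile(r"<([^>]+)>")


def action_stamp(timestamp, committer):
    """Build 'YYYY-MM-DDTHH:MM:SSZ!email' from two bzr log field values."""
    # Drop the weekday name so parsing does not depend on the locale.
    _, rest = timestamp.split(" ", 1)
    when = datetime.strptime(rest, "%Y-%m-%d %H:%M:%S %z")
    utc = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    match = EMAIL.search(committer)
    email = match.group(1) if match else committer.strip()
    return f"{utc}!{email}"


def log_to_map(log_text):
    """Return a list of (revno, action stamp) pairs in log order."""
    pairs = []
    entry = {}
    for line in log_text.splitlines():
        m = FIELD.match(line)
        if not m:
            continue
        key, value = m.groups()
        if key == "revno":
            # A new entry starts here; drop a trailing "[merge]" marker.
            entry = {"revno": value.split()[0]}
        else:
            entry.setdefault(key, value)
        if len(entry) == 3:
            stamp = action_stamp(entry["timestamp"], entry["committer"])
            pairs.append((entry["revno"], stamp))
            entry = {}
    return pairs
```

Feeding it the text of bzr log --levels 0 and printing each pair
tab-separated would give a file shaped like the MAP excerpt above.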

The point of having two maps rather than one is this: at some point
I'm going to mechanically compile FOSSILS into a list of reposurgeon
commands.  For example, this:

ChangeLog:
        revno 108687 -> 2012-06-22T21:17:address@hidden

will become something like this:

=B & [ChangeLog] filter --replace /\brevno 108687\b/2012-06-22T21:17:address@hidden/

That command translates into English as: Over the set of all blobs in
the history with paths containing the string 'ChangeLog', replace
'revno 108687' (preceded and followed by breaking characters) by its
corresponding action stamp.
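
The mechanical compilation step could be sketched roughly as follows in
Python. This is an illustration only, not the real tooling. It assumes
the FOSSILS layout is exactly what the example shows (an unindented
"Name:" section header followed by indented "token -> stamp" lines), and
that tokens contain no slashes or regex metacharacters. The function
name compile_fossils is hypothetical.

```python
def compile_fossils(text):
    """Turn FOSSILS-style sections into reposurgeon filter commands.

    Assumed input layout (taken from the example in the message):

        ChangeLog:
                revno 108687 -> <action stamp>
    """
    commands = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw[0].isspace() and line.endswith(":"):
            # Unindented "Name:" starts a new section; the name becomes
            # the path-match string in the selection set.
            section = line[:-1]
        elif "->" in line and section is not None:
            token, stamp = (part.strip() for part in line.split("->", 1))
            commands.append(
                f"=B & [{section}] filter --replace /\\b{token}\\b/{stamp}/"
            )
    return commands
```

Because only the tokens actually listed in FOSSILS produce a command,
the generated script stays short, which is the reason for keeping
FOSSILS separate from MAP.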

I could, in theory, generate a humongous and guaranteed-exhaustive set
of these commands directly from MAP. If I did that, though, the
conversion day script might take many hours to run, most of that
spent on generated commands that are no-ops. There could also be
unhappiness related to revision numbers short enough to false-match
numeric tokens that are nothing of the kind.

Instead, FOSSILS both drives and documents the minimum set of changes
required.  The cost is that I have to maintain the list of source
tokens to be replaced partly by hand.  This is normal and acceptable;
I often deal with similar issues in Subversion repositories.

> It sounds like the scripts or methods you are using to find such
> references are not catching some of them.  E.g., bare numbers, without
> any leading "r" or "revno:" etc. are mostly (or maybe completely)
> missing.

Looking at bzrlog2map, I see you're right.  One of my to-do items
was to add to it a scanner that would turn up likely reference-string
candidates.  I forgot I hadn't actually done that yet.
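
A scanner of that kind might look something like the Python sketch
below. It is a guess at the idea, not an existing script. It looks for
prefixed forms ("r103083", "revno 103083", "revno: 103083") and for bare
numbers, and keeps only those that are real revnos according to a set
loaded from MAP. Bare numbers shorter than min_digits are skipped to cut
down on false matches such as years; that threshold and the function
name are assumptions.

```python
import re

# An optional "r" / "revno" / "revno:" prefix, then a revno that may be
# dotted (for example 115426.2.11).
CANDIDATE = re.compile(r"\b(revno:?\s*|r)?(\d+(?:\.\d+)*)\b")


def find_candidates(text, known_revnos, min_digits=5):
    """Return (line number, matched text, kind) for likely revno references."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in CANDIDATE.finditer(line):
            prefix, number = m.group(1), m.group(2)
            if number not in known_revnos:
                continue
            # Short bare numbers are too likely to be years, sizes or bug ids.
            if prefix is None and len(number.split(".")[0]) < min_digits:
                continue
            kind = "prefixed" if prefix else "bare"
            hits.append((lineno, m.group(0), kind))
    return hits
```

The output is only a list of candidates for a human to review before
anything goes into FOSSILS; bare numbers in particular still need an
eyeball check.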

> Given this quality, I once again question the need for all this work.

That is incoherent.  Whether the work is needed has *nothing* to do with
whether it is well implemented yet.

> If we cannot guarantee coverage very close to 100%, what would be the
> value of such a partial conversion?

Exactly proportional to the coverage, of course.  Every single
reference that is easily chased by human eyeball or indexing tool
(e.g. *not* a cookie that is meaningless because its context is gone)
increases the utility of the conversion.  Complete transparency of
reference is best; more is better than less; partial is better than
none.

The history is too messy for us to get 100% coverage (too many
external CVS references), but that is not an argument that we should
settle for zero.

>                                    More importantly, do we have
> reasonably effective methods of QA for the results?  The omissions I
> discovered are based on simple bzr commands followed by manual
> inspection (to avoid quite a few false positives); unless we can come
> up with better ways that don't involve manual labor, the overall
> quality will not be high enough, as manual labor is inherently error
> prone.

This is why I explained my workflow.  Once a reference has been identified
and put in FOSSILS, none of the remaining steps are vulnerable to human
error.  (My scripts could have bugs, of course.  But they're not very
complex, so we can have reasonably high confidence in them.)

> Btw, what about references to repositories of other projects?  Here's
> one example (from trunk):
> 
>     revno: 110764.1.388
>     committer: Bastien Guerry <address@hidden>
>     branch nick: emacs-24
>     timestamp: Tue 2013-01-08 19:49:37 +0100
>     message:
>       Merge Org up to commit 4cac75153.  Some ChangeLog formatting fixes.
> 
> Are we going to replace the git sha1 here by something more universal?

No, because there is no notation and no resolution protocol for such
references.  If there were such a thing, I would be right on top of using it.  
Actually, if there were such a thing, it would more than likely have been my
invention to begin with...

> If so, there's much more work around the corner; if not, why does it
> make sense to insist on doing that for Emacs's own branches?

Because that *can* be done, and every successful internalization adds utility
by (a) removing an impediment to browsing and (b) documenting a causal link.

> See above: this is just the tip of the iceberg.  I think you will find
> much more of such references, with Org, CEDET, MH-E, and Gnus being
> the most frequent ones.  Doesn't leaving those out of this conversion
> undermine the goal?

Yes, of course it does.  Don't let the unachievable perfect be the enemy of the 
achievable good!

(Damn, now you've started me thinking about prefixing action stamps
with name lookups to a registry of repositories.  If I invent a
practical solution to this it's going to be partly your fault...)

> I thought a "changeset" was well defined in the context of a VCS.

In modern VCSes like Bazaar, hg, and git, yes, it is a well-defined
concept.  This conversion creates confusing cases for two reasons.  One
is the vagaries of CVS; the other is ChangeLog entries, which carry
some of the semantic freight of VCS changesets without having the
atomicity and time-locality properties that they automatically have when
the VCS actually implements them.

The result is that one Emacs/Zaretskii "changeset" usually corresponds
to one modern VCS changeset, but not always. When the correspondence breaks
down, one Emacs/Zaretskii "changeset" maps to two or more VCS changesets,
one of which is likely to be a ChangeLog entry that is semantically
bound to the others but is recorded as a singleton changeset that the
VCS doesn't know is connected to them.

> My definition is a set of changes made as part of working on a single
> isolated issue.  IOW, what would have constituted a single indivisible
> commit with our current procedures.

The Bazaar portion of the history isn't the problem, the CVS part is.
There are many instances in the CVS part of the Emacs history that
look something like this:

1. Eli changes file A and commits it
2. Eli changes file B and commits it with an identical change comment.
3. Eric changes file C and commits it
4. Eli commits a ChangeLog entry describing the A and B changes
5. Eric commits a ChangeLog entry describing the C changes

In your terms, there are two changesets here: {1,2,4} and {3,5}.
But when parsecvs runs, the result will probably look like this:

Changeset 1 - {1,2}
Changeset 2 - {3}
Changeset 3 - {4}
Changeset 4 - {5}

Changesets 1 and 3 don't get joined because the intervening commit 
prevented parsecvs from recognizing that they should be coalesced.

(Actually the behavior is a little better than this: parsecvs did
coalescence by branch, so if commit 3 is on a different branch than
1 and 2 the right thing will happen.)
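
The five-step example can be played through with a toy model in Python.
This is a deliberately simplified stand-in, not parsecvs's actual
algorithm: it merges a commit into the most recent group on the same
branch only when author and comment match and the timestamps fall within
a window. The times, comments, and the 300-second window are invented,
and commit 4 is given the same comment as 1 and 2 purely to show that
adjacency alone blocks the join.

```python
from collections import namedtuple

Commit = namedtuple("Commit", "ident author comment time branch")


def coalesce(commits, window=300):
    """Group commits per branch by author, comment and time proximity."""
    groups = []
    latest = {}  # branch -> most recent group started on that branch
    for c in sorted(commits, key=lambda c: c.time):
        group = latest.get(c.branch)
        if group is not None:
            prev = group[-1]
            same_change = (prev.author, prev.comment) == (c.author, c.comment)
            if same_change and c.time - prev.time <= window:
                group.append(c)
                continue
        group = [c]
        groups.append(group)
        latest[c.branch] = group
    return [[c.ident for c in g] for g in groups]


def example(branch_of_eric="trunk"):
    """The five commits from the list above, with made-up times in seconds."""
    return [
        Commit(1, "Eli", "Fix A and B", 0, "trunk"),
        Commit(2, "Eli", "Fix A and B", 60, "trunk"),
        Commit(3, "Eric", "Fix C", 120, branch_of_eric),
        Commit(4, "Eli", "Fix A and B", 180, "trunk"),
        Commit(5, "Eric", "Fix C", 240, branch_of_eric),
    ]
```

With everything on one branch, coalesce(example()) yields the four
groups listed above. Putting Eric's commits on another branch lets 4
join 1 and 2. Shrinking the window below the gap between 1 and 2 splits
even those apart, which models the clock-skew defect described next.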

Here's where the vagaries of CVS come in. For various stupid random
CVS-is-brain-damaged reasons there may have been enough skew between
the recorded commit times of 1 and 2 that *they* don't get coalesced,
even though that's what notional-Eli intended.

*That* kind of defect (eligible commits that didn't fit inside too
small a time window) is what reposurgeon was originally designed to 
fix.  These are very, *very* common in crappy CVS lifts, and reposurgeon
can fix them automatically.

There is another case common in the Emacs history that can be
coalesced.  That is: a file modification immediately followed by a
ChangeLog change describing it - but with an empty change comment on
the ChangeLog change, which parsecvs refuses to consider matching to
anything else.  These do have to be fixed up by hand.  I haven't tried
yet.

> From a cursory look I had at the current git mirror, no coalescing was
> done there.  But perhaps I'm missing something; Andreas, can you
> please comment on this?

Look for commits that predate the Bazaar transition but change multiple
files. You'll find parsecvs made those.

> Can we take a real-life use case, please?  Please show the cliques
> produced by your analysis in this range of bzr revisions on the trunk:
> 39997..40058.  You can see the details with these bzr commands:
> 
>  . This will show a 1-line summary for every revision in the range:
> 
>      bzr log --line -r39997..40058
> 
>  . This will show the full commit messages and other meta-data of a
>    single revision, 40000 in the example (can also be used with a
>    range -rNNN..MMM):
> 
>      bzr log --long --show-ids -c40000
> 
>  . This will show the files modified/added/deleted by a single
>    revision (can also be used with a range -rNNN..MMM):
> 
>      bzr status -c40000
> 
> The above range of revisions shows a typical routine of commits when
> Emacs was using CVS; in particular, "*** empty log message ***" are
> most probably ChangeLog commits which usually followed commits of the
> files whose log entries are in the ChangeLog change.  Note that the
> commit messages are almost always different (they are actually the
> ChangeLog entries for the files being committed), although the changes
> belong to the same changeset.  Also note how commits by different
> people working on separate changesets sometimes overlap, as in
> revisions 40033..40038.
> 
> How will these be handled during your proposed conversion?  And what
> will be the commit messages of the coalesced commits?

I think the example I showed above explains most of this.  I'd have to grovel
through all the timestamps to find out if automatic coalescence would catch
any of the cliques in your span, but I can say that (for example) this:

40050: Miles Bader 2001-10-19 *** empty log message ***
40049: Miles Bader 2001-10-19 Exit if we can't find some variable.

looks like something the "lint" command in reposurgeon would catch. I
would then eyeball it to check that 40050 is the changelog tweak 
describing 40049 and write something like this into the lift script:

<40049>..<40050> squash --pushback

The effect would be to merge 40050's ChangeLog fileop into 40049, which
would keep its comment.  The children and parents of the squashed
commit would be what you think.
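
As a toy model of that effect (not reposurgeon's implementation), the
Python sketch below folds a child commit into its parent: the parent
keeps its comment, gains the child's fileops, and the child's children
are reparented. The dict layout, the fileop strings, and the function
name are all hypothetical.

```python
def squash_pushback(commits, keep, drop):
    """Fold commit `drop` into its parent `keep`, in place.

    `commits` maps an id to a dict with 'parents', 'comment' and
    'fileops'. The surviving commit keeps its own comment.
    """
    target, victim = commits[keep], commits[drop]
    if victim["parents"] != [keep]:
        raise ValueError("drop must be a single-parent child of keep")
    # The parent absorbs the child's file operations.
    target["fileops"] = target["fileops"] + victim["fileops"]
    # Anything that pointed at the dropped commit now points at the parent.
    for other in commits.values():
        other["parents"] = [keep if p == drop else p for p in other["parents"]]
    del commits[drop]
    return commits
```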

And yes, <40049> would be a legal commit reference in reposurgeon.  Provided
I did this first:

read fossils <MAP

which is the other use of the MAP file I described previously.

> > In a properly done conversion, file ignores don't abruptly stop working
> > because you browsed back past the point of conversion and what should
> > be .gitignore files are now .bzrignores or .cvsignores.
> 
> So you will be adding .gitignore to revisions where there was none?
> If not, how do you plan on attacking this issue?

By converting .bzrignore files in place to .gitignores.
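
The content side of such a conversion might be sketched like this in
Python. It is an illustration under assumptions, not a description of
what the conversion tooling does. It relies on the Bazaar pattern
conventions that a "./" prefix anchors a pattern at the tree root and an
"RE:" prefix marks a regular expression. Plain glob lines are passed
through unchanged, and regex lines, which have no gitignore equivalent,
are commented out for manual review. The function name is hypothetical.

```python
def bzrignore_to_gitignore(text):
    """Translate .bzrignore content into approximate .gitignore content."""
    out = []
    for line in text.splitlines():
        if line.startswith("RE:"):
            # gitignore has no regex patterns; flag the line for a human.
            out.append("# UNTRANSLATED " + line)
        elif line.startswith("./"):
            # Root-anchored in bzr; a leading slash does the same in git.
            out.append("/" + line[2:])
        else:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")
```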

> If you really want to build confidence in your methods and tools, some
> kind of statistics about the conversion jobs done using them, and the
> time passed since the conversion would probably be a good start.

I can tell you the most important statistics.  For three years of 
doing conversions on projects including GPSD, NUT, Hercules, Roundup,
Battle For Wesnoth, robotfindskitten, groff, and several others,
I can tell you three numbers:

1. Time passed since conversion: tops out at 3 years for GPSD, about 2
years each for NUT and Hercules.

2. Number of defects I found myself after delivering a final
conversion: three. (All in Battle For Wesnoth. Two CVS usernames
didn't get properly mapped to git-style IDs because the attribution
file I was using at conversion time was incomplete.)

3. Number of defects subsequently reported by project dev groups:
zero.  Yes, *zero*.

One of the dev groups (Roundup, for which I did SVN->git) later moved
to hg for political reasons.  Otherwise those repositories are still
in active use by multiple developers, and have been for a cumulative
hundreds of thousands of hours.

I won't represent that I think none of my finished conversions has
ever had an error; that would be highly unlikely.  What is true is
that any errors they had were so minor that nobody has thought
it was worth bugging me about them.

As a matter of history, GPSD and Hercules were early test conversions.
NUT (Network UPS tools) was reposurgeon's trial by fire; I went into
that with a usable beta-grade tool, came out of it with something good
enough that the much bigger and nastier Blender conversion could be
done by *people who weren't me*.

By the time I did groff, late last year, my tools and procedures for
normal cases were pretty well routinized and bulletproofed.  You can
read about them here:

DVCS Migration HOWTO: http://www.catb.org/esr/dvcs-migration-guide.html

There's even a makefile that semi-automates the conversion steps.

That said, Emacs is a bit abnormal.  The kind of case I'm used to
handling is a Subversion repo with a fossil layer of CVS, having on the
close order of a decade of history and a commit count in the 3K-30K
range (this describes GPSD, NUT, Hercules, Roundup, BfW).  

The Emacs history is significantly longer and a bit cruftier than
these, and I've never dealt with a layer of Bazaar before.  These
differences do complicate things a bit (I don't normally have to write
custom scripts) but not unmanageably so.

> (Yes, time since conversion is important because the problems are
> usually subtle and don't stick out until much later.)  Detailed
> description of the planned steps during the conversion and how you
> intend to control the quality of each step, will also be appreciated.

I'm enclosing a current copy of the lift script.  I'll add more steps
as I verify them.

As for how I intend to QA them - my strategy has two prongs.  One is
automating everything I can so that I have conditional guarantees of
the form "if tool X is correct, then my results are correct".

The other: historically, I've usually worked in collaboration with a
Mr. Inside, a senior project dev, who checked my work in progress from
a position of intimate knowledge of the project history.

Congratulations, I think you've elected yourself for that job. The
reposurgeon manual is here:

http://www.catb.org/~esr/reposurgeon/reposurgeon.html

> This is great, but doesn't really address the worrisome aspects of the
> conversion we care about.  We no longer care about the elpa branch in
> the bzr repository.  We do care about the few other branches, such as
> emacs-24.  And it is not even clear what will become of those after
> the conversion; the reposurgeon man page cites a limitation related to
> that, allegedly stemming from some (imaginary) bzr confusion between
> branches and repositories, but ends up saying nothing about the
> branches after the conversion.  Will they end up in a single git
> repository, like any other git branches, or won't they?  Will the
> merges between those branches show up as expected in git DAG?  How
> will merges from external branches (such as Org or MH-E) or from local
> feature branches be represented?  Those are much more important issues
> than the ability to split elpa.

You get to tell me what you want to have happen, Mr. Inside.  If
reposurgeon isn't powerful enough to do it, I'll up-gun it until it
is.

Preliminary answer: the git repo after conversion day will, globally
speaking, have the same DAG that the git mirror did before.  Changes will
be localized and consist of (a) commit-clique squashes, and (b) a
few junk branch deletions.

Bazaar's very real branch/repo confusion is probably not relevant,
because my conversion procedure never deals with the Bazaar repository
directly. I start from Andreas's git mirror, which is (presumably)
replicating the branch structure of the entire Bazaar repo every 15
minutes.

If that isn't true, we have some additional problems to solve that
have nothing to do with my tools.
-- 
                <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
emacs.lift
Description: Text document