Re: Goals for repo conversion day

From: Eli Zaretskii
Subject: Re: Goals for repo conversion day
Date: Sun, 26 Jan 2014 19:32:52 +0200

> Date: Sat, 25 Jan 2014 16:01:32 -0500
> From: "Eric S. Raymond" <address@hidden>
> Cc: address@hidden, address@hidden
> But as the size and complexity of the repo goes up, so does the value
> of in-band references actually working.  Emacs is an exceptionally *bad*
> case for relying solely on an external reference map, not an exceptionally
> good one.

There's no argument about the higher value of having all the
references resolved.  What I fear is the inordinate amount of work
that might require, for too little benefit, and the unintended
consequences of a too-deep surgery on the history that will be needed.
Already the effort to get the list of references right devoured many
messages (and I'm sure each message caused a non-trivial amount of
work), and we are still not there (see below).  In particular, it
worries me that you seem to be unable to extract the full list of bzr
revno references, after so many attempts.  Why this doesn't worry you,
and why you still refuse to accept that maybe, just maybe, this is a
lot of effort for a relatively small gain, is beyond me.  If this is
in any way indicative of the other problematic issues of the
conversion, then "Houston, we have a problem", indeed.

> > I'd appreciate if you posted the final list of the references, when
> > you are finished with it, so we could have some QA.
> Here is the current list. It is not final because I expect to resolve
> at least a few more  of these, and it is still possible more fossil
> references could turn up in odd places.

> ChangeLog:
> [...]

I found that at least these ones are missing:

  lisp/ChangeLog.15 references 103083
  lisp/ChangeLog.16 references 103471 and 107149
  src/ChangeLog.12 references 104015 and 103913

> Change comments:
> [...]

This list of 40 references in the commit messages to bzr revisions is
definitely incomplete.  It misses many references (I counted more than
300 overall, including those you show).  Here are just a few that you
missed, and only from the trunk branch:

  r116131 references 116113
  r116056 references 116055
  r115997 references 115992
  r115964 references 115961
  r115920 references 115918
  r115859 references 115838
  r115029 references 112851
  r114978 references 114965
  r114798 references 114795
  r112011 references 112010
  r106733.1.27 references 111919
  r110764.1.510 references 111040
  r110764.1.338 references 111367 and 111368
  r110879 references 110857 (from emacs-24 branch)
  r110306 references 110305
  r99375 references 99362

It sounds like the scripts or methods you are using to find such
references are not catching some of them.  E.g., bare numbers, without
any leading "r" or "revno:" etc. are mostly (or maybe completely)
missed.
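
To make the point about bare numbers concrete, here is a minimal sketch of
a scan that treats a number as a candidate reference whether or not it has
a prefix.  The pattern, the 5-to-6-digit length limit, and the function
name are assumptions made for illustration; this is not the script used in
the conversion, and its output would still need manual inspection to weed
out false positives.

```python
import re

# Hypothetical heuristic: an optional "revno:", "revision" or "r" prefix,
# then a 5- or 6-digit number, optionally a dotted revno like 110764.1.338.
# The lookbehind skips numbers glued to a word, a dot, or "#" (bug numbers).
REVNO_RE = re.compile(
    r"(?<![\w.#])(?:revno:?\s*|revision\s+|r)?"
    r"(\d{5,6}(?:\.\d+\.\d+)?)\b(?!\.\d)",
    re.IGNORECASE,
)


def find_revno_candidates(message, own_revno=None):
    """Return candidate revno references in a commit message, in order.

    own_revno, if given, is the revno of the commit itself and is skipped.
    """
    found = []
    for match in REVNO_RE.finditer(message):
        revno = match.group(1)
        if revno != own_revno and revno not in found:
            found.append(revno)
    return found


if __name__ == "__main__":
    for text in (
        "Fix bug introduced in r116113.",
        "Revert 2013-12-25 change (116055); see Bug#16000",
        "Merge from emacs-24; up to r110764.1.338",
    ):
        print(text, "->", find_revno_candidates(text))
```

The commit messages themselves could come from "bzr log --long"; every hit
is only a candidate and has to be checked by hand.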

Given this quality, I once again question the need for all this work.
If we cannot guarantee coverage very close to 100%, what would be the
value of such a partial conversion?  More importantly, do we have
reasonably effective methods of QA for the results?  The omissions I
discovered are based on simple bzr commands followed by manual
inspection (to avoid quite a few false positives); unless we can come
up with better ways that don't involve manual labor, the overall
quality will not be high enough, as manual labor is inherently
error-prone.

Btw, what about references to repositories of other projects?  Here's
one example (from trunk):

    revno: 110764.1.388
    committer: Bastien Guerry <address@hidden>
    branch nick: emacs-24
    timestamp: Tue 2013-01-08 19:49:37 +0100
      Merge Org up to commit 4cac75153.  Some ChangeLog formatting fixes.

Are we going to replace the git sha1 here by something more universal?
If so, there's much more work around the corner; if not, why does it
make sense to insist on doing that for Emacs's own branches?

> Some of the remaining CVS references cannot be resolved within the Emacs
> history; they actually point to other projects.  One particularly fertile
> source of these, which I think accounts for this group
>         1.85
>         1.878
>         1.113
>         1.244
>         1.34
>         1.233
> in ChangeLogs, is the CVS history of the erc files before they were merged
> into Emacs.

See above: this is just the tip of the iceberg.  I think you will find
much more of such references, with Org, CEDET, MH-E, and Gnus being
the most frequent ones.  Doesn't leaving those out of this conversion
undermine the goal?

> > The problem is not the size of the repository alone.  The problem is
> > that different portions of a single changeset were committed many
> > revisions apart.  And I don't even understand (and you didn't explain)
> > how will you handle the situation I described above, where a single
> > commit checked in ChangeLog changes for several unrelated commits in
> > the same directory.  Which commit clique will you assign the ChangeLog
> > commit to?  The devil is in the details, but you haven't provided any
> > details about your plans in this matter.  Would you please do that?
> I see we are using the term "changeset" slightly differently, and this has
> produced some confusion.
> The uncoalesced changesets I am looking for are not defined by "all
> share the same ChangeLog entry" (though usually that is the case).
> You are quite right that attempting to coalesce all of those would
> produce perverse results in cases of several unrelated commits.
> Fortunately, most of the unresolved cliques are not like this.  The
> usual case, in this conversion as in others I've seen (such as groff)
> is that an unresolved clique consists of one or several closely
> related changes and one ChangeLog modification, without intervening
> commits by others.  This is what I think of as a changeset.

I thought a "changeset" was well defined in the context of a VCS.  My
definition is a set of changes made as part of working on a single
isolated issue.  IOW, what would have constituted a single indivisible
commit with our current procedures.

Your definition sounds subtly different, and you didn't define
"closely related changes", so it's hard to judge its exact meaning.
As for "one ChangeLog modification" and "without intervening commits",
see below.

> Normally tools such as parsecvs collect these into single changesets.  
> But these converters have a maximum coalescence window.  If such a span
> of commits took place over a longer period of time than the window, it
> won't be coalesced. 
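
For readers unfamiliar with the idea, a coalescence window can be sketched
as follows.  This is a simplified illustration, not the algorithm of
parsecvs or reposurgeon: the Commit record, the coalesce function, and the
rule "same author, no intervening commit by someone else, gap no longer
than the window" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    revno: int
    author: str
    timestamp: int  # seconds since the epoch
    message: str = ""


def coalesce(commits, window):
    """Group commits into cliques (lists), oldest first.

    A commit joins the current clique only if it has the same author as
    the clique's last commit and follows it by at most `window` seconds.
    A commit by anyone else closes the clique.
    """
    cliques = []
    for commit in sorted(commits, key=lambda c: c.timestamp):
        last = cliques[-1] if cliques else None
        if (last
                and last[-1].author == commit.author
                and commit.timestamp - last[-1].timestamp <= window):
            last.append(commit)
        else:
            cliques.append([commit])
    return cliques


if __name__ == "__main__":
    history = [
        Commit(1, "a", 0), Commit(2, "a", 100), Commit(3, "b", 150),
        Commit(4, "a", 200), Commit(5, "a", 10000),
    ]
    for window in (300, 20000):
        print(window, [[c.revno for c in q] for q in coalesce(history, window)])
```

With a short window, revisions 4 and 5 stay separate; with a very long
one they merge, which is where human judgment has to decide whether the
two really belong to one change.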

From a cursory look I had at the current git mirror, no coalescing was
done there.  But perhaps I'm missing something; Andreas, can you
please comment on this?

> When there is CVS in the history, a standard part of my cleanup is
> basically to run a coalescence pass with a very long window.
> Semi-automating this operation, so it (a) doesn't have to be done
> manually, but (b) is easily checked by skilled human judgment, was
> one of the purposes for which I originally wrote reposurgeon.
> Fortunately the bad cases aren't actually very common.

Can we take a real-life use case, please?  Please show the cliques
produced by your analysis in this range of bzr revisions on the trunk:
39997..40058.  You can see the details with these bzr commands:

 . This will show a 1-line summary for every revision in the range:

     bzr log --line -r39997..40058

 . This will show the full commit messages and other meta-data of a
   single revision, 40000 in the example (can also be used with a
   range -rNNN..MMM):

     bzr log --long --show-ids -c40000

 . This will show the files modified/added/deleted by a single
   revision (can also be used with a range -rNNN..MMM):

     bzr status -c40000

The above range of revisions shows a typical routine of commits when
Emacs was using CVS; in particular, "*** empty log message ***" are
most probably ChangeLog commits which usually followed commits of the
files whose log entries are in the ChangeLog change.  Note that the
commit messages are almost always different (they are actually the
ChangeLog entries for the files being committed), although the changes
belong to the same changeset.  Also note how commits by different
people working on separate changesets sometimes overlap, as in
revisions 40033..40038.

How will these be handled during your proposed conversion?  And what
will be the commit messages of the coalesced commits?

> > > > > 5. Unconverted .bzrignores (and possibly .cvsignores) in the history.
> > > > 
> > > > Why is that a problem?
> > > 
> > > See "seamless history browsing".
> > 
> > Sorry, I don't understand.  Please elaborate: what is the relation
> > between these ignore files and history browsing?
> In a properly done conversion, file ignores don't abruptly stop working
> because you browsed back past the point of conversion and what should
> be .gitignore files are now .bzrignores or .cvsignores.

So you will be adding .gitignore to revisions where there was none?
If not, how do you plan on attacking this issue?

> > > The way this is working is that I am building a reposurgeon script that
> > > expresses a sequence of edits to Andreas's mirror. On conversion day 
> > > we will apply that script once, after which everyone can re-clone and
> > > go on as before.
> > 
> > Sorry, I don't see how this changes anything.  You are still going to
> > make deep changes to the existing mirror.
> Yes, for arguable values of "deep". As Paul Eggert (I think) said, I'm
> after a result that is stainless steel rather than earthenware. With
> ugly cracks in it.

I have my doubts about the "stainless steel" part, sorry.
Unfortunately, nothing you've said so far contributes to my confidence
in the outcome.  And if the outcome will be more like "earthenware" than
"stainless steel", then we might as well continue using what we have
now in the existing mirror.

> > > > Noble goals all of them, but I'm skeptical as to whether they can be
> > > > achieved in practice.  What's worse, we won't know whether some issues
> > > > remained until much later.
> > > 
> > > I know they can be achieved in practice because I have achieved them
> > > before, many times.  Most recently in the conversion of the groff
> > > history, but you could check with the maintainers of NUT or Hercules
> > > or robotfindskitten or Roundup as well. Or the Blender Foundation -
> > > blender is a big reposurgeon conversion done by someone else.
> > 
> > Sorry, been there done that.  The CVS to bzr conversion also seemed
> > flawless until much later.
> There are several differences this time.  One of the most important is that
> the state of the art has advanced.  My tools do things that would have been
> impossible or impractical before they existed.  I have auditing capabilities
> you would probably have to work a bit to even imagine.

The important question here is "is your best good enough?"  I have
absolutely no idea what the answer to that question is, and frankly,
your way of promoting your tools and techniques doesn't help at all.
Neither do the apparent deficiencies in identifying revision
references shown above.

If you really want to build confidence in your methods and tools, some
kind of statistics about the conversion jobs done using them, and the
time passed since the conversion would probably be a good start.
(Yes, time since conversion is important because the problems are
usually subtle and don't stick out until much later.)  Detailed
description of the planned steps during the conversion and how you
intend to control the quality of each step, will also be appreciated.

> As a relatively trivial example - if Stefan or some other person with
> policy authority makes the call, I could reliably split elpa out into
> its own repo with one short command in the reposurgeon DSL.

This is great, but doesn't really address the worrisome aspects of the
conversion we care about.  We no longer care about the elpa branch in
the bzr repository.  We do care about the few other branches, such as
emacs-24.  And it is not even clear what will become of those after
the conversion; the reposurgeon man page cites a limitation related to
that, allegedly stemming from some (imaginary) bzr confusion between
branches and repositories, but ends up saying nothing about the
branches after the conversion.  Will they end up in a single git
repository, like any other git branches, or won't they?  Will the
merges between those branches show up as expected in git DAG?  How
will merges from external branches (such as Org or MH-E) or from local
feature branches be represented?  Those are much more important issues
than the ability to split elpa.

> > > If we find any problems afterwards, I have the tools to fix them. Part of
> > > my commitment is to do that.
> > 
> > I don't think any of us can in good faith give such promises.
> The span of my contributions to Emacs is measured in decades.  I do not 
> think you need to fear that I will vanish before this job is done.

I was talking about the "problems afterwards" part.  I don't question
your intentions, but life is not an entirely predictable endeavor.
Perhaps you have a way to tell the future, but in that case, I may
wish to hire you to help me with my stock investments.
