Re: [Gnu-arch-users] Some issues


From: Tom Lord
Subject: Re: [Gnu-arch-users] Some issues
Date: Tue, 15 Jun 2004 12:58:47 -0700 (PDT)


    > From: Florian Weimer <address@hidden>

You asked a question about a project in which there are 1200 commits
per month.  Won't the patch log grow by 1200 entries per month?
Won't the tree therefore become huge?  Can it be safely pruned, and
if so, how?

I have a detailed answer for you below.  I've designed the
infrastructure for a busy, large software project that wants 1200
commits per month.  My solution preserves complete history and
sacrifices no merging capabilities.  For developers, it feels almost
identical to using a centralized CVS set-up but offers some
improvements over that.  The patch log will grow, with my solution, at
the rate of tens of messages per month and will have a fixed-size pool
of O(1200) messages that summarize the very latest part of the
development.

Before I explain the answer, I want to say some general things about
how I built it and what that implies for the future of arch.  People
might gain further insight into my recent "FEATURE PLAN" posts from
this.

I designed the solution using a couple of "devices".  What is a
device?  It's a coherent part of a larger deployment of arch.  For
example, "a public mirror" is an example of a device, a "patch queue
manager" is a device, a "submission branch" is a device.  You build a
whole infrastructure for a project by combining devices and by stating
rules for how additional devices can be added.  You might use some
devices to build the archive(s) for your mainline; you might just
describe the other devices an outside programmer should use if they
want to prepare a contribution.

Devices are interesting because they have definite, objective
properties that determine their costs and performance characteristics.
For example, if I know the size of a project and the commit rate and
the average size of commits, then I can predict how large a "public
mirror" will be, how much I/O bandwidth it will need, what the load on
the server will be.   If I can answer questions like that for all of
the devices in my overall design, then I can answer the same questions
for the overall design itself.
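The sort of back-of-the-envelope prediction described above can be
written down as a tiny model.  Everything here -- the function names,
the linear growth assumption, the sample figures -- is illustrative,
not a property of arch itself:

```python
# Rough capacity model for a "public mirror" device.  The linear
# growth assumption and all figures are illustrative examples.

def mirror_growth_kb_per_month(commit_rate, avg_changeset_kb):
    """Archive growth in KB/month: one stored changeset per commit."""
    return commit_rate * avg_changeset_kb

def mirror_traffic_kb_per_month(commit_rate, avg_changeset_kb, readers):
    """KB/month served if every reader fetches every new changeset."""
    return commit_rate * avg_changeset_kb * readers

growth = mirror_growth_kb_per_month(1200, 20)        # ~20 KB per changeset
traffic = mirror_traffic_kb_per_month(1200, 20, 50)  # 50 trees tracking it
print(growth)    # 24000 KB/month, i.e. about 24 MB
print(traffic)   # 1200000 KB/month, i.e. about 1.2 GB
```

Answer the same questions for every device in a design and the totals
for the whole deployment follow by simple addition.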

In principle, we can get to a point where we have a catalog of
devices, tools to help implement them, knowledge of their costs and
properties in isolation and combination....

In that state, we can listen to a problem statement (like "1200
commits/month"), design a communicable solution on paper whose
construction plan then automatically follows, and report the expected
costs of operation of that solution.

In other words, we can take all the guesswork out of deploying arch
and create a set of patterns, each of which solves a known kind of
problem.  To design an infrastructure, pick the patterns/devices that
address your needs, find a good way to resolve their various
contextual requirements, and then add up what it'll cost you to build
and run.


    > What does [log pruning] mean for tla's merging capabilities?
    > Will users notice that the logs suddenly went away?  Could you
    > do that monthly, too, without any negative impact?

    > >> I'd love to look at a project which uses tla, hasn't got a designated
    > >> patch integrator, and has a significant changeset creation rate.

    > > Define "significant".

    > More than 1200 changesets per month.

There are projects where that kind of rate makes sense.  GCC or the
kernel is large enough and modular in such a way that people can
concurrently hack different parts of it without interfering with one
another too badly.  They can always put the brakes on if things get
too tangled up.  1200 commits/month is high but I can imagine higher
rates, too.

1200 commits/month is a fun scale: if you check out a tree from the
project, it is likely to be out of date in less than an hour.  If the
commits are clustered in time (e.g., mostly happen during business
hours), then a tree will be out of date in minutes.
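The arithmetic behind those claims is simple averaging (assuming a
30-day month, and roughly 22 business days of 8 hours each for the
clustered case):

```python
# Mean time between commits at 1200 commits/month.
commits_per_month = 1200

# Spread evenly over a 30-day month:
minutes_per_month = 30 * 24 * 60
print(minutes_per_month / commits_per_month)    # 36.0 minutes per commit

# Clustered into ~22 business days of 8 hours each:
business_minutes = 22 * 8 * 60
print(business_minutes / commits_per_month)     # 8.8 minutes per commit
```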

Since it is nearly impossible for any programmer to have an up-to-date
tree, two things follow:

  0) The revision control infrastructure must permit programmers
     to commit from a non-up-to-date tree.

  1) You would think that after you commit a tree, creating revision
     R, your working directory is a faithful copy of
     exactly what revision R looks like.

     But that can not be so if the commit rate is this high because,
     odds are, when R is committed, some parts of the working directory
     are out of date.  So after the commit, the working directory will
     be a random tree, partly out of date yet already containing the
     changes found in revision R itself.  In other words, not only
     isn't the working directory R, it isn't _any_ revision.

     And that's a fact of life that's inherent in 1200 commits/month.
     It's not arch-specific.

Already, before we get to the question of pruning patch logs, when
you're talking about this high a commit rate, you will have a lot to
think about to manage such change well.

What's the point of a 1200 commit/month version if you can essentially
never be up-to-date with it?  Presumably the point must have something
to do with continuous integration.  

I think we have to view the 1200 commits/month as _logically_ (not
necessarily literally) consisting of several branches, and the busy
1200 commit version as being the point at which the several branches
are continuously merged, semi-automatically.  

People and testing tools are rarely up-to-date with a busy integration
branch but are often _close_ to up-to-date.  So the integration branch
revisions get plenty of use and testing -- just never the very latest
one.  Instead, that use and testing happens concurrently with the
creation of the next few revisions.   That concurrency can represent
just a beneficial efficiency rather than an out-of-control project.

This screams out, in arch, for application of a patch queue manager.

A patch queue manager is a server-like process that builds a (literal)
integration branch by automatically merging in changes from other
designated branches.  Patch queue managers vary in features but are
generally trivial programs.  For the 1200-commit project we're
thinking about, I think you would want an email-driven pqm.

Each committer in your project or small group of committers should
have their own archive and branch of your busy mainline.   Generally,
they always commit there.

Their commit hook should send email to the patch queue manager.

The patch queue manager will (normally) be the sole committer to your
mainline.  When it receives mail about a commit on a branch, it will
simply star-merge that into the mainline and, if there are no
conflicts, commit.  If there are conflicts, it will send mail back to
the committer alerting them.
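The decision rule is simple enough to sketch in a few lines.  This is
a toy model, not the real pqm program: the function name and message
shape are made up, and a real pqm would shell out to `tla star-merge`
and `tla commit` rather than receive a conflict flag:

```python
def handle_branch_commit(branch, merge_conflicted, committer_addr):
    """One pqm step: merge a branch commit into the mainline, or bounce it.

    Returns the action the pqm would take, as (verb, target, note).
    """
    if merge_conflicted:
        # star-merge hit conflicts: never commit, notify the committer.
        return ("mail", committer_addr,
                f"merge of {branch} conflicted with mainline; please sync and resubmit")
    # Clean merge: commit it to the mainline.
    return ("commit", "mainline", f"merged {branch}")

action = handle_branch_commit("jane--devo--1.0", False, "jane@example.com")
print(action[0])   # commit
```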

The day to day work of committers, in this scenario, is almost
identical to what they are used to.   They work this way:

        1. Create a new branch by tagging mainline

        2. Work normally on this branch, as if it 
           were mainline.

        3. Updating a branch working directory from the mainline
           (catching up to the integration branch) works
           in the usual way except that you might want to 
           follow it with a commit to the branch.

and that's that.  You could easily let your mainline run at many times
more than 1200 commits / month provided that changes rarely conflict.
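Spelled out as tla command lines, the workflow above might look like
this.  It is a sketch: the archive and version names are invented, and
exact option spellings should be checked against your tla version:

```python
# The committer workflow, as tla command lines.  All archive and
# version names below are hypothetical examples.
MAINLINE = "pqm@example.com--project/mainline--devo--1.0"

def start_branch(my_archive, my_version):
    """Step 1: create a personal branch by tagging the mainline."""
    return [
        f"tla tag {MAINLINE} {my_archive}/{my_version}",
        f"tla get {my_archive}/{my_version} wd",   # check out a working dir
    ]

def catch_up():
    """Step 3: sync a branch working dir with the mainline."""
    return [
        f"tla star-merge {MAINLINE}",              # pull mainline changes
        "tla commit -s 'merge from mainline'",     # record the merge on the branch
    ]

print(start_branch("jane@example.com--dev", "mainline--jane--1.0")[0])
```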

Think of all the problems that this solves:

  1. Most work is done on client machines.   Even though your
     project is very busy, the integration server will not
     be under heavy load.

  2. Brief interruptions of the integration server (such as a 
     network outage) will have no negative impact on client-side
     work.    Programmers can just keep working in their branches,
     stacking up submissions for the integration queue.

  3. Long interruptions of the integration server are easily 
     corrected by starting a new mainline from a recent mainline
     snapshot.  Contributor branches can be switched to the
     new server and work can continue normally.

  4. The system is fully securable.

     Every branch commit and pqm request can be cryptographically
     signed by a committer and verified by both the pqm and other
     committers.

     The revisions pqm itself creates can be cryptographically signed
     as a matter of convenience but since this entails giving a
     private key to a server process, we need to look more carefully
     at how to secure this.

     The beauty of the pqm's role in the proposed deployment is that
     it is a deterministic and repeatable role.   That is, the
     pqm is going to produce a record of what it claims to have merged
     and the order in which those merges took place.    From that
     record, the patch queue manager can be replayed by anyone --
     anyone can run those merges and get exactly the same results.
     The pqm can be finally and fully secured simply by, whenever
     there is doubt, double checking the results it has produced.

  5. If your project is distributed beyond just a local network,
     the system will encourage, in a natural way, that distributed
     backups and redundant copies of your project are scattered around
     the world.

But we're still left with your original issue, namely, what about the
patch log size.  In fact, we've made the issue worse now because each
commit has turned into two commits: one to the branch, then one when
merging to the mainline.  So although mainline will experience 1200
commits/month, it will pick up 2400 patch log entries / month.

Yes, you can prune the heck out of those patch logs, and the system of
pqm and branches I sketched above gives you a natural and effective
way to do it.

The first thing to notice is that in this system, the history stored
on branches is useful only until it is merged into the mainline, and
until the branch catches back up with that merge.  After that, the
branch history is entirely redundant, having been reproduced on the
mainline.   Therefore, it's reasonable to regard these committer
branches as transient and throw-away.

If you had a working directory with too many stray backup files and
object files and you knew that there was nothing else in there worth
saving, you might decide to get to a clean working directory just by
`rm -rf'ing that one and checking out a fresh one.

By analogy, once a week (say, first thing each Monday morning), each
committer branch can be replaced by a new one that contains just the
as-yet unmerged changes.  The old branch patch logs can be pruned from
the mainline.  Counting that up: instead of each branch _adding_ 1200
new log messages of its own to mainline every month, the mainline tree
will have a basically fixed number of branch log messages -- about
300.
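The arithmetic behind that ~300 figure, assuming the 1200 commits are
spread evenly over four weekly cycles:

```python
# Weekly cycling bounds the branch log messages visible in the
# mainline tree: only the current week's worth survives pruning.
commits_per_month = 1200
cycles_per_month = 4                 # branches replaced every Monday

steady_state_branch_logs = commits_per_month / cycles_per_month
print(steady_state_branch_logs)      # 300.0 -- fixed, not growing
```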

That still leaves the mainline log messages coming in at 1200/month.
We can't say what to do with or about those without having a clear
idea of what they are for and your problem statement is underspecified
in that way.

We know we want them for integration testing.    For example, those
1200 log messages document the atomic changes that define for us
the granularity of things like binary-hunt regression finding.

We can presume that we want these 1200 log messages as a kind of
"newspaper".  Nobody is likely to read all of them in depth but many
people are likely to skim the list, pick out a few of personal
interest, and study those.

So we have three uses for these 1200 logs:

        continuous integration
        regression hunting
        newspaper

All of those are transient uses.   An old newspaper is not likely to
be interesting.   When we hunt for regressions, it's likely to be 
within the most recent part of the integration branch.   Continuous
integration implies continuous merging with branches and so there is
little value to having history-sensitive merging of integration branch
revisions from a year ago.

Are there any long term uses of the 1200 messages other than archival
for that rare day when, to learn why the satellite blew up on the
launch pad, you have to regression hunt in a 3-year old integration
branch?

I doubt that there are any other long-term uses because these 1200
logs are horrible for human consumption, other than as a newspaper.
History may get written up in the papers shortly after it happens but
a real understanding of it requires its presentation in a more organized,
logically structured way.

It seems to me that, to be thorough, your project with a 1200
commit/month integration branch ought to take on an additional
sub-project, perhaps one it has already taken on in the form of NEWS
files or changelogs.   This other subtask is to also author the
official, narrative history of the development:  to explain the order
behind those 1200 commits by reducing it to a smaller number of
purposeful and self-contained tasks accomplished.

In other words, those 1200 messages are fine but perhaps your
programmers should write an additional 30 messages/month that sum up
those 1200.

Combining those ideas: the 1200 have only transient use and 30
permanent messages are needed that sum up the 1200.  Therefore, we can
prune the 1200 aggressively (using a technique I'll show below) and
should concentrate on making the 30 messages easy to write and
maintain.

To accomplish that, let's make a new branch and expand the
functionality of the patch queue manager.   The new branch will be
called the "narrative branch" and that's where we'll assemble the 30
permanent messages.   So now we have a (shallow) tree of branches:

                      narrative
                       branch
                      /
                integration
                 branch
                /  |      \
          committer ...   ...
            branches      

The narrative branch will, like the integration branch, be driven
by a patch queue manager.   Unlike the integration branch, there
is never any danger of merge conflicts on the narrative branch --
its operation is fully automated.

Narrative branch revisions will be formed in three steps:

        1. get a recent revision from the integration branch.
        2. sync-tree with the most recent narrative branch revision.
        3. commit

In other words, the narrative branch is just a series of snapshots
of the integration branch.
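As a command sequence, one snapshot might look like this (a sketch
with invented archive names; here `tla sync-tree` adopts the narrative
branch's patch log without touching file contents, and
`tla set-tree-version` points the tree at the narrative version before
the commit):

```python
# One narrative snapshot, as the commands the pqm would run.
# Archive and version names are hypothetical.
INTEGRATION = "pqm@example.com--project/integration--devo--1.0"
NARRATIVE = "pqm@example.com--project/narrative--devo--1.0"

def narrative_snapshot(log_summary):
    """Build the command list for one narrative branch revision."""
    return [
        f"tla get {INTEGRATION} snap",        # 1. a recent integration revision
        f"tla sync-tree {NARRATIVE}",         # 2. adopt the narrative patch log
        f"tla set-tree-version {NARRATIVE}",  #    make the commit go to narrative
        f"tla commit -s '{log_summary}'",     # 3. commit the snapshot
    ]

for cmd in narrative_snapshot("Fixed 100 boring minor bugs"):
    print(cmd)
```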

New narrative revisions can be triggered by a message to the pqm.
That pqm message can contain the log message for the narrative commit
or that log message can be taken from someplace else.

The log messages on the narrative branch will constitute the 
narrative history of the integration tree.   For example, 
if committer Aaron fixes 100 minor, low priority bugs on his branch,
resulting in 100 integration commits, he might finish up that month by
sending a message to the narrative branch that's logged as "Fixed 100
boring minor bugs".   If committer Bob implemented a new feature over
the course of 40 integration commits, he can finish that up with a
message to the narrative branch describing the new feature and
providing a coherent overview of its implementation (rather than a
list of 40 "and then I did this" integration messages).

So for history, and for users/developers not following the integration
branch, the narrative branch effectively is the mainline of
development.  Its trickle of messages summarizes the large changes
made between revisions.  It is a slower-moving, more polished, more
human-reader-friendly account of the integration history.

Oops, we still have those incoming 1200 log messages on the integration
branch, and now those are copied to the narrative branch.   Easy
enough to fix:

The integration branch has transient value and archival value, but not
much more.   It's ripe for cycling.   Just as every week we can cycle
the committer branches, every two or three months we can cycle the
integration branch.   This can be done seamlessly, of course -- it can
be fully automated.   Programmers won't even have to replace their
working directories.

Once every month or three:

        1. cycle the integration branch and discard old 
           integration log messages

        2. create a new revision in the narrative branch joining
           (becoming a branch of) the new integration branch and
           discarding old integration patch log entries

        3. Switch the committer branches to the new integration branch
           and, again, remove old integration logs.
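The cycle itself can be scripted.  Here is its shape as a plan of
(who, what) steps; the version names are invented, and the actual
log-pruning commands are left out since they depend on local policy:

```python
# A cycling plan: which device runs which step when the integration
# branch rolls over.  Names are hypothetical; pruning details omitted.
OLD = "integration--devo--1.0"
NEW = "integration--devo--1.1"

def cycle_plan():
    return [
        ("pqm",        f"tla tag {OLD} {NEW}"),   # 1. start the new branch
        ("narrative",  f"tla sync-tree {NEW}"),   # 2. join narrative to it
        ("committers", f"tla sync-tree {NEW}"),   # 3. switch branches over,
                                                  #    then prune old logs
    ]

for who, cmd in cycle_plan():
    print(who, "->", cmd)
```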

Think of the benefits and how beautifully this captures the cyclic
nature of most software projects:

First, for all of our branches, the rate of patch log growth 
is now just the rate of narrative branch commits -- something
like 30.

Second, all of our branches have, at all times, a finite number
(O(1200) in this case) of the most recent integration logs.  For
humans, that means there are both recent newspapers and long-term
chronicles.   For tools, it means that the features of the integration
branch (merging, testing) are readily available.

Third, cycling an integration branch does _not_ mean throwing the
old branch away.   Of course it should be kept for archival purposes
but, additionally, it can take on new utility if your development
process is phased or is otherwise partitioned into functional areas of
process responsibility.    

For example, developers can use the occasion of cycling to hand off
an integration branch to release engineers.    Even though centralized
development is being used, there's no need (imposed by the tools) to
"freeze" development during release engineering; you can just fork the
integration branch in the course of cycling it.

I described the process of cycling the integration branch as an atomic
thing:  cycle integration, resync narrative, resync committer branches
....  It doesn't have to be atomic.  It can be usefully asynchronous.
You can go from this:

                      Full Out Development Mode


                      narrative
                       branch
                      /
                old integration
                 branch
                /  |      \
          committer ...   ...
            branches      



to this:


                  Release Engineering 
                  and Concurrent Development Mode

                narrative
                 branch
                /
          old integration   =cycles-to=>   new integration
           branch                          branch
          /  |      \                    /   |    \
    some committer ...  =switches-to=> other committer
      branches                         branches ...


leaving some committers behind to do release engineering while others
start the new mainline.   The narrative branch stays in place since
that is what the release will be cut from.

When all committers have finally migrated and/or the release is done,
the narrative branch can be switched to the new integration branch and
you're back in Full Out Development Mode.

One fine point:  when logs are pruned in this scenario, the narrative
branch should record that fact and the pruning should happen in a
single commit with no other changes.  Thus, the full integration
history can be replayed, in part or in whole, into the narrative
branch.

The net result is that arch will present the long term history of your
busy project as the narrative.  Every part of the narrative has behind
it, easily accessible, a corresponding set of integration records
that show in detail how it was done.  At every instant, the most
recent integration records will already be present in the narrative
branch.

For committers, this will feel much like CVS-style centralized
development with two exceptions:

  1) once a week or month or so, committers will have to 
     run a script to switch or cycle their branches

  2) unlike CVS, committers will have a complete revision
     controlled record of all the weird "slightly out of date"
     states their working directories were in when they
     committed something to the mainline


For third parties, the history of the project is made available in a
narrative form, designed for human consumption to summarize 
the development history, and behind every step of that summary stands
a complete and detailed record of the minute-to-minute changes that
went into it.

-t





