[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Subject header matching--once again

From: Lars Henriksen
Subject: Subject header matching--once again
Date: Sat, 2 Nov 2002 22:35:05 +0100
User-agent: Mutt/1.4i


This is a story in four parts about how to recognize a PR in the Subject: header
of an email sent to GNATS.

Back in December 2001 there was a thread discussing Subject: matching, why it
was so restrictive, how to loosen it and make it more useful. This is the mail
that started the discussion:

Milan concluded:

> Well, to summarize, I suggest the following:
> - Not to try to match "re", "fw", etc. and simply look for a substring
>   as stated below.
> - Accept "\<CATEGORY/NUMBER" where CATEGORY is a valid category name and
>   PR NUMBER is present in CATEGORY.
> - Accept "\<PR[ \t/]*NUMBER" where "PR" must be all capitals and NUMBER
>   corresponds to an existing PR number.

The second requirement was changed to:

DB>                                                 Accept
DB> "\<CATEGORY/NUMBER" where CATEGORY is a valid category name and
DB> NUMBER is a valid PR number (in any category).

It was also debated how to handle a Subject: with several matches (the
infamous Subject: containing OS/2). Milan ended the thread:

> The matching was adjusted according to our consensus, except that I
> didn't bother with the matching cycle, only the first possible match is
> considered. I haven't tested it, would you like to do it?

In February, someone noticed that Subject: matching had stopped working.

Andrew Gray submitted a patch:

which fixed two bugs and made Subject: matching work again. But the fix was
not committed and went unnoticed for a time (see PART III).

In May, Mel Hatzis noticed the same problem and submitted a fix. After some
email exchanges Andrew Gray's fix from March was committed so all should be
well. But it wasn't (and it isn't).

For one thing, the agreement from December is not in the GNATS documentation.
This may be just as well because the code does not implement it (fully). The
regular expression now used for matching a PR is:

     \<((PR[ \t/])\|([-a-z0-9_+.]+)/)([0-9]+)

Here regex groups two and three are not used, and the expression may be
simplified to:

     \<(PR[ \t/]\|[-a-z0-9_+.]+/)([0-9]+)

The check for upper case PR is by appearance only because Subject: matching
always ignores case (notice that the check for category name has no upper
case letters). This point has never been raised before, but I think it should
be. I believe that Subject: matching should be case sensitive. Not just to be
able to check for PR, but simply because it is useful (see PART IV).

Furthermore, PR1234 is not accepted. This form was explicitly mentioned in
December as desirable. I would like to add PR#1234 and exclude PR/1234.

My proposal is that the regular expression be changed to

     \<(PR[ \t#]?\|[-\w+.]+/)([0-9]+)

and that the regex search be made case sensitive.

This will pick up the first appearance of any of PR 1234, PR#1234, PR1234,
category/1234 and Category/1234 anywhere in the Subject: header.

During the discussion in May, Mel Hatzis suggested that the regular expression
be made configurable via dbconfig:

Everyone who uttered an opinion was in favour, and so am I. Mel submitted a
patch that is still pending:

> You can now (optionally) include the following in the database-info
> section of the dbconfig file:
>     # The regular expression used to determine whether a PR is referenced
>     # on an email subject line
>     subject-matching {
>         "\\<((PR[ \t/])\\|([-a-z0-9_+.]+/))([0-9]+)"
>         capture-group "4"
>     }
> (The above example is exactly analogous to the built-in default)

The regex group identified by capture-group must capture the PR number.

The proposal is fine, but has a drawback that is worth a discussion. With the
built-in default (the one discussed in parts I-III) both category, if present,
and PR number are checked for validity. With a subject-matching entry in
dbconfig, it is impossible to check the validity of a category: there is no way
of checking the category since only the PR number is captured. Hence, it
is not true that the example above, as stated, is exactly analogous to the
built-in default. At least not by design.

The last remark in the previous paragraph alludes to a bug in Mel's patch. The
code extracts the PR number from the matching substring, but unconditionally
checks for a preceding "PR" or category. The patch also makes gnatsd dump core
if no subject-matching entry is present in dbconfig, but that is an easy fix.

My proposal is to extend Mel's design by allowing the category name to be
captured optionally and checked for validity.

The dbconfig entry syntax could be:

     subject-matching {
          pr-group "integer"
          category-group "integer"

The first integer is the regex group containing the PR number, the second the
regex group containing the category name or 0 (zero) if not used. The following
entry is equivalent to the built-in default (with my amendments):

     subject-matching {
          "\\<(PR[ \t#]?\\|([-\\w+.]+)/)([0-9]+)"
          pr-group "3"
          category-group "2"

An entry that does not use category is:

     subject-matching {
          "\\<PR[ \t#]?([0-9]+)"
          pr-group "1"
          category-group "0"

It should also be decided which syntax bits to use for Subject: matching.
At present only RE_NO_BK_PARENS is set, but why? Setting e.g. RE_NO_BK_VBAR
would avoid the need to escape the alternation operator. Milan suggested
using the same syntax bits as the rest of gnats:


but these are an issue in their own right, and this email is already becoming
too long.

Lars Henriksen

reply via email to

[Prev in Thread] Current Thread [Next in Thread]