[Pan-users] Re: global scores

From: Duncan
Subject: [Pan-users] Re: global scores
Date: Sat, 12 May 2007 11:11:33 +0000 (UTC)
User-agent: Pan/0.129 (Benson & Hedges Moscow Gold)

Thufir <address@hidden> posted
address@hidden, excerpted below, on  Sat, 12 May 2007 07:57:13

> for score files, some e-mail addresses have underscores or other
> characters.  when must the escape be used?  Just for dots?

That's an interesting question, as the answer is somewhat complicated.

FWIW, unescaped dots /will/ match, but they match any (single) 
character.  Thus, a regex of gmail.com would indeed match a literal 
gmail.com, but would also match gmailqcom, address@hidden, etc, but would 
NOT match gmailxxcom, because that's TWO characters, and dot matches only 
ONE character.
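
A quick way to see the dot behaviour for yourself is a few lines of 
Python.  This is only a sketch: Python's re module stands in for pan's 
regex engine here (an assumption, the two are similar but not identical), 
and the hits() helper is a made-up name for illustration.

```python
import re

# Python's re module is only a stand-in for pan's regex engine.
unescaped = re.compile(r"gmail.com")    # dot matches any single character
escaped = re.compile(r"gmail\.com")     # backslash-dot matches a literal dot only

def hits(pattern, text):
    """Return True if the pattern matches anywhere in text (unanchored)."""
    return pattern.search(text) is not None

for text in ("gmail.com", "gmailqcom", "gmailxxcom"):
    print(text, hits(unescaped, text), hits(escaped, text))
# gmail.com True True
# gmailqcom True False
# gmailxxcom False False
```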

Generally, you'll need to watch it for anything that's not a-z,0-9.  Many 
but not all symbols and punctuation have special meanings in regex.  Some 
special chars to be alert for and what they mean (note that not all these 
are valid in all headers, but that won't affect whether they match in the 
scorefile or not):

* matches any number (zero or more) of the preceding, so .* matches 
anything (literally, zero or more of any character).  ? matches zero or 
one of the preceding, so is useful for matching something that may or may 
not be there.  .? therefore matches a single character, that may or may 
not be there.  boots? would match boot or boots (s may or may not be 
there) but not bootsss (except that the expression as shown isn't 
anchored, so any junk including s's on either side would match).
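
The boots? example can be checked along these lines.  Again a sketch 
using Python's re module as a stand-in for pan's engine (an assumption); 
fullmatch() is used to take the unanchored behaviour out of the picture.

```python
import re

plural = re.compile(r"boots?")   # the trailing s may or may not be there

# fullmatch() requires the whole string to match, which isolates the
# effect of ? from the unanchored behaviour.
print(bool(plural.fullmatch("boot")))      # True
print(bool(plural.fullmatch("boots")))     # True
print(bool(plural.fullmatch("bootsss")))   # False

# search() is unanchored, so extra s's on the end still match.
print(bool(plural.search("bootsss")))      # True

# .* matches anything at all, including the empty string.
print(bool(re.fullmatch(r".*", "")))       # True
```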

\ is the escape character, so to match a literal \, use \\.  + matches 
one or more of the preceding, so ..* is exactly the same as .+ , both 
meaning one or more of any character.
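
To convince yourself that ..* and .+ really are the same thing, and to 
see the doubled backslash in action, something like the following will 
do.  Python's re module is assumed as a stand-in for pan's engine, and 
the sample strings are arbitrary.

```python
import re

# ..* and .+ agree on every input: both need at least one character.
for text in ("", "a", "ab", "a.b"):
    dot_star = re.fullmatch(r"..*", text) is not None
    dot_plus = re.fullmatch(r".+", text) is not None
    print(repr(text), dot_star, dot_plus)   # the two columns always agree

# A literal backslash needs the escape character in front of it: \\
literal_backslash = re.compile(r"\\")
print(literal_backslash.search("C:\\temp") is not None)        # True
print(literal_backslash.search("no slash here") is not None)   # False
```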

^ anchors at the left, $ at the right, where "anchor" means there's 
nothing outside the specified match.  Thus, using the above boots? 
example, bootsss would still match as would bootadsfasdfe and simply 
boot.  To make it match /only/ boot or boots, you'd use ^boots?$ .  
Anything additional on the line would fail the match.  (However, in our 
case we are talking about header lines, with the "match" only being on 
the value of the header.  Header lines by definition have a header name, 
such as from, followed by a colon, followed by a space, followed by the 
value.  Therefore, what is actually being matched is anything after the 
header name, colon, and space.  Also note that header lines may be 
"folded" if they are too long.  IDR the full folding spec from the RFC, 
but you may look it up if interested.  Meanwhile, just keep in mind that 
the header may extend over multiple lines.  Most often, this will occur 
with headers such as the path header or the references header, which get 
appended to, in the first case as the post propagates from server to 
server, in the second as replies get nested in the thread.  Long 
propagation paths or deep thread reply nesting commonly cause header 
folding.)
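
Here's the anchoring difference as a small sketch.  Python's re module is 
assumed as a stand-in for pan's engine, and the header line is a made-up 
example; how pan itself splits the header name from the value is not 
shown here, partition() just imitates the idea.

```python
import re

unanchored = re.compile(r"boots?")
anchored = re.compile(r"^boots?$")

for value in ("boot", "boots", "bootsss", "bootadsfasdfe"):
    print(value, bool(unanchored.search(value)), bool(anchored.search(value)))
# boot True True
# boots True True
# bootsss True False
# bootadsfasdfe True False

# A header line is "name: value"; only the value is matched.
header_line = "Subject: boots"               # hypothetical header line
name, _, value = header_line.partition(": ")
print(bool(anchored.search(value)))          # True: anchors apply to the value
print(bool(anchored.search(header_line)))    # False: the whole line starts with the name
```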

[] indicates a character class.  Any of the enumerated characters will 
match.  A range may be indicated with a dash (which can be matched 
literally by placing it first, after a ^ if any), and a ^ as the first 
character negates.  As with a -, a ] must be placed first (or escaped) as 
otherwise it would indicate the end of the character class.  Thus, 
[a-zA-Z0-9] indicates all ASCII letters plus numbers.  (Note that 
normally, regex are case sensitive so [a-z] and [A-Z] are different.  
However, pan is normally case insensitive, so it won't matter to it.  You 
can force case sensitivity by using keyword= instead of keyword:.) 
[bcf]at would match bat, cat, and fat, but not mat.  Also see the POSIX 
character classes, below.
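
A few character classes to play with, as a sketch.  Python's re module is 
assumed as a stand-in for pan's engine, and re.IGNORECASE is used to 
imitate pan's case-insensitive default.

```python
import re

at_words = re.compile(r"^[bcf]at$")
print([w for w in ("bat", "cat", "fat", "mat") if at_words.search(w)])
# ['bat', 'cat', 'fat']

not_bcf = re.compile(r"^[^bcf]at$")       # a leading ^ negates the class
print(bool(not_bcf.search("mat")))        # True

dash_first = re.compile(r"^[-a-c]+$")     # a leading - is a literal dash
print(bool(dash_first.search("a-b")))     # True

# Plain [a-z] is case sensitive; IGNORECASE imitates pan's default.
print(re.search(r"[a-z]", "ABC") is None)                       # True
print(re.search(r"[a-z]", "ABC", re.IGNORECASE) is not None)    # True
```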

You can specify a limited range of repeats (as opposed to + and * which 
are unlimited) by using {n,m}, where n and m are the minimum and maximum 
number of repeats.  Leaving out the maximum makes it unlimited at that 
end.  Thus, ba+d matches bad, baaaaaaaad, baaaaaaaaaaaaaaaaaaaaad, etc, 
but not bd (ba*d would also match bd, zero or more a's).  ba{1,3}d 
matches bad, baad, and baaad, but not bd or baaaad.  ba{0,3}d would be 
the same as ba?a?a?d and match bd and up to three a's.  (Some 
implementations also accept ba{,3}d for that, but pcre has traditionally 
taken {,3} as literal text, so spell out the 0.)  ba{1,}d would be the 
same as ba+d.  ba{2,}d would require two or more a's...  **IMPORTANT**  
I've not specifically tested pan but some regex implementations treat the 
unescaped {} as literals and escaped {} as range indicators, some treat 
the escaped as literal and unescaped as range indicators.  IF USING {}, 
TEST BEFORE RELYING ON IT.
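
The repeat ranges can be tried out like so.  This is a sketch in Python, 
whose re module treats the unescaped {} as the range indicator; that says 
nothing certain about pan, so the same checks would need repeating there.  
The whole() helper is a made-up name, and the minimum is always written 
out so the example doesn't depend on how an engine treats an omitted one.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"ba{1,3}d", "bad"))      # True
print(whole(r"ba{1,3}d", "baaad"))    # True
print(whole(r"ba{1,3}d", "bd"))       # False: needs at least one a
print(whole(r"ba{1,3}d", "baaaad"))   # False: four a's is one too many

print(whole(r"ba{0,3}d", "bd"))       # True: zero a's allowed
print(whole(r"ba?a?a?d", "bd"))       # True: same thing spelled with ?

print(whole(r"ba{2,}d", "bad"))       # False: needs two or more a's
print(whole(r"ba{2,}d", "baaaaad"))   # True: no upper limit
```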

Parentheses () indicate grouping.  (They also save the included match 
for further use, say in substitution, but pan's scoring doesn't need or 
do substitution so you can safely ignore that for now.)  | indicates 
alternatives.  Thus, (dog)|(cat) will match the three letters dog, OR 
match the three letters cat.  Anchored, as ^((dog)|(cat))$, it will NOT 
match dogcat, or dat, or cog.  Note that it's occasionally useful to 
match a sequence which may or may not be there, as ((dog)|), the dog may 
be there or not.
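
Grouping and alternatives, sketched in Python (re module assumed as a 
stand-in for pan's engine).  The patterns are anchored so that only whole 
strings count, and "hot" is just an arbitrary prefix to show the optional 
group.

```python
import re

pet = re.compile(r"^((dog)|(cat))$")
for word in ("dog", "cat", "dogcat", "dat", "cog"):
    print(word, bool(pet.search(word)))
# dog True
# cat True
# dogcat False
# dat False
# cog False

# An empty alternative makes the group optional: dog may be there or not.
maybe_dog = re.compile(r"^hot((dog)|)$")
print(bool(maybe_dog.search("hot")))      # True
print(bool(maybe_dog.search("hotdog")))   # True
print(bool(maybe_dog.search("hotcat")))   # False
```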

Escaped letters often have other meanings, depending on the regex 
implementation.  pan uses pcre, perl compatible regular expressions, an 
extremely rich matching language one could (and many have) literally 
write /chapters/ on.  I'll just mention a couple such escaped letter 
matches that you may find useful: \s matches any whitespace character 
(space, tab, newline, etc) and \t matches a tab.  The capitalized form \S 
matches any NON-whitespace character; there's no corresponding \T.  A 
general idea you can look up for more if desired is word matching: \w 
matches a word character (letter, digit or underscore), \W a non-word 
character, \b a word border, and \B anywhere that's not a word border.
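
The escaped-letter matches behave as follows in Python's re module, which 
is assumed here as a stand-in for pcre; the two agree on these particular 
escapes for plain ASCII text, but that's worth verifying in pan itself.

```python
import re

print(bool(re.search(r"\s", "a b")))       # True: a space is whitespace
print(bool(re.search(r"\s", "a\tb")))      # True: so is a tab
print(bool(re.search(r"\t", "a b")))       # False: \t is a tab only

print(re.findall(r"\S+", "two words"))     # ['two', 'words']

# \w counts the underscore as a word character, but not the dash.
print(re.findall(r"\w+", "foo_bar baz-qux"))   # ['foo_bar', 'baz', 'qux']

# \b is a word border, so cat inside concat is not a separate word.
print(re.findall(r"\bcat\b", "cat concat"))    # ['cat']
```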

Finally, the original regex matching language, as so many things 
computer, was designed with ASCII in mind.  All those "funny" 
international symbols, with `, ^, etc in combination with letters, can 
make things "interesting".  Also, at least one western charset has Z as a 
letter somewhere in the middle of its alpha chars, so A-Z won't have the 
intended effect there!  To address these and other "interesting" 
situations without making things /too/ complicated for those not using 
those character sets, POSIX character classes were added.  There's also 
collating element equivalency classes which work somewhat similarly but 
which I'm not going to cover here.  Back to POSIX character class 
matching.  Within a [] character class, one can further include POSIX 
character classes, denoted with [:classname:].  (It's important to note 
that these are recognized within [] only, so to use them alone, you use 
[[:classname:]].)  Standard POSIX character classes include:

alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, 
upper, and xdigit.

Thus, instead of [a-zA-Z0-9] which may not include "exotic" alphabetic 
characters, [[:alnum:]] may be used.  [[:space:]] includes vertical and 
horizontal space, so spaces, tabs, line and form feeds, and carriage 
returns.  [[:print:]] is all printable characters, [[:cntrl:]] is the 
reverse, control characters.  Lower and upper won't really matter in our 
context since pan is case insensitive by default.  graph is all graphical 
(visible) characters, similar to print except that print also includes 
the space character while graph doesn't.

In addition to what has already been covered, there are all sorts of 
additional matchings, positive and negative lookahead and lookbehind 
(suppose you use .*ad but don't want it to match covad, a DSL provider, 
for instance, a negative lookbehind may be just the ticket), even ways of 
executing external programs and returning the results for the match (I 
doubt pan implements that but honestly haven't tried), all /sorts/ of 
fancy and exotic stuff.  
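
For the covad case, a negative lookbehind might look like the sketch 
below.  Python's re module is assumed as a stand-in for pcre (both accept 
this fixed-length lookbehind syntax), and I've not tried it in pan's 
scorefile.

```python
import re

# (?<!cov) means: the ad must NOT be immediately preceded by cov.
ends_in_ad = re.compile(r"^.*(?<!cov)ad$")

print(bool(ends_in_ad.search("nomad")))    # True
print(bool(ends_in_ad.search("covad")))    # False: the lookbehind rejects it
print(bool(ends_in_ad.search("ad")))       # True: nothing before the ad at all
```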

As I mentioned above, literally chapters, if not entire books, could be 
(and have been) written on the subject of regex.  However, that should be 
good for an introduction.  Basically, anytime you use something outside 
of an alphanumeric literal match, consider the possibility that it may 
need escaped, and test before relying on it.  Do that and keep in 
mind the basics, .*+?()[]{}\^$| , that do need escaped, and you'll be 
covered well over 90% of the time, certainly within pan's limited usage, 
for scoring headers.  The rest is nice to know for those special cases, 
but not generally necessary, and can be looked up (save this post, or 
google on "regular expressions", or even "perl compatible regular 
expressions") if necessary.
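
If you'd rather not escape an address by hand, one option is to let a 
regex library do it and then check the result.  The sketch below uses 
Python's re.escape(); the address is made up, and since the escaping is 
done by Python's rules rather than pan's, treat the output as a starting 
point to test, not something to paste in blind.

```python
import re

address = "some_user@example.com"     # hypothetical address
pattern = re.escape(address)          # escapes the characters Python considers special
print(pattern)

# The escaped pattern matches the literal address...
print(re.search(pattern, address) is not None)                   # True
# ...but no longer lets the dot stand for any character.
print(re.search(pattern, "some_user@examplexcom") is not None)   # False
# The unescaped address, used as a regex, does let it through.
print(re.search(address, "some_user@examplexcom") is not None)   # True
```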

Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
