[Pan-users] Re: global scores
From: Duncan
Subject: [Pan-users] Re: global scores
Date: Sat, 12 May 2007 11:11:33 +0000 (UTC)
User-agent: Pan/0.129 (Benson & Hedges Moscow Gold)
Thufir <address@hidden> posted
address@hidden, excerpted below, on Sat, 12 May 2007 07:57:13
+0000:
> for score files, some e-mail addresses have underscores or other
> characters. when must the escape be used? Just for dots?
That's an interesting question, as the answer is somewhat complicated.
FWIW, unescaped dots /will/ match, but they match any (single)
character. Thus, a regex of gmail.com would indeed match a literal
gmail.com, but would also match gmailqcom, address@hidden, etc, but would NOT
match gmail..com or gmailxxcom, because that's TWO characters, and dot
matches only ONE character.
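A quick way to convince yourself of the dot behavior is to try both forms in any regex engine. The sketch below uses Python's re module as a stand-in (my assumption for illustration; pan uses pcre, but the two agree on this point), and uses fullmatch so that nothing else on the line can interfere.

```python
import re

unescaped = re.compile(r"gmail.com")   # dot = any ONE character
escaped = re.compile(r"gmail\.com")    # \. = a literal dot only

for text in ("gmail.com", "gmailqcom", "gmail..com", "gmailxxcom"):
    # fullmatch requires the whole string to match the pattern
    print(text,
          "unescaped:", bool(unescaped.fullmatch(text)),
          "escaped:", bool(escaped.fullmatch(text)))
```

Only the escaped form restricts the match to a real dot, which is why escaping the dots in a domain name is worth the extra backslash.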
Generally, you'll need to watch it for anything that's not a-z,0-9. Many
but not all symbols and punctuation have special meanings in regex. Some
special chars to be alert for and what they mean (note that not all these
are valid in all headers, but that won't affect whether they match in the
scorefile or not):
* matches any number (zero or more) of the preceding, so .* matches
anything (literally, zero or more of any character). ? matches zero or
one of the preceding, so is useful for matching something that may or may
not be there. .? therefore matches a single character, that may or may
not be there. boots? would match boot or boots (s may or may not be
there) but not bootsss (except that the expression as shown isn't
anchored, so any junk including s's on either side would match).
\ is the escape character, so to match a literal \, use \\. + matches
one or more of the preceding, so ..* is exactly the same as .+ , both
meaning one or more of any character.
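The three repeat operators and the backslash escape can be checked the same way. This is only a sketch in Python's re module (an assumption for illustration, not pan's own engine); the helper name whole() is made up here and simply means "the entire string must match".

```python
import re

def whole(pattern, text):
    """True if the ENTIRE text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"boots?", "boot"))      # s is optional
print(whole(r"boots?", "boots"))
print(whole(r"boots?", "bootsss"))   # too many s's for a whole-string match

# Unanchored search only needs the pattern to appear somewhere.
print(re.search(r"boots?", "bootsss") is not None)

# ..* and .+ both need at least one character; .* is happy with none.
print(whole(r"..*", ""), whole(r".+", ""), whole(r".*", ""))

# A literal backslash is written as \\ in the pattern.
print(whole(r"\\", "\\"))
```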
^ anchors at the left, $ at the right, where "anchor" means there's
nothing outside the specified match. Without anchors, the boots?
example above still matches bootsss, bootadsfasdfe, and plain boot,
because the pattern only has to match somewhere in the text. To make
it match /only/ boot or boots, you'd use ^boots?$ .
Anything additional on the line would fail the match. (However, in our
case we are talking about header lines, with the "match" only being on
the value of the header. Header lines by definition have a header name,
such as from, followed by a colon, followed by a space, followed by the
value. Therefore, what is actually being matched is anything after the
header name, colon, and space. Also note that header lines may be
"folded" if they are too long. IDR the full folding spec from the RFC,
but you may look it up if interested. Meanwhile, just keep in mind that
the header may extend over multiple lines. Most often, this will occur
with headers such as the path header or the references header, which get
appended to, in the first case as the post propagates from server to
server, in the second as replies get nested in the thread. Long
propagation paths or deep thread reply nesting commonly causes header
folding.)
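To make the anchoring and header-value points concrete, here is a rough sketch in Python. It is NOT pan's code: the function name header_value() is made up for illustration, and the unfolding step simply drops any line break that is followed by a space or tab, which is the usual unfolding rule.

```python
import re

def header_value(raw_header):
    """Unfold a raw header line and return the text after 'Name: '.

    Hypothetical helper for illustration only.
    """
    # A folded header continues on the next line, which starts with
    # a space or tab; removing the line break joins the pieces again.
    unfolded = re.sub(r"\r?\n(?=[ \t])", "", raw_header)
    _name, _sep, value = unfolded.partition(": ")
    return value

anchored = re.compile(r"^boots?$")
unanchored = re.compile(r"boots?")

value = header_value("Subject: bootsss")
print(anchored.search(value) is not None)     # extra s's fail the anchored form
print(unanchored.search(value) is not None)   # unanchored still finds "boots"

print(header_value("References: <a@x>\r\n <b@y>"))
```

The anchors apply to the header value only, so ^ sits right after the "Name: " prefix rather than at the start of the raw line.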
[] indicates a character class. Any of the enumerated characters will
match. A range may be indicated with a dash (which can be matched
literally by placing it first, after a ^ if any), and a ^ as the first
character negates. As with a -, a ] must be placed first (or escaped) as
otherwise it would indicate the end of the character class. Thus,
[a-zA-Z0-9] indicates all ASCII letters plus numbers. (Note that
normally, regex are case sensitive so [a-z] and [A-Z] are different.
However, pan is normally case insensitive, so it won't matter to it. You
can force case sensitivity by using keyword= instead of keyword:.)
[bcf]at would match bat, cat, and fat, but not mat. Also see the POSIX
character classes, below.
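The character class rules above can be tried out with a few one-liners. Again this is a sketch using Python's re module as a stand-in (an assumption for illustration); note that Python is case sensitive by default, so the re.IGNORECASE flag is used here to imitate pan's default behavior.

```python
import re

def whole(pattern, text, flags=0):
    """True if the ENTIRE text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text, flags) is not None

print(whole(r"[bcf]at", "bat"), whole(r"[bcf]at", "mat"))
print(whole(r"[^bcf]at", "mat"))      # leading ^ negates the class

print(whole(r"[-a-c]", "-"))          # a dash placed first is literal
print(whole(r"[]a]", "]"))            # a ] placed first is literal too

# Case sensitive by default, case insensitive on request.
print(whole(r"[a-z]+", "Duncan"))
print(whole(r"[a-z]+", "Duncan", re.IGNORECASE))
```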
You can specify a limited range of repeats (as opposed to + and * which
are unlimited) by using {n,m}, where n and m are the minimum and maximum
number of repeats. Leaving out the maximum makes it unlimited at that end.
Thus, ba+d matches bad, baaaaaaaad, baaaaaaaaaaaaaaaaaaaaad, etc, but not
bd (ba*d would also match bd, zero or more a's). ba{1,3}d matches bad,
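baad, and baaad, but not bd or baaaad. ba{0,3}d would be the same as
ba?a?a?d and match bd and up to three a's. ba{1,}d would be the same as
ba+d. ba{2,}d would require two or more a's... **IMPORTANT** I've not
specifically tested pan, but regex implementations differ here. POSIX
basic regex (plain grep, for example) treats the unescaped {} as literals
and the escaped \{\} as range indicators, while pcre and the other
extended flavors treat the unescaped {} as range indicators and the
escaped as literals. They also differ on whether the minimum may be left
out: {,3} works in some, but older pcre takes it as four literal
characters, so write {0,3} to be safe. IF USING {} THEREFORE, TEST YOUR
SCORES BEFORE RELYING ON THEM!!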
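In that spirit, here is the kind of throwaway test that settles the question for a given engine. The sketch below runs against Python's re module (an assumption for illustration; its answers tell you about Python, not necessarily about pan), where unescaped {} are range indicators.

```python
import re

def whole(pattern, text):
    """True if the ENTIRE text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

for text in ("bd", "bad", "baad", "baaad", "baaaad"):
    print(text,
          "a+:", whole(r"ba+d", text),
          "a*:", whole(r"ba*d", text),
          "a{1,3}:", whole(r"ba{1,3}d", text),
          "a{0,3}:", whole(r"ba{0,3}d", text),
          "a{2,}:", whole(r"ba{2,}d", text))
```

Writing the minimum explicitly, as in {0,3}, avoids depending on how a particular engine reads an omitted minimum.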
Parentheses, ( ), indicate grouping. (They also save the included match
for further use, say in substitution, but pan's scoring doesn't need or
do substitution so you can safely ignore that for now.) | indicates
alternatives. Thus, (dog)|(cat) will match the three letters dog, OR
match the three letters cat. It will NOT match dat or cog, and if
anchored as ^((dog)|(cat))$ it won't match dogcat either (unanchored, it
simply finds the dog inside dogcat).
Note that it's occasionally useful to match a sequence which may or may
not be there, as ((dog)|), the dog may be there or not.
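Grouping and alternatives are easy to get subtly wrong, so a small check helps. This is a sketch in Python's re module (an assumption for illustration); the hot((dog)|)s pattern is a made-up example of an optional sequence in the middle of a word.

```python
import re

def whole(pattern, text):
    """True if the ENTIRE text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

for text in ("dog", "cat", "dogcat", "dat", "cog"):
    print(text, whole(r"(dog)|(cat)", text))

# Unanchored, the pattern still finds "dog" inside "dogcat".
print(re.search(r"(dog)|(cat)", "dogcat") is not None)

# An empty alternative makes the group optional; (dog)? does the same.
print(whole(r"hot((dog)|)s", "hotdogs"), whole(r"hot((dog)|)s", "hots"))
print(whole(r"hot(dog)?s", "hotdogs"), whole(r"hot(dog)?s", "hots"))
```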
Escaped letters often have other meanings, depending on the regex
implementation. pan uses pcre, perl compatible regular expressions, an
extremely rich matching language one could (and many have) literally
write /chapters/ on. I'll just mention a couple such escaped letter
matches that you may find useful: \s matches any single whitespace
character (space, tab, newline and so on), and \t matches a tab. The
capitalized \S matches any single NON-whitespace character; there's no
corresponding \T in pcre. A general idea you can look up for more if
desired is word matching: \b and \B match at a word border and at a
non-border, while \w and \W match a single word character and a single
non-word character.
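A few of these escapes in action, as a sketch in Python's re module (an assumption for illustration; Python happens to share these particular escapes with pcre):

```python
import re

def whole(pattern, text):
    """True if the ENTIRE text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# \s is any whitespace character, \t is a tab only.
print(whole(r"\s", " "), whole(r"\s", "\t"))
print(whole(r"\t", "\t"), whole(r"\t", " "))

# \S is a single NON-whitespace character.
print(whole(r"\S", "x"), whole(r"\S", " "))

# \b marks a word border, so cat matches as a word but not inside one.
print(re.search(r"\bcat\b", "a cat sat") is not None)
print(re.search(r"\bcat\b", "concatenate") is not None)

# \w+ picks out runs of word characters.
print(re.findall(r"\w+", "global scores, again"))
```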
Finally, the original regex matching language, as so many things
computer, was designed with ASCII in mind. All those "funny"
international symbols, with `, ^, etc in combination with letters, can
make things "interesting". Also, at least one western charset has Z as a
letter somewhere in the middle of its alpha chars, so A-Z won't have the
intended effect there! To address these and other "interesting"
situations without making things /too/ complicated for those not using
those character sets, POSIX character classes were added. There's also
collating element equivalency classes which work somewhat similarly but
which I'm not going to cover here. Back to POSIX character class
matching. Within a [] character class, one can further include POSIX
character classes, denoted with [:classname:]. (It's important to note
that these are recognized within [] only, so to use them alone, you use
[[:classname:]].) Standard POSIX character classes include:
alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space,
upper, and xdigit.
Thus, instead of [a-zA-Z0-9] which may not include "exotic" alphabetic
characters, [[:alnum:]] may be used. [[:space:]] includes vertical and
horizontal space, so spaces, tabs, line and form feeds, and carriage
returns. [[:print:]] is all printable characters, [[:cntrl:]] is the
reverse, control characters. Lower and upper won't really matter in our
context since pan is case insensitive by default. graph is all graphical
characters: the same as print, except that print also includes the plain
space character while graph doesn't.
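Not every engine understands the [:classname:] notation; Python's re module, used for this sketch, does not. So the example below is only an ASCII approximation that spells two of the classes out by hand (the names GRAPH, PRINT and char_class() are made up for illustration). It won't cover "exotic" characters the way the real POSIX classes can, but it does show how graph and print relate.

```python
import re
import string

# ASCII-only stand-ins for the POSIX classes (assumption: C/ASCII locale).
GRAPH = string.ascii_letters + string.digits + string.punctuation
PRINT = GRAPH + " "          # print = graph plus the plain space

def char_class(chars):
    """Build a compiled 'one or more of these characters' pattern."""
    return re.compile("[" + re.escape(chars) + "]+")

graph = char_class(GRAPH)
printable = char_class(PRINT)

print(graph.fullmatch("a1!") is not None)
print(graph.fullmatch("a b") is not None)        # space is not graphical
print(printable.fullmatch("a b") is not None)    # but it is printable
print(printable.fullmatch("a\tb") is not None)   # a tab is neither
```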
In addition to what has already been covered, there are all sorts of
additional matchings, positive and negative lookahead and lookbehind
(suppose you use .*ad but don't want it to match covad, a DSL provider,
for instance, a negative lookbehind may be just the ticket), even ways of
executing external programs and returning the results for the match (I
doubt pan implements that but honestly haven't tried), all /sorts/ of
fancy and exotic stuff.
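The covad case might look something like the sketch below. It is written for Python's re module (an assumption for illustration; pcre uses the same (?<!...) syntax for negative lookbehind, but test it in pan before relying on it). The pattern name not_covad is made up here.

```python
import re

# Ends in "ad", but that "ad" must not be directly preceded by "cov".
# IGNORECASE imitates pan's default case insensitivity.
not_covad = re.compile(r"(?<!cov)ad$", re.IGNORECASE)

def hits(text):
    """True if the pattern is found somewhere in the text."""
    return not_covad.search(text) is not None

for text in ("nomad", "ad", "covad", "Covad"):
    print(text, hits(text))
```

The lookbehind checks the three characters just before "ad" without making them part of the match, which is what lets it veto covad while leaving every other "...ad" alone.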
As I mentioned above, literally chapters, if not entire books, could be
(and have been) written on the subject of regex. However, that should be
good for an introduction. Basically, anytime you use something outside
of an alphanumeric literal match, consider the possibility that it may
need escaped, and test before relying on it. Do that and keep in
mind the basics, .*+?()[]{}\^$| , that do need escaped, and you'll be
covered well over 90% of the time, certainly within pan's limited usage,
for scoring headers. The rest is nice to know for those special cases,
but not generally necessary, and can be looked up (save this post, or
google on "regular expressions", or even "perl compatible regular
expressions") if necessary.
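Coming back to the original question about e-mail addresses: if you'd rather not think about which characters need the backslash, many regex libraries can do the escaping for you. The sketch below uses Python's re.escape (an assumption for illustration, pan has no such helper that I know of), and the address is a made-up example. You could paste the printed result into a score rule, after testing it.

```python
import re

address = "first_last@gmail.com"      # hypothetical address

# re.escape adds backslashes wherever the engine would otherwise
# treat a character as special.
pattern_text = re.escape(address)
print(pattern_text)

literal = re.compile(pattern_text)
print(literal.fullmatch(address) is not None)
print(literal.fullmatch("first_last@gmailxcom") is not None)

# Escaping by hand: only the dot needs it here, the underscore does not.
print(re.fullmatch(r"first_last@gmail\.com", address) is not None)
```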
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman