
RE: [Pan-users] Composing regex for Pan


From: Michael R. McCarrey
Subject: RE: [Pan-users] Composing regex for Pan
Date: Sat, 13 Mar 2004 22:36:26 -0800

On Thu, 2004-03-11 at 06:04, Paul Hudson wrote:
> It did make it to me at least.
> 
> What have you tried? 
Hi Paul,

The only thing that even comes close so far is:
(from the 'Edit Filter' menu)

ALL OF:
Subject does not match regex [A-Z !=/0-9]+$
   "     "    "    "     "   ^\(|^\=|^\~
   "     "    "    "     "   ^\*|^\!|^;

Some things break it, though.
 A DOG & CAT
breaks it, for instance, as does
THAT DOG?? STINKS!!!

My attempts at "repairing" this only make it worse.

> 
> What rule (expressed in English :-) are you trying to create? I think, from
> your examples, it might be:
> 
> Match all subject lines with at least one word that's at least two letters
> long, all in upper case?
> 
> Something like
> 
>  \b[:upper:]{2,}\b
This dumped all replies. The regex animal book doesn't explain those
constructs very well (nor have any of the web sites I've looked at). Too
much knowledge is assumed right off the bat. I can get many of the
examples to work within the program "Regex Coach", but not in Pan. One
thing, I think I'm mixing up grep/egrep, ed/sed, python and perl syntax.
Of course, the book seems to also, so if I am, I've been misled.
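One possible explanation for the "dumped all replies" result, sketched in Python rather than in Pan itself: in PCRE a POSIX class name such as [:upper:] is only recognised inside an enclosing pair of brackets, i.e. [[:upper:]]. Written with a single pair, it is an ordinary set of the characters ':', 'u', 'p', 'e' and 'r'. If Pan matches case-insensitively by default (an assumption, though the (?-i) suggestion hints at it), that set matches the "Re" in "Re:". Python's re module has no POSIX classes, so [A-Z] stands in for [[:upper:]] below; the helper name hits is made up for this illustration.

```python
import re

# The pattern as posted: [:upper:] on its own is a plain set holding
# the characters ':', 'u', 'p', 'e', 'r' (not "any upper-case letter").
AS_POSTED = r"\b[:upper:]{2,}\b"

# Python's re has no POSIX classes, so [A-Z] stands in for what
# PCRE would spell [[:upper:]].
INTENDED = r"\b[A-Z]{2,}\b"

def hits(pattern, subject, flags=0):
    """Return the first matched text, or None if nothing matches."""
    m = re.search(pattern, subject, flags)
    return m.group(0) if m else None

# Case-insensitive matching is assumed here to mimic Pan's default.
print(hits(AS_POSTED, "Re: quiet question", re.IGNORECASE))    # Re
print(hits(AS_POSTED, "THAT DOG?? STINKS!!!", re.IGNORECASE))  # None

# The intended pattern, matched case-sensitively.
print(hits(INTENDED, "Re: quiet question"))    # None
print(hits(INTENDED, "THAT DOG?? STINKS!!!"))  # THAT
```

Under those assumptions the posted pattern catches every reply and misses the shouting, which is the opposite of what was wanted.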

> 
> looks like it might work (not tested, though) - as long as the regexps in
> Pan are really PCRE ones enough to support the [: notation (see
> http://www.pcre.org/pcre.txt). An equivalent is
> 
> (?-i)\b[A-Z]{2,}\b
This works, sort of, if I select NONE OF:, but things like "!?&" in the
string break it.
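For comparison, here is how that pattern behaves in Python's re module, which is close to PCRE for this syntax but is not Pan's engine, so results inside Pan may differ. Python matches case-sensitively by default and does not accept a bare (?-i) prefix, so it is left out. The sample subjects other than the two from this message are made up.

```python
import re

# Python is case-sensitive by default, so the (?-i) prefix from the
# quoted pattern is omitted here.
SHOUT_WORD = re.compile(r"\b[A-Z]{2,}\b")

subjects = [
    "A DOG & CAT",
    "THAT DOG?? STINKS!!!",
    "Just a test post",
    "Re: NASA launch delayed",   # made-up subject with an acronym
]

for s in subjects:
    # findall lists every upper-case word of two or more letters
    print(repr(s), SHOUT_WORD.findall(s))
```

In this engine the punctuation does not stop the match: both shouting subjects are caught. The weakness is the other way round: one acronym in an otherwise ordinary subject is enough to trigger the rule.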

> 
> (also not tested)
> 
> (The (? notation is how options like caseless matching are changed.)
What I've been reading says that the ? refers to "zero or more times"
(this must be my "snake & necklace" problem again).
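The two readings of ? can both be right, because the character has two jobs. After an ordinary item it is a quantifier meaning "zero or one" (it is * that means "zero or more"). Directly after an opening parenthesis it introduces a special group, such as an option setting. A small sketch in Python's re module, whose syntax for these constructs follows Perl closely; the names OPTIONAL_U and MIXED are invented for the example, and the scoped (?-i:...) form needs Python 3.6 or later.

```python
import re

# 1. Quantifier: the 'u' may appear zero times or once, never twice.
OPTIONAL_U = re.compile(r"colou?r")
assert OPTIONAL_U.fullmatch("color")
assert OPTIONAL_U.fullmatch("colour")
assert not OPTIONAL_U.fullmatch("colouur")

# 2. Option setting: (?i) at the start makes the whole pattern
#    case-insensitive.
assert re.search(r"(?i)dog", "THAT DOG")
assert not re.search(r"dog", "THAT DOG")

# 3. Scoped option: (?-i:...) turns case-insensitivity back off, but
#    only for the text inside that group.
MIXED = re.compile(r"(?i)that (?-i:DOG)")
assert MIXED.search("That DOG")
assert not MIXED.search("That dog")
```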

> 
> P.
> 
I think what I'm trying to do isn't possible, or at least it's way
beyond me in this environment.

I want to dump as many of the annoying spam, troll and AOL-keyboard
posts as I can, which, I think, will require parsing the string's
individual characters multiple times (maybe my approach is flawed?).
Once for ALL CAPS (if true, dump the post, regardless of additional
characters in the string). After that, it gets interesting. Now we
should have mixed-case alpha and/or alphanumeric (or "should" have).
Next, filter on multiple instances (2 or more to start) of any
non-alpha, printable characters, anywhere in the string. Dump the
matches. Then filter those results against any other specific criteria
until what remains are subjects that look "normal", as in: Just a test
post | Just A Test Post | Just a Test Post #10 | any of the previous,
prefixed by "Re:", etc.
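The passes described above can be sketched outside Pan to check the logic before translating it into filter rules. This is a rough Python illustration, not a Pan recipe, and it makes several assumptions: a leading "Re:" is stripped first, "ALL CAPS" means at least one upper-case letter and no lower-case ones, and "non-alpha, printable" is read as punctuation (digits are allowed, since "#10" is meant to pass). The function name looks_normal and the max_punct threshold are invented for the example.

```python
import re

# One or more leading "Re:" prefixes, in any letter case.
RE_PREFIX = re.compile(r"^(?:re:\s*)+", re.IGNORECASE)

# Pass 1: at least one upper-case letter and no lower-case letters.
ALL_CAPS = re.compile(r"[^a-z]*[A-Z][^a-z]*")

# Pass 2: any character that is not a letter, digit, underscore,
# or whitespace counts as punctuation.
PUNCT = re.compile(r"[^\w\s]")

def looks_normal(subject, max_punct=1):
    """Return True if the subject survives both passes."""
    body = RE_PREFIX.sub("", subject.strip())
    if ALL_CAPS.fullmatch(body):
        return False          # shouting: dump regardless of the rest
    if len(PUNCT.findall(body)) > max_punct:
        return False          # two or more punctuation characters
    return True

for s in ["Just a test post", "Just a Test Post #10",
          "Re: Just a test post", "A DOG & CAT",
          "THAT DOG?? STINKS!!!"]:
    print(looks_normal(s), repr(s))
```

The first three subjects pass and the last two are dumped. Each pass is a single regex test, so in principle each could become its own line in an 'Edit Filter' rule, provided Pan's engine accepts the same syntax.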

Not exactly English, but I hope this makes some sense, as I can't
think of a better way to explain it at the moment (and I've been going
at this for several hours).

I could do this in assembly with table lookups, a compare, and some 
conditional jumps, but this stuff almost makes me regret ever building
my first 4004 <g>

My thanks to everyone who has offered suggestions.

Mike (still puttering away at it)

<snip>
-- 
Professionals built the Titanic ...
   Amateurs built the Ark
Forever, oh Lord, Thy Word is Settled in Heaven. Psalms 119:89
sola fide, sola Scriptura. sola gratia, sola Christo





