Re: [Pan-users] Select Headers with RE in the Subject/Author Entryfield?


From: Duncan
Subject: Re: [Pan-users] Select Headers with RE in the Subject/Author Entryfield?
Date: Mon, 8 Jun 2015 11:31:48 +0000 (UTC)
User-agent: Pan/0.140 (Chocolate Salty Balls; GIT af87825)

Heinz Mezera posted on Mon, 08 Jun 2015 09:16:22 +0200 as excerpted:

> I'd like to select Headers in the Header-Pan with a regular expression
> in the Subject/Author field and need your help. Is this possible and how
> do I do it.
> 
> I want to select all headers
> - starting with three alphabetic characters
> - followed by an underscore
> - two digits after the underscore
> - and any number of characters afterwards.
> 
> PAN Info:
> Pan 0.139 Sexual Chocolate (GIT bf56508 git://git.gnome.org/pan2;
> i686-pc-linux-gnu)

** Note that after changing the search expression, you may have to toggle 
to something else (say subject), then back to regex, in order to get it 
to "take".  I noticed it would dynamically refilter part of the time, but 
would sometimes appear to stall out and not update without the toggle.  
Given that hint, and the caveat that I tested the components separately 
but not together, as I didn't have posts handy that matched that specific 
pattern...

One way to do it:

^[[:alpha:]]{3}_[[:digit:]]{2}.*$

^ = zero-width match at the beginning/left
$ = same at the end/right

Non-special characters match themselves.  Letters, digits, _, etc, are 
non-special.  

. matches exactly one occurrence of any character (and *, mentioned again 
below, is any number including zero, so .* is a full wildcard, including 
matching nothing). 

[] encloses a "character class".  Such character classes can include 
ranges of characters [a-z], individual lists [123], and/or named 
classes (POSIX calls these character class expressions) like the above, 
enclosed in further [:xxx:] marks, thus the nesting.

So [[:alpha:][:digit:]] and [a-zA-Z0-9] would both match alphanumeric 
characters in ASCII, tho pan's regex is case insensitive so both a-z and 
A-Z wouldn't be needed for pan, only one or the other.  You can also do 
things like [[:digit:]abc._], to match digits, abc, and the individual 
characters . and _.  The significance of the [:xxx:] matches, however, is 
that they work across character sets, so [:alpha:] matches letters that 
would be skipped in character-sets where a-z doesn't include all letters 
because of how that character set orders them.

To match a - in a character-class, put it at the beginning so it can't 
specify a range.  The \ char is the escape char, both inside and outside 
a character-class, so you can use \] to match a literal ] for instance, 
and of course \\ to match a literal \.

Additionally, you can specify a /negative/ character-class with ^ as the 
first character (outside a character-class, it means match the beginning, 
inside, as the first character of the class, it negates the class, inside 
as anything other than the first char, it matches itself normally).  So 
[^abc] means any character /but/ abc.
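The character-class rules above can be tried out in a few lines.  This is 
only a sketch in Python, not pan itself: Python's re module doesn't 
understand the [:xxx:] names, so plain ASCII ranges stand in for them, and 
the sample strings are made up for illustration.

```python
import re

# Hyphen first in the class, so it is a literal '-' rather than a range.
hyphen_or_digit = re.findall(r"[-0-9]", "a-1b2")

# Backslash escapes inside the class: a literal ']' or a literal '\'.
bracket_or_backslash = re.findall(r"[\]\\]", r"a]b\c")

# Leading ^ negates the class: any character except a, b or c.
not_abc = re.findall(r"[^abc]", "abcdef")

# ^ anywhere but first in the class is an ordinary character.
a_or_caret = re.findall(r"[a^]", "a^b")

# ASCII stand-in for [[:digit:]abc._]: digits, a, b, c, '.' and '_'.
mixed = re.findall(r"[0-9abc._]", "x9_b.z")

print(hyphen_or_digit, bracket_or_backslash, not_abc, a_or_caret, mixed)
```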

Significantly, character classes normally only match *ONE* character.  To 
match more than one you can repeat, [a-z][a-z] will match TWO letters, or 
use frequency specifiers inside of {} as I did, above.  {1,3} would be 
one, two, or three matches, {1,} would be at least one match.

In addition to the {}-delimited frequency range specifiers, there's:

* = zero or more (*NOT* one or more, it doesn't have to be there!)
? = zero or one (may or may not be there, but matches only once)
+ = 1 or more
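To see the frequency specifiers side by side, here's a small Python 
sketch (again not pan, and the helper name whole() is made up for the 
example).  It checks whether a pattern matches an entire string.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

results = {
    "{1,3} on 'ab'":   whole(r"[a-z]{1,3}", "ab"),     # one to three letters
    "{1,3} on 'abcd'": whole(r"[a-z]{1,3}", "abcd"),   # four is too many
    "{1,} on 'abcdef'": whole(r"[a-z]{1,}", "abcdef"), # at least one
    "* on 'ac'":       whole(r"ab*c", "ac"),           # zero b's is fine
    "? on 'abbc'":     whole(r"ab?c", "abbc"),         # at most one b
    "+ on 'ac'":       whole(r"ab+c", "ac"),           # needs at least one b
    "\\* on 'a*c'":    whole(r"a\*c", "a*c"),          # escaped literal *
}

for label, ok in results.items():
    print(label, ok)
```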

Again in case it didn't sink in above, \ is the escape char, so to match 
a literal *, you'd use \*

() are the grouping characters, and | indicates alternatives (or).  So 
((cat)|(horse)) will match "cat" or "horse" but will NOT match "cah", for 
instance.  Note that the alternatives do NOT need to be the same length, 
and that the inner groupings help clarify the scope of the match but 
aren't absolutely required, so (cat|horse) should have the same effect.  
So there are two ways to match a "cat" that may or may not be there:

(cat)?
(cat|)
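A quick way to convince yourself of the grouping and alternation behavior 
is a Python sketch like the following.  The word lists are invented, and 
the "fish" suffix is only there so the optional "cat" has something to 
attach to.

```python
import re

# Nested and flat grouping should accept the same words.
nested = re.compile(r"((cat)|(horse))")
flat = re.compile(r"(cat|horse)")
words = ["cat", "horse", "cah"]
nested_hits = [w for w in words if nested.fullmatch(w)]
flat_hits = [w for w in words if flat.fullmatch(w)]

# Two spellings of "a cat that may or may not be there".
optional_q = re.compile(r"^(cat)?fish$")
optional_alt = re.compile(r"^(cat|)fish$")
samples = ["catfish", "fish", "dogfish"]
q_hits = [s for s in samples if optional_q.search(s)]
alt_hits = [s for s in samples if optional_alt.search(s)]

print(nested_hits, flat_hits, q_hits, alt_hits)
```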

That's the basics.  FWIW for non-pan usage, some regex flavors (POSIX 
extended, as in egrep) make things like {} special characters, so {3} is 
a frequency and \{3\} are the literal characters, while others (POSIX 
basic, as in plain grep) don't unless they're escaped, so {3} would be 
the literal characters and the backslash-escaped version would be the 
frequency.  And of course the shell has its own special chars and \ 
escape char, so sometimes you need to play with the number of \\\ a bit 
in order to get it to work like you want, but once you understand the 
basics, even /just/ the basics, regex can really be quite powerful.

Of course there's far FAR more.  Just a couple quick examples.  First, () 
not only groups, but stores for later use.  So if for instance you are 
trying to match quotes but don't know if it's single-quotes or double-
quotes, you can use (['"]) for the first match (possibly as (['"])? or 
('|"|) if you don't know if it'll be quoted or not), and \1 or possibly 
$1 to automatically match the same thing at the other end of the quote.  
Second, there's what's called look-ahead and look-behind matching, which 
can be positive or negative.  So for instance if you want to match "pro" 
but not "gopro", there's a way to say "look behind (to the left of) the 
pro and don't match if the preceding letters are 'go'".  I don't use them 
enough to be sure of my memory, however, so generally have to look that 
sort of advanced stuff up, if I need it.  And for this advanced stuff, 
you usually have to either look up or test whether whatever you're trying 
to work with actually supports it or not.  I'm not sure whether pan does, 
for instance, tho it wouldn't surprise me if it did.
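For what it's worth, here's how those two advanced features look in 
Python's re module, which does support both.  This says nothing about 
whether pan supports them.  In Python patterns the backreference is 
spelled \1, and a negative look-behind is written (?<!...).  The sample 
strings are made up.

```python
import re

# (['"]) captures whichever quote opens; \1 demands the same one to close.
quoted = re.compile(r"""(['"])(.*?)\1""")

single = quoted.search("say 'hi' now")
double = quoted.search('say "hi" now')
mismatched = quoted.search("say 'hi\" now")  # opens with ', never closed

# Negative look-behind: "pro" only when not directly preceded by "go".
text = "gopro pro proper"
pro_starts = [m.start() for m in re.finditer(r"(?<!go)pro", text)]

print(single.group(2), double.group(2), mismatched, pro_starts)
```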

So back to the specific case in point:

^[[:alpha:]]{3}_[[:digit:]]{2}.*$

Given the above, we can parse that as:

^ Left anchor (begin the line with what follows):

[[:alpha:]] one alphabetic character

{3} match the previous exactly three times

_ (matches itself)

[[:digit:]] one digit

{2} match the previous exactly twice

. any character

* match the previous any number (including none) of times

$ right anchor (end of line)
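Since I couldn't test the whole expression against real posts, here's a 
rough stand-in test in Python.  Two assumptions to keep in mind: Python's 
re doesn't know [[:alpha:]] or [[:digit:]], so [a-z] and [0-9] replace 
them, and re.IGNORECASE is used to imitate pan's case-insensitive 
matching.  The subject lines are invented.

```python
import re

# ASCII equivalent of ^[[:alpha:]]{3}_[[:digit:]]{2}.*$
HEADER_RE = re.compile(r"^[a-z]{3}_[0-9]{2}.*$", re.IGNORECASE)

# Hypothetical subject lines, made up for illustration.
subjects = [
    "abc_12 some file (1/5)",
    "XYZ_07",
    "ab_12 too few letters",
    "abcd_12 too many letters",
    "abc_1x only one digit",
    "Re: abc_12 not at the start",
]

matched = [s for s in subjects if HEADER_RE.search(s)]
print(matched)
```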


Of course the .*$ aren't actually needed, since without them the match is 
simply left-anchored only, but I like the explicit "the rest of the line 
doesn't matter for the match" that .*$ provides.  And in non-pan usages 
where you're matching to delete or replace the match, it COULD matter, as 
failing to include the .*$ would leave any other junk on the line still 
there, while including it would match and thus delete/replace the entire 
line.
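The delete/replace difference is easy to show with Python's re.sub (a 
sketch only, with a made-up input line and the same ASCII stand-ins for 
the [:xxx:] classes as before):

```python
import re

line = "abc_12 trailing junk"

# Without .*$ only the prefix is replaced; the rest of the line survives.
prefix_only = re.sub(r"^[a-z]{3}_[0-9]{2}", "X", line, flags=re.IGNORECASE)

# With .*$ the match covers the entire line, so all of it is replaced.
whole_line = re.sub(r"^[a-z]{3}_[0-9]{2}.*$", "X", line, flags=re.IGNORECASE)

print(repr(prefix_only), repr(whole_line))
```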

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



