gensub RE problem?
From: Jim Hart
Subject: gensub RE problem?
Date: Fri, 6 Sep 2002 09:37:07 -0400
I'm running GNU Awk 3.1.0, compiled from source on Darwin.
I believe I've found a problem in how gensub handles regular
expression matching. The string to be matched against is:
<div class="bodytext"> <a href="/2
/hi/science/nature/2212629.stm"><img height="120" hspace="5" vspace="0"
width="1
00" border="0" src="/media/images/38213000/jpg/_38213850_websmall.jpg"
align="le
ft"></a>
<a href=
"/2/hi/science/nature/2212629.stm"><span class="h1">Global body needed
to fight
poverty</span></a><br> A new global body for economic
developme
nt is needed, says The Lancet in the lead-up to the Sustainable
Development Summ
it.<br clear="ALL"> </div>
The gensub command:
gensub(/.*<br>([^<]*)</br.*|.*/,"\\1",1,itemString)
returns:
A new global body for economic developme
nt is needed, says The Lancet in the lead-up to the Sustainable
Development Summ
it.
as one would expect. Whereas:
gensub(/.*<br>(.*)</br.*|.*/,"\\1",1,itemString)
returns null. And:
gensub(/.*<br>( *)([^<]*)</br.*|.*/,"\\1",1,itemString)
returns only one space, not the many that follow the <br>. And:
gensub(/.*<br>( *)([^<]*)</br.*|.*/,"\\2",1,itemString)
returns the same thing as the first example, with all the leading spaces
included.
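For anyone who wants to try this on another gawk build, a reduced reproduction might look like the sketch below. The short test string, the `show` helper, and the bracketed output format are all made up for illustration; they are not from the original session. The closing anchor is written as `<br` rather than `</br`, on the assumption that this was the intended pattern: the sample string contains no `</br`, and a bare `/` would end an awk regex literal. The patterns are passed in as strings, so gawk treats them as dynamic regexps.

```sh
#!/bin/sh
# Reduced input (an assumption, not the poster's exact string): one literal
# <br>, several blanks, some text, then a second <br ...> tag.
s='<a href="x">t</a><br>   A new body.<br clear="ALL"> </div>'

# show RE GROUP
# Prints what gensub() returns when the first match of RE in $s is replaced
# by \GROUP. The result is bracketed so leading blanks are visible.
show() {
    gawk -v s="$s" -v re="$1" -v grp="$2" \
        'BEGIN { printf "%s  \\%s -> [%s]\n", re, grp, gensub(re, "\\" grp, 1, s) }'
}

show '.*<br>([^<]*)<br.*|.*'     1
show '.*<br>(.*)<br.*|.*'        1
show '.*<br>( *)([^<]*)<br.*|.*' 1
show '.*<br>( *)([^<]*)<br.*|.*' 2
```

Running the same script with the trailing `|.*` alternative removed from each pattern may help show whether the alternation is what changes the subexpression results.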
The man page re_format says:
"In the event that an RE could match more than one substring
of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one substring
starting at that point, it matches the longest. Subexpressions
also match the longest possible substrings, subject to the
constraint that the whole match be as long as possible, with
subexpressions starting earlier in the RE taking priority over
ones starting later."
Note the last part: the earlier subexpression should extend to its
maximum length. Yet gensub appears to return the shortest possible
match for ( *) and (.*), not the longest.
Comments? Opinions? Did I miss something? Is gawk just calling OS
routines so the problem is actually in Darwin?