Re: gawk/POSIX regex metacharacter bug
From: Aharon Robbins
Subject: Re: gawk/POSIX regex metacharacter bug
Date: Thu, 10 Jun 2004 14:11:32 +0300
Greetings.
I am catching up on my gawk work. With regard to the email quoted below,
I note the following text in gawk-3.1.3/doc/gawk.texi, starting at
line 3194:
| @cindex interval expressions
| @item @{@var{n}@}
| @itemx @{@var{n},@}
| @itemx @{@var{n},@var{m}@}
| One or two numbers inside braces denote an @dfn{interval expression}.
| If there is one number in the braces, the preceding regexp is repeated
| @var{n} times.
| If there are two numbers separated by a comma, the preceding regexp is
| repeated @var{n} to @var{m} times.
| If there is one number followed by a comma, then the preceding regexp
| is repeated at least @var{n} times:
|
| @table @code
| @item wh@{3@}y
| Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.
|
| @item wh@{3,5@}y
| Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy}, only.
|
| @item wh@{2,@}y
| Matches @samp{whhy} or @samp{whhhy}, and so on.
| @end table
|
| @cindex POSIX @command{awk}, interval expressions in
| Interval expressions were not traditionally available in @command{awk}.
| They were added as part of the POSIX standard to make @command{awk}
| and @command{egrep} consistent with each other.
|
| @cindex @command{gawk}, interval expressions and
| However, because old programs may use @samp{@{} and @samp{@}} in regexp
| constants, by default @command{gawk} does @emph{not} match interval
| expressions in regexps. If either @option{--posix} or @option{--re-interval}
| are specified (@pxref{Options}), then interval expressions
| are allowed in regexps.
|
| For new programs that use @samp{@{} and @samp{@}} in regexp constants,
| it is good practice to always escape them with a backslash. Then the
| regexp constants are valid and work the way you want them to, using
| any version of @command{awk}.@footnote{Use two backslashes if you're
| using a string constant with a regexp operator or function.}
This seems pretty clear to me. What about this makes it seem that
gawk's treatment of { and } depending upon the use/absence of --posix is
"undocumented"?
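
For what it's worth, the advice about escaping braces, including the
note about two backslashes in string constants, can be checked with a
short sketch like the one below. It assumes some awk is on the PATH and
reuses the one-line {s} input from the report quoted below; the shell
variable names are made up for illustration.

```shell
# Sample input from the quoted report: a single line containing {s}.
line='{s}'

# Regexp constant: one backslash before each brace makes it a literal.
out_const=$(printf '%s\n' "$line" | awk '/\{.\}/')

# String constant used as a dynamic regexp: two backslashes are needed,
# because awk's string processing consumes one level of escaping
# before the regexp matcher sees the pattern.
out_string=$(printf '%s\n' "$line" | awk '$0 ~ "\\{.\\}"')

printf 'constant: %s\nstring:   %s\n' "$out_const" "$out_string"
```

Both forms should print the input line, whether or not interval
expressions are enabled.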
Arnold Robbins
> Date: Sun, 28 Sep 2003 13:36:04 -0700 (PDT)
> From: Shawn Smout <address@hidden>
> Subject: gawk/POSIX regex metacharacter bug
> To: address@hidden
>
> I am running Slackware 9.1 with Linux 2.4.22 on a
> Pentium 4. My gawk version is 3.1.3, and I am
> reasonably certain it was compiled with gcc 3.2.3.
>
> Gawk apparently handles metacharacters specially based
> on context normally, but does not in POSIX
> compatibility mode. This is not listed in the
> documentation (info or man) as one of the POSIX/GNU
> differences.
>
> For this example, the file "file" contains one line:
> {s}
>
> Ordinarily,
> gawk '/{.}/' file
> will print:
> {s}
> However,
> gawk --posix '/{.}/' file
> fails with an invalid regular expression error.
> Apparently gawk normally decides based on context
> whether the {} characters are metacharacters or
> literal characters; since they are not valid as
> metacharacters in this example, gawk interprets them
> as literal characters. In POSIX mode, gawk does not
> change its interpretation of the metacharacters based
> on context.
>
> The correct POSIX awk syntax is
> awk '/\{.\}/' file
> with the metacharacters escaped so they are
> interpreted as literals. This prints
> {s}
> This syntax works in gawk in both normal and POSIX
> modes.
>
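
A minimal sketch of that last point, assuming an awk on the PATH
(substitute gawk to exercise its POSIX mode specifically): it runs the
escaped pattern once with POSIXLY_CORRECT unset and once with it set.
The variable names are illustrative only.

```shell
# One-line input, as in the report: {s}
input='{s}'

# Escaped braces with POSIXLY_CORRECT explicitly unset for this run.
out_default=$(printf '%s\n' "$input" | (unset POSIXLY_CORRECT; awk '/\{.\}/'))

# Same pattern with POSIXLY_CORRECT set, which gawk treats like --posix.
out_posix=$(printf '%s\n' "$input" | POSIXLY_CORRECT=1 awk '/\{.\}/')

printf 'default:         %s\nPOSIXLY_CORRECT: %s\n' "$out_default" "$out_posix"
```

Because the braces are escaped, neither run depends on how the awk in
use treats an unescaped { in a regexp.
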
> The problem here is not the discrepancy between normal
> and POSIX modes; I am fully aware that most such
> discrepancies are deliberate. However, this
> particular one is not documented, which is a major
> problem. I discovered this gawk issue while compiling
> third-party software (specifically, the ALSA drivers)
> that uses gawk. I had the POSIXLY_CORRECT environment
> variable set, which causes gawk to behave in POSIX
> mode, and the compilation failed; it took me a long
> time to figure out why. This problem may never have
> existed if the discrepancy was documented; even if it
> did exist, it would then become the fault of the
> developers for either (a) not checking the
> documentation and making sure their code was
> compatible with either mode of gawk, or (b) not
> informing the user that gawk needed to run in
> non-compatible mode. However, it was not documented,
> so there was nothing the developers could have done
> about it.
>
> It is bad enough that so much GNU software allows lax
> syntax like this. Allowing context-based
> interpretation of metacharacters doesn't add any
> functionality at all, because the developer can always
> escape the metacharacters to achieve the same result;
> it only allows harmful ambiguity, which in turn causes
> hard-to-find bugs that never should have been there to
> start out with. If we are ever to have good bug-free
> code, we should try to eliminate ambiguity, not
> promote it. However, I would consider the ambiguity
> tolerable in the software of others who choose to use
> it, if it were documented properly.