bug-gnu-utils

Re: gawk: {} repetition in patterns doesn't work?


From: Paul Eggert
Subject: Re: gawk: {} repetition in patterns doesn't work?
Date: Thu, 22 Mar 2001 12:01:56 -0800 (PST)

> Date: Thu, 22 Mar 2001 13:54:20 +0200
> From: Aharon Robbins <address@hidden>

> > gawk should do what GNU grep does: namely, support the POSIX
> > requirement only when it is absolutely required, and otherwise treat
> > stray braces as literal braces.
> 
> Gawk doesn't work this way, and I disagree that it should.  Right now,
> you must use one of --posix, --re-interval or setting POSIXLY_CORRECT
> in the environment to get interval expressions to work.  I think having
> /a{2}/  be an interval expression but  /{.*}/ be literal is confusing.

More confusing than having /a*/ be a closure expression but /*/ be
literal, which is what gawk currently does?  (:-)

I understand and sympathize with your desire to not support { by
default.  However, there is definitely some confusion here, on several
levels.  For starters, the gawk manual Introduction says:

  `gawk' is also upward compatible with the POSIX specification of the
  `awk' language.  This means that all properly written `awk' programs
  should work with `gawk'.

But tt's program is a counterexample to this statement.
That was why I suggested that gawk follow GNU egrep's lead.

For many years, GNU egrep also failed to conform to POSIX in this
area, because many users wanted egrep '{' to have its traditional
meaning.  However, this became more and more of a problem as
programmers began to depend on the POSIX-required semantics.  In 1999
GNU grep was changed to comply with POSIX when POSIX demands it, but
otherwise support the traditional behavior as an extension.  This
behavior is admittedly more complicated, but it falls squarely within
grep's (and awk's) tradition of treating "malformed" REs like /*/ as
literal characters, and it works quite well in practice.  I've never
heard of a traditional ERE being misinterpreted by GNU egrep's current
rule.
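
As an illustration only, here is a minimal Python sketch of the kind of
rule described above.  It is not GNU grep's actual code.  The function
name classify_braces is hypothetical, and the sketch makes two
simplifying assumptions: only the POSIX forms {m}, {m,} and {m,n} count
as intervals, and bracket expressions such as [{] are not handled.  A
'{' is treated as an interval only when it starts a well-formed
interval and has something before it to repeat; otherwise it is literal.

```python
import re

# POSIX interval forms only: {m}, {m,} and {m,n}  (assumption of this sketch)
_INTERVAL = re.compile(r"\{\d+(,\d*)?\}")


def classify_braces(pattern):
    """Return a list of (index, kind) for each unescaped '{' in pattern.

    kind is "interval" when the brace starts a well-formed interval that
    has an operand before it, and "literal" otherwise.
    """
    result = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            m = _INTERVAL.match(pattern, i)
            # An interval needs something to repeat: it cannot be first,
            # and it cannot directly follow '(' or '|'.
            has_operand = i > 0 and pattern[i - 1] not in "(|"
            if m and has_operand:
                result.append((i, "interval"))
                i = m.end()
                continue
            result.append((i, "literal"))
        i += 1
    return result


print(classify_braces("a{2}"))   # the brace follows an operand
print(classify_braces("{.*}"))   # stray brace at the start
print(classify_braces("{1"))     # incomplete interval
```

Under this sketch the brace in a{2} is classified as an interval, while
the leading braces in {.*} and {1 are classified as literal, which is
the distinction the rule is meant to draw.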

However, gawk is a bit trickier than GNU egrep was, because it has
further confusion in a different area: the use of POSIXLY_CORRECT.
With other GNU utilities, POSIXLY_CORRECT disables extensions that are
incompatible with POSIX.  However, gawk uses POSIXLY_CORRECT to
disable compatible extensions, which is quite a different animal.
This behavior is contrary to what the GNU coding standards say about
POSIXLY_CORRECT.

To add to the confusion, POSIXLY_CORRECT doesn't disable _all_ extensions
to POSIX, only some of them.  For example, it doesn't disable /\`/
or /\A/, both of which are extensions to POSIX.  And --posix changes
the meaning of /*/ without disabling it (!).

All in all, there is plenty of confusion here:

* gawk and GNU egrep disagree about '{'.
* gawk disagrees with the rest of the world about POSIXLY_CORRECT.
* gawk's --posix option doesn't disable all POSIX extensions, only
  some of them, and sometimes it silently changes the meaning of
  extensions.

Here is a suggestion for how to fix things:

1. Change gawk to follow egrep's lead in supporting interval
   expressions only when POSIX requires it.  This will remove gawk's
   only incompatibility with POSIX.

2. Change gawk so that POSIXLY_CORRECT has no effect.  Combined with (1),
   this will make gawk conform to the GNU coding standards.

3. Change --posix so that it disables all extensions to POSIX regular
   expressions, not just some of them.

4. Remove the --re-interval option; it's no longer needed because of (1).

Other fixes are possible, but I think this is the simplest one.
If you like, I can propose a patch along these lines, but I thought
I'd get your feedback before doing any hacking in this area.


