Accepting [xyz---abc] - three minus signs to mean one
From: Arnold Robbins
Subject: Accepting [xyz---abc] - three minus signs to mean one
Date: Thu, 21 Apr 2022 10:57:45 +0300
User-agent: Heirloom mailx 12.5 6/20/10
Greetings.
Way back in May of 2015, Nelson Beebe submitted the following
bug report for gawk:
> Date: Mon, 25 May 2015 14:21:04 -0600 (MDT)
> From: "Nelson H. F. Beebe" <beebe@math.utah.edu>
> To: "Arnold Robbins" <arnold@skeeve.com>
> Cc: beebe@math.utah.edu
> Subject: gawk-4.1.3 regexp error
>
> I just ran an old (1996--date) awk program with gawk-4.1.3 and got an
> error that can be exhibited like this:
>
> % gawk '/[^0-9---]/ {print}'
> gawk: cmd. line:1: error: tent of \{\}: /[^0-9---]/
>
> As far as I can see, that is a perfectly valid range expression, and
> using three hyphens to represent one hyphen is the traditional way
> to incorporate a hyphen in the expression.
The upshot was that regex didn't support this, and I didn't (at the
time) want to tackle trying to fix it. (I did fix the error message,
at least.)
I submitted a bug report about it (bug#20657). At the time, Paul Eggert said the following:
> Date: Mon, 25 May 2015 23:53:31 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> To: arnold@skeeve.com, 20657@debbugs.gnu.org
> Subject: Re: bug#20657: Traditional range expression not accepted in regex/dfa
>
> arnold@skeeve.com wrote:
>
> > The bugaboo here is the "---"; it's
> > a range expression consisting of minus through minus, and apparently long
> > ago was how one got a minus into a bracket expression.
>
> Actually, long ago expressions like '[^0-9-]' worked just as they do now,
> and it wasn't ever necessary to use trailing "---". That being said,
> it is true that in 7th Edition Unix '[^0-9---]' meant the same thing as
> '[^0-9-]', so in that sense we have an incompatibility with 7th Edition
> Unix here.
>
> > $ ./src/grep '[^0-9---]' /dev/null
> > ./src/grep: Invalid range end
> >
> > The underlying regex and, I believe, dfa routines don't accept this.
>
> Yes, that's correct. It's not a bug, though, as the regexp is ambiguous
> and does not conform to POSIX, which says the following about RE
> bracket expressions: "To use a <hyphen> as the starting range point,
> it shall either come first in the bracket expression or be specified
> as a collating symbol; for example, "[][.-.]-0]", which matches either
> a <right-square-bracket> or any character or collating element that
> collates between <hyphen> and 0, inclusive."
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05>
>
> In your correspondent's example, the hyphen is a starting range point
> but is neither first in the bracket expression nor is specified as a
> collating symbol, so the regexp doesn't conform to POSIX.
>
> Even though it's not a bug I suppose it wouldn't hurt to make the GNU
> matchers compatible with 7th Edition Unix here, if someone really wants
> to take that task on; it's not urgent, though.
I had some time yesterday, and feeling brave and a little stronger in
The Force than usual, I came up with the attached patch. It doesn't
break any of my tests.
As far as my testing indicates, dfa.c doesn't need a patch; it seems
to accept "---" inside brackets for a single minus.
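To illustrate the reading under discussion, here is a minimal Python sketch that expands the inside of a bracket expression and treats "---" as a single minus. It is not the attached patch and does not reflect the regex or dfa.c code; the function name expand_bracket and the rule used to spot "---" are assumptions made purely for this illustration.

```python
def expand_bracket(body):
    """Expand the inside of a bracket expression such as "^0-9---".

    Returns (negated, members), where members is a set of characters.
    Hypothetical rules for this sketch only:
      * "---" stands for a single '-' (a range from minus through minus);
      * "x-y" is an inclusive range, unless the '-' begins a "---" run;
      * any other character, including a trailing '-', is a literal.
    """
    negated = body.startswith("^")
    if negated:
        body = body[1:]

    members = set()
    i = 0
    while i < len(body):
        if body.startswith("---", i):
            # Three minus signs: minus through minus, i.e. one minus.
            members.add("-")
            i += 3
        elif (i + 2 < len(body) and body[i + 1] == "-"
              and not body.startswith("---", i + 1)):
            # Ordinary range such as "0-9".
            lo, hi = body[i], body[i + 2]
            if ord(lo) > ord(hi):
                raise ValueError("Invalid range end")
            members.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            # Plain literal character.
            members.add(body[i])
            i += 1
    return negated, members


if __name__ == "__main__":
    # Under these rules both spellings describe the same set.
    print(expand_bracket("^0-9---") == expand_bracket("^0-9-"))
```

Under these assumed rules, "[^0-9---]" and "[^0-9-]" expand to the same set, and "[xyz---abc]" contains x, y, z, a, b, c and a minus.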
If there are no objections, can we get this into Gnulib?
Thanks,
Arnold
Attachment: 3minus.diff
Description: Text Data