Re: Support for '.*?' meaning leftmost-shortest per latest POSIX ERE spec
From: arnold
Subject: Re: Support for '.*?' meaning leftmost-shortest per latest POSIX ERE spec
Date: Tue, 20 Aug 2024 00:41:32 -0600
User-agent: Heirloom mailx 12.5 7/5/10

It adds considerable complexity to the regexp matchers. Doing this is
(way) beyond my capabilities.

Please ask the Gnulib guys about it (and let me know what they say).
As all of gawk, GNU grep and GNU sed use the routines from Gnulib,
this feature won't be available until they add it.

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

> Thanks for the quick response, just curious - what makes it a bad
> addition, is it extra complexity or worse performance or something else?
>
> Ed.
>
> On 8/19/2024 8:26 AM, arnold@skeeve.com wrote:
> > Hi.
> >
> > I am aware of it. Support for this feature can't happen unless and until
> > GNU regex and GNU dfa, which are both part of Gnulib, support it.
> > So you might consider asking on the bug-gnulib list what their plans
> > are for it.
> >
> > EVEN if those libraries support this feature, I may not add it;
> > I think this was a bad addition, and I'm quite certain that it's not on
> > the radar screen for almost any other version of awk.
> >
> > Realistically, I wouldn't expect to see this appear any time soon.
> >
> > Arnold
> >
> > Ed Morton via "Bug reports only for gawk."<bug-gawk@gnu.org> wrote:
> >
> >> Configuration Information [Automatically generated, do not change]:
> >> Machine: x86_64
> >> OS: cygwin
> >> Compiler: gcc
> >> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
> >> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
> >> --param=ssp-buffer-size=4
> >> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
> >> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
> >> -DNDEBUG
> >> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64
> >> 2024-04-03 17:25 UTC x86_64 Cygwin
> >> Machine Type: x86_64-pc-cygwin
> >>
> >> Gawk Version: 5.3.0
> >>
> >> Attestation 1:
> >> I have read
> >> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> >> Yes
> >>
> >> Attestation 2:
> >> I have not modified the sources before building gawk.
> >> True
> >>
> >> Description:
> >> The latest POSIX ERE spec
> >> (https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_04_06)
> >> says:
> >> ----
> >> Each of the duplication symbols ('+', '*', '?', and intervals) can
> >> be suffixed by the repetition modifier '?' (<question-mark>), in which
> >> case matching behavior for that repetition shall be changed from the
> >> leftmost longest possible match to the leftmost shortest possible match,
> >> including the null match (see A.9 Regular Expressions ). For example,
> >> the ERE ".*c" matches up to and including the last character ('c') in
> >> the string "abc abc", whereas the ERE ".*?c" matches up to and including
> >> the first character 'c', the third character in the string.
> >> ----
> >> Gawk doesn't do that (yet) but I assume you're already aware of it
> >> and so this is probably more of a "do you plan to support it and, if so,
> >> what's the current target release?" than a real bug report.
> >>
> >> Repeat-By:
> >>
> >> $ echo 'abc abc' | awk '{sub(/b.*c/,"")} 1'
> >> a
> >>
> >> $ echo 'abc abc' | awk '{sub(/b.*?c/,"")} 1'
> >> a
> >>
> >> "1" above is correct but "2" should output "a abc" per the new
> >> POSIX spec.
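
For readers who need the leftmost-shortest result from the Repeat-By
example today, here is a minimal sketch of a common workaround. It is an
assumption on my part, not something proposed in this thread: replace
'.*?c' with the negated bracket expression '[^c]*c', which cannot run
past the first 'c'. The function name strip_shortest is hypothetical, and
the trick only works when the terminator is a single character.

```shell
# Hypothetical helper: delete the shortest "b...c" span on each input line.
# [^c]*c stops at the first 'c', emulating what /b.*?c/ would match under
# the new POSIX wording. Works only for a single-character terminator.
strip_shortest() {
    awk '{ sub(/b[^c]*c/, "") } 1'
}

echo 'abc abc' | strip_shortest
```

With the input from the report this prints "a abc", the result the second
Repeat-By command is expected to produce under the new spec.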