Re: Support for '.*?' meaning leftmost-shortest per latest POSIX ERE spec
From: arnold
Subject: Re: Support for '.*?' meaning leftmost-shortest per latest POSIX ERE spec
Date: Tue, 20 Aug 2024 05:02:38 -0600
User-agent: Heirloom mailx 12.5 7/5/10

OK, much thanks.
Ed Morton <mortoneccc@comcast.net> wrote:
> I posted that question at bug-gnulib yesterday:
>
> https://lists.gnu.org/archive/html/bug-gnulib/2024-08/msg00122.html
>
> No response yet, I'll let you know if/when I hear anything.
>
> Ed.
>
> On 8/20/2024 1:41 AM, arnold@skeeve.com wrote:
> > It adds considerable complexity into the regexp matchers. Doing this is
> > (way) beyond my capabilities.
> >
> > Please ask the Gnulib guys about it (and let me know what they say).
> > As all of gawk, GNU grep and GNU sed use the routines from Gnulib,
> > this feature won't be available until they add it.
> >
> > Arnold
> >
> > Ed Morton <mortoneccc@comcast.net> wrote:
> >
> >> Thanks for the quick response, just curious - what makes it a bad
> >> addition, is it extra complexity or worse performance or something else?
> >>
> >> Ed.
> >>
> >> On 8/19/2024 8:26 AM, arnold@skeeve.com wrote:
> >>> Hi.
> >>>
> >>> I am aware of it. Support for this feature can't happen unless and until
> >>> GNU regex and GNU dfa, which are both part of Gnulib, support it.
> >>> So you might consider asking on the bug-gnulib list what their plans
> >>> are for it.
> >>>
> >>> EVEN if those libraries support this feature, I may not add it;
> >>> I think this was a bad addition, and I'm quite certain that it's not on
> >>> the radar screen for almost any other version of awk.
> >>>
> >>> Realistically, I wouldn't expect to see this appear any time soon.
> >>>
> >>> Arnold
> >>>
> >>> Ed Morton via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:
> >>>
> >>>> Configuration Information [Automatically generated, do not change]:
> >>>> Machine: x86_64
> >>>> OS: cygwin
> >>>> Compiler: gcc
> >>>> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
> >>>> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
> >>>> --param=ssp-buffer-size=4
> >>>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
> >>>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
> >>>> -DNDEBUG
> >>>> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64
> >>>> 2024-04-03 17:25 UTC x86_64 Cygwin
> >>>> Machine Type: x86_64-pc-cygwin
> >>>>
> >>>> Gawk Version: 5.3.0
> >>>>
> >>>> Attestation 1:
> >>>> I have read
> >>>> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> >>>> Yes
> >>>>
> >>>> Attestation 2:
> >>>> I have not modified the sources before building gawk.
> >>>> True
> >>>>
> >>>> Description:
> >>>> The latest POSIX ERE spec
> >>>> (https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_04_06)
> >>>> says:
> >>>> ----
> >>>> Each of the duplication symbols ('+', '*', '?', and intervals) can
> >>>> be suffixed by the repetition modifier '?' (<question-mark>), in which
> >>>> case matching behavior for that repetition shall be changed from the
> >>>> leftmost longest possible match to the leftmost shortest possible match,
> >>>> including the null match (see A.9 Regular Expressions). For example,
> >>>> the ERE ".*c" matches up to and including the last character ('c') in
> >>>> the string "abc abc", whereas the ERE ".*?c" matches up to and including
> >>>> the first character 'c', the third character in the string.
> >>>> ----
> >>>> Gawk doesn't do that (yet) but I assume you're already aware of it
> >>>> and so this is probably more of a "do you plan to support it and, if so,
> >>>> what's the current target release?" than a real bug report.
> >>>>
> >>>> Repeat-By:
> >>>>
> >>>> $ echo 'abc abc' | awk '{sub(/b.*c/,"")} 1'
> >>>> a
> >>>>
> >>>> $ echo 'abc abc' | awk '{sub(/b.*?c/,"")} 1'
> >>>> a
> >>>>
> >>>> The output of the first command above is correct, but the second
> >>>> command should output "a abc" per the new POSIX spec.
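
Until the matchers support the '?' repetition modifier, the leftmost-shortest result from the report above can often be approximated with a negated bracket expression. The following is only a sketch, and it assumes the pattern ends at a single terminating character ('c' here); it is not a general replacement for '.*?'. The function name strip_shortest is made up for illustration and is not part of gawk.

```shell
# Emulate the effect of /b.*?c/ using a plain ERE: [^c]* cannot run past
# the first 'c', so the match ends at the nearest terminator.
# strip_shortest is a hypothetical helper name used only for this example.
strip_shortest() {
    printf '%s\n' "$1" | awk '{ sub(/b[^c]*c/, "") } 1'
}

strip_shortest 'abc abc'
```

With the input from the report this should print "a abc", the result the new POSIX wording asks for from /b.*?c/. When the terminator is longer than one character this trick does not carry over directly.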