bug-gawk


From: arnold
Subject: Re: Support for '.*?' meaning leftmost-shortest per latest POSIX ERE spec
Date: Tue, 20 Aug 2024 05:02:38 -0600
User-agent: Heirloom mailx 12.5 7/5/10

OK, much thanks.

Ed Morton <mortoneccc@comcast.net> wrote:

> I posted that question at bug-gnulib yesterday:
>
> https://lists.gnu.org/archive/html/bug-gnulib/2024-08/msg00122.html
>
> No response yet, I'll let you know if/when I hear anything.
>
>      Ed.
>
> On 8/20/2024 1:41 AM, arnold@skeeve.com wrote:
> > It adds considerable complexity into the regexp matchers. Doing this is
> > (way) beyond my capabilities.
> >
> > Please ask the Gnulib guys about it (and let me know what they say).
> > As all of gawk, GNU grep and GNU sed use the routines from Gnulib,
> > this feature won't be available until they add it.
> >
> > Arnold
> >
> > Ed Morton <mortoneccc@comcast.net> wrote:
> >
> >> Thanks for the quick response, just curious - what makes it a bad
> >> addition, is it extra complexity or worse performance or something else?
> >>
> >>       Ed.
> >>
> >> On 8/19/2024 8:26 AM, arnold@skeeve.com wrote:
> >>> Hi.
> >>>
> >>> I am aware of it.  Support for this feature can't happen unless and until
> >>> GNU regex and GNU dfa, which are both part of Gnulib, support it.
> >>> So you might consider asking on the bug-gnulib list what their plans
> >>> are for it.
> >>>
> >>> EVEN if those libraries support this feature, I may not add it;
> >>> I think this was a bad addition, and I'm quite certain that it's not on
> >>> the radar screen for almost any other version of awk.
> >>>
> >>> Realistically, I wouldn't expect to see this appear any time soon.
> >>>
> >>> Arnold
> >>>
> >>> Ed Morton via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:
> >>>
> >>>> Configuration Information [Automatically generated, do not change]:
> >>>> Machine: x86_64
> >>>> OS: cygwin
> >>>> Compiler: gcc
> >>>> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
> >>>> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
> >>>> --param=ssp-buffer-size=4
> >>>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
> >>>> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
> >>>> -DNDEBUG
> >>>> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64
> >>>> 2024-04-03 17:25 UTC x86_64 Cygwin
> >>>> Machine Type: x86_64-pc-cygwin
> >>>>
> >>>> Gawk Version: 5.3.0
> >>>>
> >>>> Attestation 1:
> >>>>        I have read
> >>>> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> >>>>        Yes
> >>>>
> >>>> Attestation 2:
> >>>>        I have not modified the sources before building gawk.
> >>>>        True
> >>>>
> >>>> Description:
> >>>>        The latest POSIX ERE spec
> >>>> (https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_04_06)
> >>>> says:
> >>>>        ----
> >>>>        Each of the duplication symbols ('+', '*', '?', and intervals) can
> >>>> be suffixed by the repetition modifier '?' (<question-mark>), in which
> >>>> case matching behavior for that repetition shall be changed from the
> >>>> leftmost longest possible match to the leftmost shortest possible match,
> >>>> including the null match (see A.9 Regular Expressions). For example,
> >>>> the ERE ".*c" matches up to and including the last character ('c') in
> >>>> the string "abc abc", whereas the ERE ".*?c" matches up to and including
> >>>> the first character 'c', the third character in the string.
> >>>>        ----
> >>>>        Gawk doesn't do that (yet) but I assume you're already aware of it
> >>>> and so this is probably more of a "do you plan to support it and, if so,
> >>>> what's the current target release?" than a real bug report.
> >>>>
> >>>> Repeat-By:
> >>>>
> >>>>        $ echo 'abc abc' | awk '{sub(/b.*c/,"")} 1'
> >>>>        a
> >>>>
> >>>>        $ echo 'abc abc' | awk '{sub(/b.*?c/,"")} 1'
> >>>>        a
> >>>>
> >>>>        The first command above is correct, but the second should
> >>>> output "a abc" per the new POSIX spec.
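
Until the Gnulib matchers support the '?' repetition modifier, the Repeat-By case above can be approximated in current awk with a negated bracket expression. This is only a sketch, not something proposed in the thread: the function name strip_shortest is hypothetical, and the trick works only because the terminator here is the single character 'c'. It is not a general replacement for leftmost-shortest matching.

```shell
# Hypothetical helper (not from the thread): emulate the shortest match of
# /b.*?c/ by forbidding 'c' inside the repetition, so the match must stop
# at the first 'c' after the leftmost 'b'.
strip_shortest() {
    awk '{ sub(/b[^c]*c/, "") } 1'
}

echo 'abc abc' | strip_shortest
# prints: a abc
```

For a multi-character terminator the bracket-expression approach no longer applies directly, which is part of why native support would have to come from the regexp library itself.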
