bug-gawk

Re: Use of '()' in a regexp


From: arnold
Subject: Re: Use of '()' in a regexp
Date: Sat, 09 Jan 2021 13:44:18 -0700
User-agent: Heirloom mailx 12.5 7/5/10

Code and doc have been updated in git.

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

> Sounds good. From testing how `split()` and setting `FS` behave it looks 
> like that rule applies to Field Separators in addition to Record 
> Separators as that would explain these differences:
>
> $ printf 'foo\n' | awk '{print gsub(/()/,"X")}1'
> 4
> XfXoXoX
>
> $ printf 'foo\n' | awk '{print split($0,a,/()/); for (i=1; i in a; i++) print a[i]}'
> 1
> foo
>
> I don't think that's documented anywhere currently; it may be worth a 
> brief statement in the manual, something like "if RS is a multi-char 
> regexp populated such that it would match a null string (e.g. `RS='()'`) 
> then ...." and an almost identical statement where field separator 
> values are described?
>
> Whatever you decide... thanks for quickly looking into and providing the 
> fix and the explanation!
>
>      Ed.
>
> On 1/7/2021 8:07 AM, arnold@skeeve.com wrote:
> > The answer is "no".  Record separators must be non-null; the only exception
> > where RT will be "" is at the end of a file.
> >
> > This is also how Brian Kernighan's awk handles RS as a regexp.
> >
> > Thanks,
> >
> > Arnold
> >
> > Ed Morton <mortoneccc@comcast.net> wrote:
> >
> >> In case that's not an adequate example, what I mean is, will this:
> >>
> >> $ printf 'foo\nbar\n' | awk -v RS='()' -v ORS='X' '1'
> >>
> >> then produce the same output as this:
> >>
> >> $ printf 'foo\nbar\n' | awk -v RS='^$' '{gsub(/()/,"X")}1'
> >> XfXoXoX
> >> XbXaXrX
> >> X
> >>
> >> or not and, if not, why is it different?
> >>
> >> I just noticed that this seems to handle `/()/` differently from either
> >> of the current cases again:
> >>
> >> $ printf 'foo\nbar\n' | awk '{nf=split($0,flds,/()/,seps); print nf; for (i=0; i<=nf; i++) printf "%s%s", flds[i], "<"seps[i]">" ; print ""}'
> >> 1
> >> <>foo<>
> >> 1
> >> <>bar<>
> >>
> >> Regards,
> >>
> >>       Ed.
> >>
> >> On 1/6/2021 2:54 PM, Ed Morton wrote:
> >>> Great! Will that treat `()` when used in an RS:
> >>>
> >>>      awk -v RS='()' -v ORS='x' '1'
> >>>
> >>> the same as it's treated in a regexp in other contexts such as with
> >>> gsub():
> >>>
> >>>      awk -v ORS= '{gsub(/()/,"x")} 1'
> >>>
> >>> or does it mean something different when used in an RS?
> >>>
> >>>      Ed.
> >>>
> >>> On 1/6/2021 1:33 PM, arnold@skeeve.com wrote:
> >>>> Hi. Re this:
> >>>>
> >>>> Ed Morton<mortoneccc@comcast.net>  wrote:
> >>>>
> >>>>> Someone just pointed this out to me (gawk 5.1.0):
> >>>>>
> >>>>> $ printf 'foo\n' | awk '{gsub(/()/,"x")} 1'
> >>>>> xfxoxox
> >>>>>
> >>>>> $ printf 'foo\n' | awk -v RS='()' -v ORS='x\n' '1'
> >>>>> foox
> >>>>>
> >>>>> Obviously that's a pretty ridiculous regexp but it still has me
> >>>>> wondering - why does `gsub()` treat the regexp `()` as matching a null
> >>>>> string around every character while `RS` treats it as if I'd asked it to
> >>>>> match the `\n` at the end of the input:
> >>>>>
> >>>>> $ printf 'foo\n' | awk -v RS='\n$' -v ORS='x\n' '1'
> >>>>> foox
> >>>>>
> >>>>> I could just file this under "don't write stupid regexps" but I was
> >>>>> wondering if there's a more concrete, satisfying explanation of the
> >>>>> behavior.
> >>>>>
> >>>>>        Ed.
> >>>> It's a bug. This appears to be the fix. It doesn't break the
> >>>> test suite, either.
> >>>>
> >>>> Thanks for the report!
> >>>>
> >>>> Arnold
> >>>> -----------------------------------------
> >>>> diff --git a/io.c b/io.c
> >>>> index 2714398e..0af8ab1e 100644
> >>>> --- a/io.c
> >>>> +++ b/io.c
> >>>> @@ -3702,7 +3702,7 @@ again:
> >>>>                   * If still room in buffer, skip over null match
> >>>>                   * and restart search. Otherwise, return.
> >>>>                   */
> >>>> -                if (bp + iop->scanoff < iop->dataend) {
> >>>> +                if (bp + iop->scanoff <= iop->dataend) {
> >>>>                          bp += iop->scanoff;
> >>>>                          goto again;
> >>>>                  }
>
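
The one-character change in the patch above (`<` to `<=`) can be illustrated with a toy model. This is only a sketch under stated assumptions, not gawk's actual code: the function name `record_len_on_null_matches` is hypothetical, every search is assumed to report a zero-length match at offset 0 (as a regexp like `()` would), and the way the record length is derived from the final scan position is an assumption about the surrounding bookkeeping, which the diff does not show.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (NOT gawk's actual code): every search is assumed to report a
 * zero-length match at offset 0 of the unscanned data, as a regexp like ()
 * would. Returns how many bytes of the buffer end up in the record.
 *
 * inclusive == 0 models the old test:  pos + scanoff <  dataend
 * inclusive != 0 models the new test:  pos + scanoff <= dataend
 */
size_t record_len_on_null_matches(size_t dataend, int inclusive)
{
    size_t pos = 0;                 /* offset of the scan pointer (bp) */

    for (;;) {
        size_t restart = 0;         /* assumed match start, relative to pos */
        size_t reend = 0;           /* assumed match end: a null match */
        size_t scanoff = reend + 1; /* step past the null match */

        assert(scanoff > 0);        /* each retry must make progress */

        int room = inclusive ? (pos + scanoff <= dataend)
                             : (pos + scanoff <  dataend);
        if (room) {
            pos += scanoff;         /* skip the null match, search again */
            continue;
        }
        /* Assumed bookkeeping: the record ends where scanning stopped. */
        return pos + restart;
    }
}
```

For the 4-byte input `foo\n`, `record_len_on_null_matches(4, 0)` returns 3 and `record_len_on_null_matches(4, 1)` returns 4. In this model the old comparison stops one byte short, so the trailing newline is left out of the record, which would be consistent with the `foox` output reported at the start of the thread.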


