bug-gawk

Re: Use of '()' in a regexp


From: Ed Morton
Subject: Re: Use of '()' in a regexp
Date: Sat, 9 Jan 2021 14:46:03 -0600

Great, thanks again!

Ed Morton

> On Jan 9, 2021, at 2:44 PM, arnold@skeeve.com wrote:
> 
> Code and doc have been updated in git.
> 
> Arnold
> 
> Ed Morton <mortoneccc@comcast.net> wrote:
> 
>> Sounds good. From testing how `split()` and setting `FS` behave it looks 
>> like that rule applies to Field Separators in addition to Record 
>> Separators, as that would explain these differences:
>> 
>> $ printf 'foo\n' | awk '{print gsub(/()/,"X")}1'
>> 4
>> XfXoXoX
>> 
>> $ printf 'foo\n' | awk '{print split($0,a,/()/); for (i=1; i in a; i++) print a[i]}'
>> 1
>> foo
>> 
>> I don't think that's documented anywhere currently. It may be worth a 
>> brief statement in the manual, something like "if RS is a multi-char 
>> regexp that can match a null string (e.g. `RS='()'`) 
>> then ...." and an almost identical statement where field separator 
>> values are described?
>> 
>> Whatever you decide... thanks for quickly looking into and providing the 
>> fix and the explanation!
>> 
>>     Ed.
>> 
>>> On 1/7/2021 8:07 AM, arnold@skeeve.com wrote:
>>> The answer is "no".  Record separators must be non-null; the only exception
>>> where RT will be "" is at the end of a file.
>>> 
>>> This is also how Brian Kernighan's awk handles RS as a regexp.
>>> 
>>> Thanks,
>>> 
>>> Arnold
>>> 
>>> Ed Morton <mortoneccc@comcast.net> wrote:
>>> 
>>>> In case that's not an adequate example, what I mean is, will this:
>>>> 
>>>> $ printf 'foo\nbar\n' | awk -v RS='()' -v ORS='X' '1'
>>>> 
>>>> then produce the same output as this:
>>>> 
>>>> $ printf 'foo\nbar\n' | awk -v RS='^$' '{gsub(/()/,"X")}1'
>>>> XfXoXoX
>>>> XbXaXrX
>>>> X
>>>> 
>>>> or not and, if not, why is it different?
>>>> 
>>>> I just noticed that this seems to handle `/()/` differently from either
>>>> of the current cases again:
>>>> 
>>>> $ printf 'foo\nbar\n' | awk '{nf=split($0,flds,/()/,seps); print nf; for (i=0; i<=nf; i++) printf "%s%s", flds[i], "<"seps[i]">" ; print ""}'
>>>> 1
>>>> <>foo<>
>>>> 1
>>>> <>bar<>
>>>> 
>>>> Regards,
>>>> 
>>>>      Ed.
>>>> 
>>>> On 1/6/2021 2:54 PM, Ed Morton wrote:
>>>>> Great! Will that treat `()` when used in an RS:
>>>>> 
>>>>>     awk -v RS='()' -v ORS='x' '1'
>>>>> 
>>>>> the same as it's treated in a regexp in other contexts such as with
>>>>> gsub():
>>>>> 
>>>>>     awk -v ORS= '{gsub(/()/,"x")} 1'
>>>>> 
>>>>> or does it mean something different when used in an RS?
>>>>> 
>>>>>     Ed.
>>>>> 
>>>>> On 1/6/2021 1:33 PM, arnold@skeeve.com wrote:
>>>>>> Hi. Re this:
>>>>>> 
>>>>>> Ed Morton <mortoneccc@comcast.net> wrote:
>>>>>> 
>>>>>>> Someone just pointed this out to me (gawk 5.1.0):
>>>>>>> 
>>>>>>> $ printf 'foo\n' | awk '{gsub(/()/,"x")} 1'
>>>>>>> xfxoxox
>>>>>>> 
>>>>>>> $ printf 'foo\n' | awk -v RS='()' -v ORS='x\n' '1'
>>>>>>> foox
>>>>>>> 
>>>>>>> Obviously that's a pretty ridiculous regexp but it still has me
>>>>>>> wondering - why does `gsub()` treat the regexp `()` as matching a null
>>>>>>> string around every character while `RS` treats it as if I'd asked it to
>>>>>>> match the `\n` at the end of the input:
>>>>>>> 
>>>>>>> $ printf 'foo\n' | awk -v RS='\n$' -v ORS='x\n' '1'
>>>>>>> foox
>>>>>>> 
>>>>>>> I could just file this under "don't write stupid regexps" but I was
>>>>>>> wondering if there's a more concrete, satisfying explanation of the
>>>>>>> behavior.
>>>>>>> 
>>>>>>>       Ed.
>>>>>> It's a bug. This appears to be the fix. It doesn't break the
>>>>>> test suite, either.
>>>>>> 
>>>>>> Thanks for the report!
>>>>>> 
>>>>>> Arnold
>>>>>> -----------------------------------------
>>>>>> diff --git a/io.c b/io.c
>>>>>> index 2714398e..0af8ab1e 100644
>>>>>> --- a/io.c
>>>>>> +++ b/io.c
>>>>>> @@ -3702,7 +3702,7 @@ again:
>>>>>>            * If still room in buffer, skip over null match
>>>>>>            * and restart search. Otherwise, return.
>>>>>>            */
>>>>>> -        if (bp + iop->scanoff < iop->dataend) {
>>>>>> +        if (bp + iop->scanoff <= iop->dataend) {
>>>>>>               bp += iop->scanoff;
>>>>>>               goto again;
>>>>>>           }
>> 



