help-gawk
Re: inconstancy with RS = "(\r?\n){2}"


From: Wolfgang Laun
Subject: Re: inconstancy with RS = "(\r?\n){2}"
Date: Sun, 25 Jul 2021 15:04:33 +0200

I have been looking at the code in io.c and re.c.

gawk lets you specify an arbitrary regex as RS, the record separator. But
in an environment (terminal, socket) where the input data is not yet
available to the gawk code looking for a match with RS, it is *in general*
impossible to decide whether the full RS has been encountered until some
more input has arrived. Of course, there are regexes where you can tell,
e.g. /ab?c/. But the analysis becomes more and more difficult once
parentheses and repetitions are involved. So, to be on the safe side,
gawk reads yet another line from the input source and only then passes
the next record to the user's code.

gawk is not a (soft) real-time program and cannot react to every RS match
immediately after it has been typed on a TTY or sent over a line.

If you need this behavior, keep the default RS and implement a simple FSM
(finite-state machine) in your awk code, which is better equipped to handle
separators like /(\r?\n){2}/.
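
A minimal sketch of such an FSM, assuming that "record" means a run of
non-blank lines and that a blank line (with an optional trailing CR) ends
it. The function name split_records and the <...> output format are made
up for illustration. Note that this is not an exact emulation of
RS = "(\r?\n){2}": runs of blank lines are collapsed, empty records are
never produced, and CRs are stripped from the record text.

```shell
# Hypothetical helper: read stdin line by line with the default RS and
# print each blank-line-terminated record as <record>.
split_records() {
    awk '
    { sub(/\r$/, "") }              # tolerate CRLF: drop one trailing CR
    $0 == "" {                      # blank line: the separator is complete
        if (have) {
            print "<" buf ">"
            fflush()                # show the record right away on a TTY
        }
        buf = ""; have = 0          # back to the "no record yet" state
        next
    }
    {                               # ordinary line: append to the record
        buf = (have ? buf "\n" $0 : $0)
        have = 1
    }
    END { if (have) print "<" buf ">" }   # flush an unterminated record
    '
}
```

Because the decision is made on each complete line, a record is printed as
soon as the blank line that ends it has been entered; no lookahead past the
separator is needed. For example, printf 'a\r\nb\r\n\r\nc\n' | split_records
prints <a, b> (on two lines) followed by <c>.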

The gawk user manual could include a paragraph describing what I have tried
to say above, perhaps better formulated.

-W



On Sun, 25 Jul 2021 at 13:55, Alex fxmbsw7 Ratchev <fxmbsw7@gmail.com>
wrote:

> thank you for the accurate and detailed analysis
>
> On Sun, Jul 25, 2021, 13:49 Ed Morton <mortoneccc@comcast.net> wrote:
>
>>
>>
>> On 7/25/2021 4:47 AM, arnold@skeeve.com wrote:
>>
>> Greetings.
>>
>> Thank you for taking the time to make a bug report. In the future please
>> send a concise description of the problem with a test program and data.
>> It was hard for me to determine what you really think is the bug.
>>
>> It looks like your concern is with the need to enter EOF more than
>> once from the terminal.
>>
>> Gawk is designed mainly for batch processing (from files or a pipe).
>> Reading from a terminal with a complicated regexp as RS isn't the
>> normal use case.  When RS is a regexp gawk may have to do lookahead in
>> the input stream to be sure that the regexp has matched, and thus
>> the need for multiple EOFs.
>>
>> In any case, I don't think there is an actual bug:
>>
>> $ od -c data
>> 0000000   a  \n  \n  \n   b  \n  \n  \n  \n   c  \n  \n  \n  \n   d  \n
>> 0000020
>> $ ./gawk -v RS='(\r?\n){2}' -v ORS='|\n' '{ print }' < data
>> a|
>>
>> b|
>> |
>> c|
>> |
>> d
>> |
>>
>> This looks right to me.
>>
>> Thanks,
>>
>> Arnold
>>
>>
>>
>> The problem occurs when reading from a terminal:
>>
>> Good (no \r? in RS), every pair of `\n`s is recognized:
>> ------------
>> $ gawk -v RS='(\n){2}' '{print "<"$0":"RT">"}'
>>
>>
>>
>> <:
>>
>> >
>>
>>
>> <:
>>
>> >
>>
>>
>> <:
>>
>> >
>> -----------------
>>
>> Bad (with \r? in RS), no RS is ever recognized:
>> --------------
>> $ gawk -v RS='(\r?\n){2}' '{print "<"$0":"RT">"}'
>>
>>
>>
>>
>>
>>
>> -------------------
>>
>> Meanwhile if the input was coming from a pipe the RS including `\r?`
>> would be recognized:
>> ---------
>> $ printf '\n\n\n\n\n' | gawk -v RS='(\r?\n){2}' '{print "<"$0":"RT">"}'
>> <:
>>
>> >
>> <:
>>
>> >
>> <
>> :>
>> -----------
>>
>> Regards,
>>
>>     Ed.
>>
>

-- 
Wolfgang Laun

