bug-gawk

Re: difference in RS handling for equivalent regexps with unending input


From: arnold
Subject: Re: difference in RS handling for equivalent regexps with unending input stream
Date: Wed, 03 Jul 2024 05:19:54 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Hi Ed.

There are two separate questions here. One is why the different
regexps are handled differently. The answer is I don't know, although I
can guess that since the third one is guaranteed to only match a
single character, the matching is a little smarter.

The second question is why gawk does not parse all the input when
standard input remains open.  The reason is that it has to read ahead
a bit to be sure that it has completely matched the regular expression
and can tell where the definitive end of the record is.
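
As a minimal sketch of that point (not part of the original report): when
the input is finite, end of file tells awk that nothing more can extend the
match, so all three regexps should split the sample identically. The helper
name `records_for_rs` is made up for illustration, and the trailing newline
from the report's sample is left out so that only the three delimited
records are in play.

```shell
# records_for_rs RS_REGEXP
# Split a finite sample with the given record separator. Because the
# pipe is closed after the sample, awk sees end of file and never has
# to wait for read-ahead.
records_for_rs() {
    printf 'A;B;C;' | awk -v RS="$1" '{ print NR, $0 }'
}

for rs in '(;|=)' ';|=' '[;=]'; do
    printf 'RS=%s\n' "$rs"
    records_for_rs "$rs"
done
```

If the three listings differ even with closed input, the awk in use is
probably not treating RS as a regexp at all.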

You can try with mawk and the One True Awk and see if the behavior
is any different. Both of those allow RS to be a regexp.
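
One way to run that comparison without pressing control-C by hand is
sketched below. It assumes GNU `timeout` is available and that the awk to
test is named in a hypothetical `AWK` variable (default `awk`). The pipe is
held open for two seconds and awk is killed after one, so whatever was
printed in that first second is what the implementation emitted before it
could see end of file. The `fflush()` call keeps output buffering from
hiding records.

```shell
# probe AWK_CMD RS_REGEXP
# Feed the sample, keep the pipe open for 2 seconds, and kill awk after
# 1 second (GNU timeout assumed). Only records emitted before end of
# file can show up in the output.
probe() {
    { printf 'A;B;C;\n'; sleep 2; } |
        timeout 1 "$1" -v RS="$2" '{ print NR, $0; fflush() }' || true
}

for rs in '(;|=)' ';|=' '[;=]'; do
    printf '%s with RS=%s:\n' "${AWK:-awk}" "$rs"
    probe "${AWK:-awk}" "$rs"
done
```

Setting `AWK=gawk`, `AWK=mawk`, and so on before running it gives one
listing per implementation. Some implementations buffer their input and may
print nothing at all before end of file.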

I looked at the Stack Overflow post. Gawk has a read timeout mechanism
(see the manual, I don't remember the details) that will likely work
on pipes, sockets and terminals; that might do the trick, it might not.
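
A rough sketch of trying that mechanism, assuming the `GAWK_READ_TIMEOUT`
environment variable described in the gawk manual (a timeout in
milliseconds, gawk only). The function name `read_with_timeout` is
hypothetical. The manual describes a timeout on the main input as a fatal
read error, so gawk is expected to exit with a nonzero status; whether the
records still pending at that point get printed first is exactly the open
question, so no output is shown here.

```shell
# read_with_timeout MILLISECONDS RS_REGEXP  (reads standard input)
# Assumption: gawk is installed and honors GAWK_READ_TIMEOUT.
read_with_timeout() {
    GAWK_READ_TIMEOUT="$1" gawk -v RS="$2" '{ print NR, $0; fflush() }'
}

if command -v gawk >/dev/null 2>&1; then
    # The pipe stays open for 2 seconds; gawk is asked to give up
    # after 500 ms without new input.
    { printf 'A;B;C;\n'; sleep 2; } | read_with_timeout 500 '(;|=)' ||
        echo "gawk exited with status $?"
fi
```

Comparing the records printed here with the report's output for the same
regexp shows whether the timeout changes anything on a given platform.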

In any case, there are no real bugs here, just limits as to what is
possible.

Thanks,

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

> Configuration Information [Automatically generated, do not change]:
> Machine: x86_64
> OS: cygwin
> Compiler: gcc
> Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
> -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
> --param=ssp-buffer-size=4
> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
> -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
> -DNDEBUG
> uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64 
> 2024-04-03 17:25 UTC x86_64 Cygwin
> Machine Type: x86_64-pc-cygwin
>
> Gawk Version: 5.3.0
>
> Attestation 1:
>          I have read 
> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>          Yes
>
> Attestation 2:
>          I have not modified the sources before building gawk.
>          True
>
> Description:
>
>     Someone asked a question on SO about handling unending input from
>     netcat with a regexp delimiter that's just 2 possible chars, see
>     https://stackoverflow.com/q/78700014/1745001, where gawk seems to be
>     a record behind in its processing. I'm using bash on cygwin, they
>     used zsh on MacOS.
>
> Repeat-By:
>
>     I can reproduce the problem with this (hitting control-C to stop
>     each command when it stops to wait for more input):
>
>     $ printf 'A;B;C;\n' > file
>
>     $ cat file - | awk -v RS='(;|=)' '{print NR, $0}'
>     1 A
>
>     $ cat file - | awk -v RS=';|=' '{print NR, $0}'
>     1 A
>     2 B
>
>     $ cat file - | awk -v RS='[;=]' '{print NR, $0}'
>     1 A
>     2 B
>     3 C
>
>     Obviously that's 3 supposedly equivalent regexps producing 3
>     different results.


