Re: [bug-gawk] Unexpected results with RS="."


From: arnold
Subject: Re: [bug-gawk] Unexpected results with RS="."
Date: Mon, 11 Jun 2018 00:07:56 -0600
User-agent: Heirloom mailx 12.4 7/29/08

Hi Ed.

The behavior is stated clearly, if tersely, in the summary section in the
chapter on reading input
(https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):


        Input is split into records based on the value of RS. The possibilities
        are as follows:

        Value of RS             Records are split on …          awk / gawk
        Any single character    That character                  awk
        The empty string ("")   Runs of two or more newlines    awk
        A regexp                Text that matches the regexp    gawk
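
As a rough illustration of the three rows above (not taken from the manual;
the sample inputs are made up for this sketch), something like the following
should exercise each case. The third command assumes gawk is installed, since
RT is a gawk extension:

```shell
# Row 1: a single-character RS is taken literally, so "." splits on periods
# rather than acting as the regexp metacharacter "any character".
printf 'foo.bar' | awk -v RS='.' '{ print NR ": " $0 }'

# Row 2: an empty RS selects paragraph mode; records are separated by
# blank lines, and newlines inside a record act as field separators.
printf 'a\nb\n\nc\n' | awk -v RS='' '{ print NR ": " $1 }'

# Row 3: any longer value is treated as a regexp; gawk stores the text
# that actually matched in RT.
printf 'foo12bar' | gawk -v RS='[0-9]+' '{ print NR ": " $0 " (RT=" RT ")" }'
```

The first two commands rely only on POSIX awk behavior, so they should give
the same records under any awk.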

Thanks,

Arnold


Ed Morton <address@hidden> wrote:

> I was recently surprised by this behavior from gawk 4.2.0:
>
>     $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
>     1 <foo
>     :>
>
> I came across this because I was trying to process data 1 char at a time and 
> thought setting RT to 1 char at a time might be a valid approach rather than 
> writing a loop. I'm not looking for alternatives, just wondering about this 
> specific functionality.
>
> A little investigation shows that it behaves as if I'd used RS='[.]':
>
>     $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
>     1 <foo:.>
>     2 <bar
>     :>
>
> I expected that RT would take the values f, o, o, \n and every $0 would be
> the null string, analogous to what happens when you use 2 "."s:
>
>     $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
>     1 <:fo>
>     2 <:o
>     >
>
> I assume it does this for compatibility with other awks where a single char
> RS is always just that literal character but that seems counter-intuitive to
> the way gawk uses RS as a regexp otherwise and idk how we're supposed to set
> the RS to "any single character" given this implementation whereas if RS="."
> was interpreted as a normal regexp then we could use `RS="[.]"` to get a
> literal "." just like we do for it in any other regexp context.
>
> I've since discovered that I can get the behavior I want with `RS=".{1}"` or
> `RS="[[:space:]]|[^[:space:]]"` etc. but it's all pretty kludgy and
> non-intuitive.
>
> I can't find anything in the gawk documentation that states that the above is
> expected behavior. Assuming we can't update the code to treat RS="." as if
> "." is a regexp metacharacter for backward compatibility, can we get a
> statement saying something clear like "If RS is a single character it will be
> treated as a literal character and not a regexp metacharacter" added to the
> documentation and also the example of RS=".{1}" shown as a workaround for the
> case where the desired regexp is "a single occurrence of any character"? I
> can't think of any other regexp metacharacter that this issue would apply to.
>
>       Ed.
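
For reference, the difference between the literal single-character RS and the
`.{1}` workaround mentioned in the quoted message can be sketched as below.
This assumes a gawk recent enough to have interval expressions enabled by
default; the inputs are shortened, made-up variants of the ones in the thread
and use printf so that no trailing newline is involved:

```shell
# Literal case: RS='.' is a single character, so it splits only on periods
# and RT holds the period that ended the record.
printf 'f.o' | gawk -v RS='.' '{ print NR, "<" $0 ":" RT ">" }'

# Workaround: '.{1}' is longer than one character, so gawk treats it as a
# regexp. Every character becomes a record terminator, $0 stays empty,
# and RT carries the character itself.
printf 'foo' | gawk -v RS='.{1}' '{ print NR, "<" $0 ":" RT ">" }'
```

Any RS value longer than one character goes through the regexp path, which is
why a bracket expression such as `[.]` is also a way to spell a literal period
explicitly.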


