Re: [bug-gawk] Unexpected results with RS="."
From: arnold
Subject: Re: [bug-gawk] Unexpected results with RS="."
Date: Mon, 11 Jun 2018 00:07:56 -0600
User-agent: Heirloom mailx 12.4 7/29/08
Hi Ed.
The behavior is stated clearly, if tersely, in the summary section in the
chapter on reading input
(https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):
Input is split into records based on the value of RS. The possibilities
are as follows:
Value of RS              Records are split on …          awk / gawk
Any single character     That character                  awk
The empty string ("")    Runs of two or more newlines    awk
A regexp                 Text that matches the regexp    gawk
Thanks,
Arnold
Ed Morton <address@hidden> wrote:
> I was recently surprised by this behavior from gawk 4.2.0:
>
> $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
> 1 <foo
> :>
>
> I came across this because I was trying to process data 1 char at a time and
> thought setting RT to 1 char at a time might be a valid approach rather than
> writing a loop. I'm not looking for alternatives, just wondering about this
> specific functionality.
>
> A little investigation shows that it behaves as if I'd used RS='[.]':
>
> $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
> 1 <foo:.>
> 2 <bar
> :>
>
> I expected that RT would take the values f, o, o, \n and every $0 would be the
> null string, analogous to what happens when you use 2 "."s:
>
> $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
> 1 <:fo>
> 2 <:o
> >
>
> I assume it does this for compatibility with other awks where a single char RS
> is always just that literal character but that seems counter-intuitive to the
> way gawk uses RS as a regexp otherwise and idk how we're supposed to set the RS
> to "any single character" given this implementation whereas if RS="." was
> interpreted as a normal regexp then we could use `RS="[.]"` to get a literal "."
> just like we do for it in any other regexp context.
>
> I've since discovered that I can get the behavior I want with `RS=".{1}"` or
> `RS="[[:space:]]|[^[:space:]]"` etc. but it's all pretty kludgy and
> non-intuitive.
>
> I can't find anything in the gawk documentation that states that the above is
> expected behavior. Assuming we can't update the code to treat RS="." as if "."
> is a regexp metacharacter for backward compatibility, can we get a statement
> saying something clear like "If RS is a single character it will be treated as a
> literal character and not a regexp metacharacter" added to the documentation and
> also the example of RS=".{1}" shown as a workaround for the case where the
> desired regexp is "a single occurrence of any character"? I can't think of any
> other regexp metacharacter that this issue would apply to.
>
> Ed.