
[bug-gawk] Unexpected results with RS="."


From: Ed Morton
Subject: [bug-gawk] Unexpected results with RS="."
Date: Sun, 10 Jun 2018 11:28:12 -0500
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

I was recently surprised by this behavior from gawk 4.2.0:

   $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
   1 <foo
   :>

I came across this because I was trying to process data one character at a time and thought that having RS match a single character, so that RT is set to each character in turn, might be a valid approach rather than writing a loop. I'm not looking for alternatives, just wondering about this specific functionality.

A little investigation shows that it behaves as if I'd used RS='[.]':

   $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
   1 <foo:.>
   2 <bar
   :>

I expected that RT would take the values f, o, o, \n and every $0 would be the null string, analogous to what happens when you use 2 "."s:

   $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
   1 <:fo>
   2 <:o
   >

I assume it does this for compatibility with other awks, where a single-character RS is always just that literal character. But that seems counter-intuitive given the way gawk otherwise uses RS as a regexp, and I don't know how we're supposed to set RS to "any single character" given this implementation. If RS="." were interpreted as a normal regexp, we could use `RS="[.]"` to get a literal ".", just as we do in any other regexp context.

I've since discovered that I can get the behavior I want with `RS=".{1}"` or `RS="[[:space:]]|[^[:space:]]"` etc., but it's all pretty kludgy and non-intuitive.

I can't find anything in the gawk documentation that states that the above is expected behavior. Assuming that, for backward compatibility, we can't change the code to treat the "." in RS="." as a regexp metacharacter, can we get a clear statement added to the documentation, something like "If RS is a single character it will be treated as a literal character and not a regexp metacharacter", along with RS=".{1}" shown as a workaround for the case where the desired regexp is "a single occurrence of any character"? I can't think of any other regexp metacharacter that this issue would apply to.

     Ed.


