bug-gawk
Re: [bug-gawk] Unexpected results with RS="."


From: Ed Morton
Subject: Re: [bug-gawk] Unexpected results with RS="."
Date: Mon, 11 Jun 2018 05:43:52 -0500
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

Arnold - thanks for responding. I don't agree that it's clear: that section doesn't state that the 3 possibilities are considered in that order. It reads as if they were mutually exclusive, but of course they aren't when it comes to RS=".", so what gawk does when the single character is also a regexp metacharacter is ambiguous if that's the only statement about the behavior. In any case, I didn't even look at the Summary section, as I expected to find everything I needed on this in the main section, 4.1 How Input Is Split into Records (https://www.gnu.org/software/gawk/manual/gawk.html#Records).

Since a Summary should be just that, I'd have expected this particular information in section 4.14 to be summarized from section 4.1, not additional to it. What's stated in 4.14 is fine as a summary, but not adequate if it's the ONLY source of info on this. It also doesn't explain how to get an RS that means "any single character", and IMHO that is non-obvious (embarrassingly, I had to ask at comp.lang.awk, where Janis helped me wrap my head around it as I was drawing a blank!).

I see now there's a clear statement of the related behavior for FS in section 4.5 Specifying How Fields Are Separated (https://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators):
    If FS is any other single character, such as ",", then each occurrence of that character separates two fields. Two consecutive occurrences delimit an empty field. If the character occurs at the beginning or the end of the line, that too delimits an empty field. The space character is the only single character that does not follow these rules.
I think RS deserves the equivalent explanation in section 4.1, plus an example of an RS that matches any single character. (FS doesn't need one: there's no equivalent to RT that'd be useful in this case, and FPAT="." works as you'd expect, so there's no use case for FS="." as a regexp.)

    Ed.

On 6/11/2018 1:07 AM, address@hidden wrote:
Hi Ed.

The behavior is stated clearly, if tersely, in the summary section in the chapter
on reading input (https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):


	Input is split into records based on the value of RS. The possibilities are as follows:

	Value of RS             Records are split on …          awk / gawk
	Any single character    That character                  awk
	The empty string ("")   Runs of two or more newlines    awk
	A regexp                Text that matches the regexp    gawk

Thanks,

Arnold


Ed Morton <address@hidden> wrote:

I was recently surprised by this behavior from gawk 4.2.0:

    $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
    1 <foo
    :>

I came across this because I was trying to process data 1 char at a time and 
thought setting RT to 1 char at a time might be a valid approach rather than 
writing a loop. I'm not looking for alternatives, just wondering about this 
specific functionality.

A little investigation shows that it behaves as if I'd used RS='[.]':

    $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
    1 <foo:.>
    2 <bar
    :>

I expected that RT would take the values f, o, o, \n and every $0 would be the 
null string, analogous to what happens when you use 2 "."s:

    $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
    1 <:fo>
    2 <:o
    >

I assume it does this for compatibility with other awks where a single char RS 
is always just that literal character but that seems counter-intuitive to the 
way gawk uses RS as a regexp otherwise and idk how we're supposed to set the RS 
to "any single character" given this implementation whereas if RS="." was 
interpreted as a normal regexp then we could use `RS="[.]"` to get a literal "." 
just like we do for it in any other regexp context.

I've since discovered that I can get the behavior I want with `RS=".{1}"` or 
`RS="[[:space:]]|[^[:space:]]"` etc. but it's all pretty kludgy and non-intuitive.

I can't find anything in the gawk documentation that states that the above is 
expected behavior. Assuming we can't update the code to treat RS="." as if "." 
is a regexp metacharacter for backward compatibility, can we get a statement 
saying something clear like "If RS is a single character it will be treated as a 
literal character and not a regexp metacharacter" added to the documentation and 
also the example of RS=".{1}" shown as a workaround for the case where the 
desired regexp is "a single occurrence of any character"? I can't think of any 
other regexp metacharacter that this issue would apply to.

      Ed.

    

