bug-gawk

Re: [bug-gawk] Unexpected results with RS="."


From: arnold
Subject: Re: [bug-gawk] Unexpected results with RS="."
Date: Mon, 11 Jun 2018 11:44:12 -0600
User-agent: Heirloom mailx 12.4 7/29/08

I have added a paragraph about this point and pushed it out to Git.

Thanks,

Arnold

Ed Morton <address@hidden> wrote:

> Arnold - thanks for responding. I don't agree that it is clear, as that
> section doesn't state that the 3 possibilities are considered in that order;
> it sounds like they would just be mutually exclusive, but of course they
> aren't when it comes to RS=".". So what happens in gawk when the single char
> is also a valid regexp is ambiguous if that's the only statement about the
> behavior. In any case I didn't even look at the Summary section, as I
> expected to find everything I needed related to this in the main section,
> 4.1 How Input Is Split into Records
> (https://www.gnu.org/software/gawk/manual/gawk.html#Records).
>
> Since a Summary should be just that, I'd have expected this particular
> information in section 4.14 to be summarized from section 4.1, not additional
> to it. What's stated in 4.14 is fine as a summary, but not adequate if it's
> the ONLY source of info on this. It also doesn't explain how to get an RS
> that means "any single character", and IMHO that is non-obvious
> (embarrassingly, I had to ask at comp.lang.awk, where Janis helped me wrap my
> head around it as I was drawing a blank!).
>
> I see now there's a clear statement of the related behavior for FS in section 
> 4.5 Specifying How Fields Are Separated 
> (https://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators):
>
>     If FS is any other single character, such as ",", then each
>     occurrence of that character separates two fields. Two consecutive
>     occurrences delimit an empty field. If the character occurs at the
>     beginning or the end of the line, that too delimits an empty field. The
>     space character is the only single character that does not follow these
>     rules.
>
> I think RS deserves the equivalent explanation in section 4.1, plus the
> example of using an RS that's any char (FS doesn't need it since there's no
> equivalent to RT that'd be useful in this case, and FPAT="." works as you'd
> expect, so there's no use case for FS="." as a regexp).
>
>     Ed.
>
> On 6/11/2018 1:07 AM, address@hidden wrote:
> > Hi Ed.
> >
> > The behavior is stated clearly, if tersely, in the summary section in the
> > chapter on reading input
> > (https://www.gnu.org/software/gawk/manual/html_node/Input-Summary.html#Input-Summary):
> >
> >
> >     Input is split into records based on the value of RS. The
> >     possibilities are as follows:
> >
> >     Value of RS             Records are split on ...        awk / gawk
> >     Any single character    That character                  awk
> >     The empty string ("")   Runs of two or more newlines    awk
> >     A regexp                Text that matches the regexp    gawk
> >
> > Thanks,
> >
> > Arnold
> >
> >
> > Ed Morton <address@hidden> wrote:
> >
> >> I was recently surprised by this behavior from gawk 4.2.0:
> >>
> >>     $ echo "foo" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
> >>     1 <foo
> >>     :>
> >>
> >> I came across this because I was trying to process data 1 char at a time
> >> and thought having RT set to 1 char at a time might be a valid approach
> >> rather than writing a loop. I'm not looking for alternatives, just wondering
> >> about this specific functionality.
> >>
> >> A little investigation shows that it behaves as if I'd used RS='[.]':
> >>
> >>     $ echo "foo.bar" | awk -v RS='.' '{print NR, "<" $0 ":" RT ">"}'
> >>     1 <foo:.>
> >>     2 <bar
> >>     :>
> >>
> >> I expected that RT would take the values f, o, o, \n and every $0 would be
> >> the null string, analogous to what happens when you use 2 "."s:
> >>
> >>     $ echo "foo" | awk -v RS='..' '{print NR, "<" $0 ":" RT ">"}'
> >>     1 <:fo>
> >>     2 <:o
> >>     >
> >>
> >> I assume it does this for compatibility with other awks, where a single
> >> char RS is always just that literal character, but that seems
> >> counter-intuitive given the way gawk uses RS as a regexp otherwise, and idk
> >> how we're supposed to set the RS to "any single character" given this
> >> implementation. If RS="." were interpreted as a normal regexp, then we
> >> could use `RS="[.]"` to get a literal "." just like we do in any other
> >> regexp context.
> >>
> >> I've since discovered that I can get the behavior I want with `RS=".{1}"`
> >> or `RS="[[:space:]]|[^[:space:]]"` etc., but it's all pretty kludgy and
> >> non-intuitive.
> >>
> >> I can't find anything in the gawk documentation that states that the above
> >> is expected behavior. Assuming that, for backward compatibility, we can't
> >> update the code to treat RS="." as if "." is a regexp metacharacter, can we
> >> get a statement saying something clear like "If RS is a single character it
> >> will be treated as a literal character and not a regexp metacharacter" added
> >> to the documentation, and also the example of RS=".{1}" shown as a
> >> workaround for the case where the desired regexp is "a single occurrence of
> >> any character"? I can't think of any other regexp metacharacter that this
> >> issue would apply to.
> >>
> >>     Ed.
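
The behaviors discussed in this thread can be sketched as follows. This is only an illustration, assuming gawk 4.0 or later is installed as `gawk` (RT, FPAT, and interval expressions enabled by default are gawk features); the function names are hypothetical and chosen for this example.

```shell
# Assumption: gawk 4.0+ is on PATH as "gawk". Function names are illustrative.

# A single-character RS is taken literally, so "." splits on periods only.
rs_literal_dot() {
    gawk -v RS='.' '{ print NR, "<" $0 ":" RT ">" }'
}

# An RS longer than one character is a regexp. ".{1}" matches any one
# character, so every record is empty and RT holds each character in turn.
rs_any_char() {
    gawk -v RS='.{1}' '{ print NR, "<" $0 ":" RT ">" }'
}

# FPAT is always a regexp, so FPAT="." makes each character its own field.
fpat_any_char() {
    gawk -v FPAT='.' '{ for (i = 1; i <= NF; i++) print i, $i }'
}
```

With these defined, `printf 'foo.bar' | rs_literal_dot` should print `1 <foo:.>` and `2 <bar:>`, while `printf 'foo' | rs_any_char` should print `1 <:f>`, `2 <:o>` and `3 <:o>`. Using printf without a trailing newline keeps the final newline out of $0 and RT, which is what split the output across lines in the examples quoted above.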


