

From: Ed Morton
Subject: Re: minor documentation suggestion for FS values and "whitespace" in general
Date: Tue, 24 Mar 2020 09:30:28 -0400
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0

Andy - isn't set_FIELDWIDTHS just splitting FIELDWIDTHS="1 2 3" into values rather than splitting the record into fields? If so then it may be OK to just use ' ' or '\t' in that context regardless of what FS is set to. I suppose if that's true then the only problems with the current code in that respect are:

1) Documentation: per the gawk manual, "assigning a string containing *space*-separated numbers to the built-in variable FIELDWIDTHS" (bold mine), which brings us back to what "space" means ([:blank:] vs [:space:] vs "whitespace").

2) Inability to set FIELDWIDTHS to a newline-separated list (which I assume no-one's actually complaining about but it's not obvious that you can't and it might be useful when setting field widths from the output of some tool):

   $ wids='2

   $ echo 'abcdefghi' | awk -v FIELDWIDTHS="$wids" '{print NR, NF; for (i=1; i<=NF; i++) print i, "<" $i ">"}'
        awk: fatal: invalid FIELDWIDTHS value, for field 1, near `2
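
One possible workaround for the newline-separated case, as a minimal sketch: collapse the list onto a single line before assigning it. This assumes gawk is installed as `gawk`. The widths 2, 3 and 4, the `widths_from_tool` helper standing in for "some tool", and the `split_fixed` wrapper are all hypothetical and not taken from the example above.

```shell
# Hypothetical tool output: one field width per line.
widths_from_tool() { printf '2\n3\n4\n'; }

# paste -s joins the lines with single spaces, a separator that
# FIELDWIDTHS is documented to accept.
wids=$(widths_from_tool | paste -sd' ' -)

# Split a sample record into fixed-width fields and print each one.
split_fixed() {
    echo 'abcdefghi' |
        gawk -v FIELDWIDTHS="$wids" '{ for (i = 1; i <= NF; i++) print i, "<" $i ">" }'
}
split_fixed
```

With these widths the record is split into ab, cde and fghi, with no fatal error.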



On 3/24/2020 8:45 AM, Andrew J. Schorr wrote:

In the code inside field.c, set_FIELDWIDTHS uses is_blank, which tests for ' ' and
'\t', but other places (def_parse_field and re_parse_field) test for '\n' in
addition to those two. Granted, in the normal case where RS is '\n', it doesn't
matter whether FS is checking for '\n', but I suppose it could matter when
RS has an unusual value...


On Tue, Mar 24, 2020 at 03:55:07AM -0600, address@hidden wrote:
Whitespace is ' ' and '\t'.  I will clarify the documentation, but
likely not in terms of [[:blank:]], since I suspect that in UTF locales
it can match more than just ' ' and '\t'.



Ed Morton <address@hidden> wrote:

I was just looking up which exact characters get included in the set of
field separators when FS is " " (the default value) and got confused by
this in the gawk documentation:

     Class        Meaning
     [:blank:]    Space and TAB characters
     [:space:]    Space characters (these are: space, TAB, newline,
                  carriage return, formfeed and vertical tab)

     FS == " "
          Fields are separated by runs of *whitespace*. Leading and
          trailing whitespace are ignored. This is the default.
     /(bold added by me)/

I took the last statement above to mean that FS would be the set of
characters defined by the [:space:] character class, but it isn't: when
FS is " " it doesn't include carriage return (\r) or vertical tab (\v)
(I didn't bother checking the others). Neither is it the [:blank:]
character class, since FS includes newlines (\n). Instead it seems to be
[:blank:] plus newline, and that's supported by the POSIX spec if we
assume by <blank> they mean [:blank:]:

     ...by default, a field is a string of non-<blank> non-<newline>
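
A quick way to check this, as a sketch assuming gawk is installed as `gawk` and running in its default (non-POSIX) mode: set RS to a character other than newline so that \n, \r and \v can all appear inside one record, then count the fields. The sample record, the choice of ';' as RS, and the `count_fields` wrapper are illustrative only.

```shell
# One record (terminated by ';') containing space, TAB, newline, CR and VT.
count_fields() {
    printf 'a b\tc\nd\re\vf;' | gawk -v RS=';' '{ print NF }'
}
count_fields
```

If the default FS splits on space, TAB and newline only, this prints 4 (a, b, c, and d\re\vf as a single field); a [:space:]-based split would give 6.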

But what does newline mean in all of the above? Is it always linefeed
(\n) on all platforms or is it LF (\n) on UNIX and CRLF (\r\n) on
Windows or something else? I really don't know.

So - maybe you could update the documentation to say "Fields are
separated by runs of whitespace (i.e. [:blank:] plus linefeed
characters)" or similar? I couldn't find anywhere in the documentation
that states exactly which characters FS includes when assigned " ", nor
what exactly is meant by "whitespace" throughout the documentation, and I
think that one tweak to provide a clear definition of the term
"whitespace" would clarify all of it.

