Re: minor documentation suggestion for FS values and "whitespace" in gen
From: arnold
Subject: Re: minor documentation suggestion for FS values and "whitespace" in general
Date: Tue, 31 Mar 2020 04:19:54 -0600
User-agent: Heirloom mailx 12.5 7/5/10
Hi Ed.
I finally took a look at this. I don't see a need for major changes in the
doc. If you look at node "Fields" it says pretty clearly:
When @command{awk} reads an input record, the record is
automatically @dfn{parsed} or separated by the @command{awk}
utility into chunks called @dfn{fields}. By default, fields
are separated by @dfn{whitespace}, like words in a line.
Whitespace in @command{awk} means any string of one or more
spaces, TABs, or newlines; other characters that are considered
whitespace by other languages (such as formfeed, vertical tab,
etc.) are @emph{not} considered whitespace by @command{awk}.
The doc does not claim anywhere that this whitespace is related to the
regex character class [:space:] (in fact, it is not), so I think this
was just your confusion.
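To make the distinction concrete, here is a minimal Python sketch that models the two rules side by side. It is only an illustration of the definition quoted above, not gawk's actual implementation; the names `AWK_DEFAULT_SEPARATORS`, `POSIX_SPACE`, and `split_fields` are made up for this example.

```python
import re

# Model of the rule quoted from the "Fields" node: with the default FS (" "),
# fields are separated by runs of space, TAB, or newline only.
# (Hypothetical name; this is not how gawk itself is written.)
AWK_DEFAULT_SEPARATORS = re.compile(r"[ \t\n]+")

# The [:space:] character class as listed in the gawk documentation:
# space, TAB, newline, carriage return, formfeed, and vertical tab.
POSIX_SPACE = re.compile(r"[ \t\n\r\f\v]+")


def split_fields(record, separators=AWK_DEFAULT_SEPARATORS):
    """Split a record into fields, ignoring leading and trailing separators."""
    # Only leading or trailing separators can produce empty strings here,
    # because each match consumes a whole run of separator characters.
    return [field for field in separators.split(record) if field]


record = " a\tb\vc\rd\fe \n"
print(split_fields(record))               # splits only at space, TAB, newline
print(split_fields(record, POSIX_SPACE))  # splits at every [:space:] character
```

Under the first rule the vertical tab, carriage return, and formfeed stay inside the second field; under the second rule each of them would end a field.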
Thanks,
Arnold
Ed Morton <address@hidden> wrote:
> I was just looking up which exact characters get included in the set of
> field separators when FS is " " (the default value) and got confused by
> this in the gawk documentation:
>
> Class Meaning
> [:blank:] Space and TAB characters
> [:space:] Space characters (these are: space, TAB, newline,
> carriage return, formfeed and vertical tab)
>
> FS == " "
> Fields are separated by runs of *whitespace*. Leading and
> trailing whitespace are ignored. This is the default.
> /(bold added by me)/
>
> I took the last statement above to mean that FS would be the set of
> characters defined by the [:space:] character class but it's not since
> FS doesn't include carriage return (\r) nor vertical tab (\v) (I didn't
> bother checking others) when FS is " ", neither is it the [:blank:]
> character class since it includes newlines (\n). Instead it seems to be
> [:blank:] plus newline and that's supported by the POSIX spec if we
> assume by <blank> they mean [:blank:]:
>
> ...by default, a field is a string of non- <blank> non- <newline>
> characters.
>
> But what does newline mean in all of the above? Is it always linefeed
> (\n) on all platforms or is it LF (\n) on UNIX and CRLF (\r\n) on
> Windows or something else? I really don't know.
>
> So - maybe you could update the documentation to say "Fields are
> separated by runs of the whitespace (i.e. [:blank:] plus linefeed
> characters)" or similar? I couldn't find anywhere in the documentation
> that states exactly which characters FS includes when assigned " " nor
> what exactly is meant by "whitespace" throughout the documentation and I
> think that one tweak to provide a clear definition of the term
> "whitespace" would clarify all of it.
>
> Ed.