Re: How to represent NBSP in gawk regex?
From: Neil R. Ormos
Subject: Re: How to represent NBSP in gawk regex?
Date: Mon, 21 Feb 2022 11:04:14 -0600 (CST)
david kerns wrote:
> Eli Zaretskii <eliz@gnu.org> wrote:
>> [david kerns wrote:]
>>> from the gawk user manual, my interpretation
>>> is that gawk only accepts UTF-8 encodings...
>> That's not true, AFAIK.
> Thus the sheepish wording... I was not able to
> get UTF-16 encoding to work, so I read the
> manual... I couldn't find it clearly stated
> either way, but I did read this:
> | With the increasing popularity of the Unicode
> | character standard <http://www.unicode.org/>,
> | there is an additional wrinkle to consider.
> | Octal and hexadecimal escape sequences inside
> | bracket expressions are taken to represent
> | only single-byte characters (characters whose
> | values fit within the range 0-256). To
> | match a range of characters where the
> | endpoints of the range are larger than 256,
> | enter the multibyte encodings of the
> | characters directly.
> which is what Wolfgang did.
I think the lesson that should be drawn from that manual excerpt is limited to
the specific context of escaped representations of characters in bracket
expressions--i.e., within bracket expressions, as a special case, Gawk does not
form multibyte characters from runs of escaped byte values, even if those runs
of byte values are equivalent to a code point in the current locale.
But that's not true everywhere. As an example,
gawk 'BEGIN{print length("\xc2\xa0") }'
prints 1 in a UTF-8 locale, showing that Gawk recognizes the run of bytes as a
single character.
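For contrast, the same expression can be evaluated under an explicitly chosen locale. The sketch below wraps it in a hypothetical shell helper, nbsp_len (my name, not part of gawk); it assumes gawk is on the PATH and that the locale named in the argument is installed.

```shell
# nbsp_len LOCALE: print how many characters gawk sees in the byte
# sequence C2 A0 when run under LOCALE.
# Assumptions: gawk is installed; LOCALE names a locale the system provides.
nbsp_len() {
    LC_ALL=$1 gawk 'BEGIN { print length("\xc2\xa0") }'
}
```

Calling nbsp_len C should report 2, since in the C locale each byte counts as one character. A UTF-8 locale such as en_US.UTF-8, where installed, should report 1 as described above.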
> Perhaps my real issue is that I live in an
> "LC_ALL=C" bubble
Although both David's and Wolfgang's solutions work, I wonder if there is a
more portable way to represent the character that is not nailed-up for a
specific character set. As a wishful-thinking example, if iconv accepted
"html" as one of the /character sets/ that could be specified using the
--from-code option, it might be used at run-time to translate "&nbsp;" to the
equivalent character in the current locale. Surely a UTF-256 is on the horizon.
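Short of an "html" pseudo-charset, one approximation is to start from the fixed UTF-8 bytes of U+00A0 and let iconv convert them at run time to whatever encoding is wanted. This is only a sketch: nbsp_in is a hypothetical helper name, and it assumes an iconv that recognizes the target charset under the name given.

```shell
# nbsp_in CHARSET: write NBSP (U+00A0) to stdout, encoded in CHARSET.
# The source is the UTF-8 encoding C2 A0, spelled as octal escapes so
# that the script itself stays plain ASCII.
nbsp_in() {
    printf '\302\240' | iconv -f UTF-8 -t "$1"
}

# Possible use (assumes "locale charmap" prints a name iconv accepts):
#   nbsp=$(nbsp_in "$(locale charmap)") &&
#       gawk -v nbsp="$nbsp" '{ gsub(nbsp, " ") } 1' file
```

Where the target charset has no NBSP at all, as with plain ASCII in the C locale, the conversion is expected to fail rather than produce a substitute, so the caller should check the exit status before using the result.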