bug-gawk

Re: [bug-gawk] Exposing wcwidth(3) as a built-in function


From: Eric Pruitt
Subject: Re: [bug-gawk] Exposing wcwidth(3) as a built-in function
Date: Sat, 9 Dec 2017 01:18:14 -0800
User-agent: NeoMutt/20170113 (1.7.2)

On Sat, Dec 09, 2017 at 10:51:15AM +0200, Eli Zaretskii wrote:
> What do you mean by "multi-byte safe locale"?  UTF-8 is but one
> multi-byte encoding; it is not the only one.
>
> [...]
>
> Using the UTF-8 byte sequences was the reason why I asked whether your
> implementation relies on UTF-8.  In a locale whose codeset is not
> UTF-8, this will not work well.

The determination for this is simply 'length("宽")'. If that returns 1,
the interpreter is considered multi-byte safe, but I realize that could
be incorrect for non-Unicode encodings, and it's not something I
actually care about trying to address.
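
A minimal sketch of that check, assuming whatever "awk" is on PATH and a
script stored as UTF-8; the variable names are only illustrative. The
result depends on both the interpreter and the current locale, so no
particular output is implied:

```shell
# Ask awk how many characters it sees in a single wide CJK character.
# 1 means length() counts characters; 3 means it counted UTF-8 bytes.
chars=$(awk 'BEGIN { print length("宽") }')

if [ "$chars" -eq 1 ]; then
    verdict="multi-byte safe"
else
    verdict="byte-oriented"
fi
echo "length reported $chars: $verdict"
```

As noted above, this only distinguishes character counting from byte
counting; it says nothing reliable about non-Unicode multi-byte codesets.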

> > Are there some multi-byte locales where I could not count on
> > sprintf("%c", 23485) being "宽" in GNU Awk?
>
> 23485 is a single character value, so I don't understand how it is
> related to the locale's codeset being multi-byte. Instead, this has to
> do with the codeset itself and its representation and interpretation
> of codepoints.

I was imagining that, in order to "insulate the programmer from the
peculiarities of the underlying platform," GAWK might do something like
use iconv to ensure that character values are invariant regardless of
the locale, which is why I searched the source for iconv references.

Due to the way the "%c" format conversion works in GAWK, I ran into
something that I'm not sure qualifies as a bug: "%c" cannot be used to
write byte literals above 127 when the locale supports Unicode. An
example:

    $ mawk 'BEGIN { printf "%c", 255 }' | xxd
    00000000: ff                                       .
    $ gawk 'BEGIN { printf "%c", 255 }' | xxd
    00000000: c3bf                                     ..

I understand why this is the case, but I still find the behavior
surprising given that hexadecimal escapes in literals are interpreted as
bytes:

    $ gawk 'BEGIN { printf "\xff" }' | xxd
    00000000: ff                                       .

Assuming I didn't overlook an existing passage, it might at least be
worth mentioning this in the documentation. Since POSIX states that
"Multi-byte characters require multiple, concatenated [octal] escape
sequences of this type,"[1] it doesn't surprise me that "\377" is
interpreted as a single byte.
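
As a sketch of the escape behavior described above, the following emits
the two-byte UTF-8 encoding of U+00FF from two concatenated octal escapes
and a lone 0xff byte from a single escape. It assumes a generic "awk" on
PATH and uses od rather than xxd for portability; the variable names are
illustrative only:

```shell
# Two concatenated octal escapes produce the UTF-8 bytes c3 bf.
utf8_hex=$(awk 'BEGIN { printf "\303\277" }' | od -An -tx1 | tr -d ' \n')

# A single octal escape produces exactly one byte, ff.
byte_hex=$(awk 'BEGIN { printf "\377" }' | od -An -tx1 | tr -d ' \n')

echo "two escapes: $utf8_hex"
echo "one escape:  $byte_hex"
```

Writing the byte as an escape in a string literal, rather than passing a
number to "%c", sidesteps the numeric-to-character conversion shown in the
gawk example earlier.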

  [1]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_04

Eric


