Re: [bug-gawk] Exposing wcwidth(3) as a built-in function

From: Eli Zaretskii
Subject: Re: [bug-gawk] Exposing wcwidth(3) as a built-in function
Date: Sat, 09 Dec 2017 10:51:15 +0200

> Date: Fri, 8 Dec 2017 15:25:34 -0800
> From: Eric Pruitt <address@hidden>
> Cc: address@hidden
> > Thanks, but doesn't this still assume UTF-8 encoding of characters?
> > If so, it's not portable to non-UTF-8 locales, right?
> I realized I may've misinterpreted your question, so I will clarify and
> add a question of my own: only the code for interpreters that are not
> multi-byte safe falls back to manual UTF-8 parsing. This means that in
> GAWK, the lookup table uses lexical comparisons assuming the locale is
> multi-byte safe.

What do you mean by "multi-byte safe locale"?  UTF-8 is but one
multi-byte encoding; it is not the only one.

> Are there some multi-byte locales where I could not count on
> sprintf("%c", 23485) being "宽" in GNU Awk?

23485 is a single character value, so I don't understand how it is
related to the locale's codeset being multi-byte.  Instead, this has
to do with the codeset itself and its representation and
interpretation of codepoints.  E.g., if the locale's codeset is some
ISO-2022 variant, where codepoints are specific to each charset, I
think 23485 could very well be something else.  For example, in
codepage 936, this character's codepoint is 49133.  (Codepage 936 is
used by Windows in Simplified Chinese locales; it is a multibyte
encoding whose characters take one or two bytes each, a narrower
range than UTF-8's one to four.)
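
To make the numbers above concrete, here is a minimal sketch in Python (not Gawk) that encodes the same character under two codesets. It assumes Python's "gbk" codec is an acceptable stand-in for codepage 936; the helper name encoded_value is hypothetical and exists only for this illustration.

```python
# The character from the thread: Unicode codepoint 23485 (U+5BBD).
ch = chr(23485)

def encoded_value(text: str, codeset: str) -> int:
    """Return the encoded bytes of `text` read as one big-endian integer."""
    return int.from_bytes(text.encode(codeset), "big")

utf8_bytes = ch.encode("utf-8")  # three bytes in UTF-8
gbk_bytes = ch.encode("gbk")     # two bytes; "gbk" stands in for codepage 936

# ASCII-only output, so this runs regardless of the terminal's encoding.
print("utf-8:", utf8_bytes.hex(), "->", encoded_value(ch, "utf-8"))
print("gbk:  ", gbk_bytes.hex(), "->", encoded_value(ch, "gbk"))
```

The same abstract character therefore has the value 23485 only when the number is read as a Unicode codepoint; read as a codepage 936 byte sequence, it is 49133.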

> From running
> "fgrep -ir iconv --include '*.h' --include '*.c'", it doesn't look like
> GAWK uses iconv. Perhaps a more accurate question is, will GAWK work on
> platforms that do not have **any** Unicode support (be it UTF-8, UTF-16,
> etc.)?

It already does: MS-Windows is one such platform.  (It does support
UTF-16, but using that would mean abandoning 'char *' pointers for
text, which would require a thorough rewrite of most of Gawk's code,
something I don't expect to happen just to cater to Windows.)

> * I have since rewritten the code for multi-byte unsafe interpreters so
>   the lookup table is indexed by UTF-8 byte strings instead of numeric
>   code points for performance reasons.

Using the UTF-8 byte sequences was the reason why I asked whether your
implementation relies on UTF-8.  In a locale whose codeset is not
UTF-8, this will not work well.
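
The failure mode can be sketched as follows, again in Python rather than Awk. This is an illustration under stated assumptions, not the code under discussion: the one-entry table WIDTH_BY_UTF8 and both function names are hypothetical, "gbk" stands in for a non-UTF-8 locale codeset, and unicodedata.east_asian_width is used as a rough proxy for wcwidth(3) (it ignores combining and control characters).

```python
import unicodedata

WIDE = chr(0x5BBD)  # the character discussed in the thread

# Hypothetical one-entry lookup table keyed by UTF-8 byte strings.
WIDTH_BY_UTF8 = {WIDE.encode("utf-8"): 2}

def width_by_bytes(raw: bytes) -> int:
    """Look up the raw input bytes directly; -1 means 'not in the table'."""
    return WIDTH_BY_UTF8.get(raw, -1)

def width_by_decoding(raw: bytes, codeset: str) -> int:
    """Decode with the given codeset first, then classify the character.

    `raw` must encode exactly one character.
    """
    ch = raw.decode(codeset)
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

print(width_by_bytes(WIDE.encode("utf-8")))            # key matches
print(width_by_bytes(WIDE.encode("gbk")))              # same character, no match
print(width_by_decoding(WIDE.encode("gbk"), "gbk"))    # codeset-aware lookup
```

A table keyed by byte strings answers correctly only when the input bytes happen to be UTF-8; decoding with the locale's codeset before the lookup removes that dependency.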
