bug-gnulib

Re: locale_charset() on MacOS X


From: Bruno Haible
Subject: Re: locale_charset() on MacOS X
Date: Fri, 27 Jan 2012 02:14:53 +0100
User-agent: KMail/4.7.4 (Linux/3.1.0-1.2-desktop; KDE/4.7.4; x86_64; ; )

Paul Eggert reported:
> <http://lists.gnu.org/archive/html/bug-bison/2012-01/msg00107.html>.

Akim Demaille wrote:
> I'm sending this message to you as the main author of
> the quotearg module.  I am not sure which component should
> be considered guilty here, but the problem is:
> 
> - independently of any LC_*, localcharset.c returns UTF-8
>   on OS X.
> 
> - If I instrument localcharset.c, I can see that the OS
>   returns "US-ASCII" as locale_codeset.
> 
> - localcharset's get_charset_aliases then maps US-ASCII
>   to UTF-8 ...
> 
> - so quotearg decides to use nice UTF-8 quotes (since
>   quote.c asks for locale-dependent quotes).
> 
> - so the test suite fails since it expects plain old "'".
> 
> What module would be considered faulty here?

The test suite is faulty.

Rationale:

  - The localcharset.c code is meant to return the character encoding
    of the current locale. It is much like nl_langinfo (CODESET), except
    that the latter is botched on many systems: on some it returns
    non-standard encoding names such as "646", on some an empty string,
    and on some (such as Cygwin or MacOS X) it returns "US-ASCII" when
    in reality the character encoding is different.

    localcharset.c can be seen as an override of nl_langinfo (CODESET),
    except that it does not (yet) have the form of a gnulib-style override.

  - POSIX [1] does not specify the character encoding of the "C" locale.
    It could be US-ASCII or any extension of it, such as ISO-8859-1 or
    UTF-8.

  - On MacOS X the Terminal.app's encoding and the general text encoding
    are UTF-8.

  - On MacOS X nearly all users are working in the "C" locale. If a user
    has told the OS that he's working in the French locale, the OS does
    not set LC_* variables to indicate this, nor does the user usually
    do so (why should he? he has already specified it once). Therefore
    the normal situation on MacOS X is this:
      $ env | grep LC_
      $ locale
      LANG=
      LC_COLLATE="C"
      LC_CTYPE="C"
      LC_MESSAGES="C"
      LC_MONETARY="C"
      LC_NUMERIC="C"
      LC_TIME="C"
      LC_ALL=

  - gettext() takes care to transliterate messages to the locale encoding.
    If locale_charset() is "UTF-8", 'rm --help' will show for a French
    user

        Usage: rm [OPTION]... FICHIER...
        Supprime (défait le lien) les FILE(s).
        ...

    and for a Chinese user

        用法:rm [选项]... 文件...
        Remove (unlink) the FILE(s).

    If locale_charset() is "US-ASCII", 'rm --help' will show instead:

        Usage: rm [OPTION]... FICHIER...
        Supprime (d'efait le lien) les FILE(s).

    and for a Chinese user no translation at all:

        Usage: rm [OPTION]... FILE...
        Remove (unlink) the FILE(s).

  - quotearg's use of gettext() and locale_charset() to determine whether
    to use ‘...’ instead of '...' is entirely appropriate, because
    1. In situations where gettext() is known to make use of non-ASCII
       characters in its resulting strings, it is also OK for quotearg
       to make use of such characters.
    2. quotearg is not used in places where POSIX demands a certain
       result in the "C" locale.
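
To make the quotearg point concrete, here is a minimal sketch in C of
the kind of decision involved. It is not gnulib's quotearg code:
pick_quotes() and struct quote_pair are hypothetical names, and the
charset argument stands in for whatever locale_charset() would return.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical pair of opening and closing quotation marks. */
struct quote_pair { const char *open; const char *close; };

/* Hypothetical helper: choose quotation marks from the name of the
   locale's character encoding.  Only "UTF-8" gets the curly quotes;
   every other encoding falls back to the plain ASCII apostrophe.  */
struct quote_pair pick_quotes(const char *charset)
{
  assert(charset != NULL);
  struct quote_pair ascii = { "'", "'" };
  /* U+2018 and U+2019, spelled out as UTF-8 byte sequences. */
  struct quote_pair curly = { "\xe2\x80\x98", "\xe2\x80\x99" };
  return strcmp(charset, "UTF-8") == 0 ? curly : ascii;
}
```

With a helper like this, a test that hard-codes '...' in its expected
output passes only where the charset is reported as something other
than "UTF-8". That is the failure mode Akim describes.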

In <http://lists.gnu.org/archive/html/bug-bison/2012-01/msg00091.html>
Akim also wrote:

> I had never realized that the tests are not specifying LC_ALL=C
> and they should.  But even when I do, I still have nice quotes.

Indeed there is a slight difference in behaviour between gettext()
and locale_charset(): setting the environment variable LC_ALL=C
disables all translations in gettext() (this is needed so that some
coreutils programs can be POSIX compliant), whereas locale_charset()
has no such special case.
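
The gettext() side of that difference can be sketched as follows. This
is a simplification under stated assumptions, not gettext's actual
code: struct locale_env, requested_locale() and translations_disabled()
are hypothetical names, the environment is passed in explicitly so the
logic is easy to follow, and the real library additionally handles
LANGUAGE and system defaults.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical snapshot of the environment variables that matter for
   one locale category (for example LC_MESSAGES).  */
struct locale_env {
  const char *lc_all;       /* value of LC_ALL, or NULL if unset */
  const char *lc_category;  /* value of the LC_xxx variable, or NULL */
  const char *lang;         /* value of LANG, or NULL */
};

/* POSIX treats an empty value like an unset variable. */
static int is_set(const char *value)
{
  return value != NULL && value[0] != '\0';
}

/* Return the locale name explicitly requested through the environment,
   using the POSIX precedence LC_ALL > LC_xxx > LANG, or NULL if the
   user requested nothing.  */
const char *requested_locale(const struct locale_env *env)
{
  assert(env != NULL);
  if (is_set(env->lc_all))
    return env->lc_all;
  if (is_set(env->lc_category))
    return env->lc_category;
  if (is_set(env->lang))
    return env->lang;
  return NULL;
}

/* Nonzero if the user explicitly asked for the "C" (or "POSIX") locale,
   in which case messages are returned untranslated.  */
int translations_disabled(const struct locale_env *env)
{
  const char *name = requested_locale(env);
  return name != NULL
         && (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}
```

With LC_ALL=C in the environment, translations_disabled() is true
regardless of LC_MESSAGES or LANG. With nothing set, as in the typical
MacOS X session shown earlier, it is false.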

There are several systems where the locale encoding is UTF-8 in all user
locales: Plan 9, BeOS, Haiku, MacOS X, Cygwin 1.7, and there will be more,
because it's a natural choice nowadays. In such environments, it makes
less and less sense to assign the US-ASCII encoding to the "C" locale.
US-ASCII encoding was a good choice for the "C" locale between 1996 and 2001,
as a transition between the ISO-8859-1 world and the UTF-8 world. It isn't
any more.

Let's fix the test suites.

Paul Eggert wrote:
> Does the following gnulib patch fix things for Bison on OS X?
> I'll CC: this to address@hidden, to give Bruno Haible
> a heads-up about the localcharset problem.
> 
> localcharset: port to Mac OS X's C locale
> * lib/localcharset.c (get_charset_aliases) [DARWIN7]:
> Map "US-ASCII" to "ASCII".  Problem reported by Akim Demaille in
> diff --git a/lib/localcharset.c b/lib/localcharset.c
> index d86002c..68ccf60 100644
> --- a/lib/localcharset.c
> +++ b/lib/localcharset.c
> @@ -262,6 +262,7 @@ get_charset_aliases (void)
>             "ISO8859-9" "\0" "ISO-8859-9" "\0"
>             "ISO8859-13" "\0" "ISO-8859-13" "\0"
>             "ISO8859-15" "\0" "ISO-8859-15" "\0"
> +           "US-ASCII" "\0" "ASCII" "\0"
>             "KOI8-R" "\0" "KOI8-R" "\0"
>             "KOI8-U" "\0" "KOI8-U" "\0"
>             "CP866" "\0" "CP866" "\0"

Nah. "Let's break gettext() based internationalization of all GNU programs
for most MacOS X users" won't get my approval.
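
For readers unfamiliar with the table the patch touches: it is a
sequence of NUL-terminated "alias", "canonical name" pairs, ended by an
empty string. A minimal sketch of a lookup over such a table follows.
resolve_alias() and demo_aliases are hypothetical names, and the table
contents are only illustrative (taken from the diff context), not the
table any particular platform really uses.

```c
#include <assert.h>
#include <string.h>

/* Illustrative table only: alias, canonical name, alias, canonical
   name, and so on.  The string literal's implicit trailing NUL forms
   the empty string that ends the table.  */
static const char demo_aliases[] =
  "ISO8859-15" "\0" "ISO-8859-15" "\0"
  "KOI8-R" "\0" "KOI8-R" "\0"
  "CP866" "\0" "CP866" "\0";

/* Hypothetical helper: return the canonical name paired with CODESET
   in TABLE, or CODESET itself if the table has no entry for it.  */
const char *resolve_alias(const char *table, const char *codeset)
{
  assert(table != NULL && codeset != NULL);
  const char *p = table;
  while (*p != '\0') {
    const char *alias = p;
    const char *canonical = alias + strlen(alias) + 1;
    if (strcmp(codeset, alias) == 0)
      return canonical;
    /* Skip past the canonical name to the next pair. */
    p = canonical + strlen(canonical) + 1;
  }
  return codeset;
}
```

An added pair such as "US-ASCII" "\0" "ASCII" "\0" therefore changes the
answer for every caller whose system reports "US-ASCII", which is why a
one-line change to the table has such a wide effect.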

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
    section 7.2
[2] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/df.html



