bug-gnu-utils


Re: supporting obscure languages


From: Albert Cahalan
Subject: Re: supporting obscure languages
Date: Sat, 28 Nov 2009 14:53:26 -0500

On Sat, Nov 28, 2009 at 10:49 AM, Bruno Haible <address@hidden> wrote:

> It is similar to LC_MESSAGES=zam_MX.UTF-8 LANG=fr_FR.UTF-8, which would
> be a perfectly reasonable choice for a user with French preferences but
> Zapotec language. POSIX allows users to combine different aspects of
> locales in this way.

POSIX does, but the library does not. If the library followed POSIX
then I could combine LC_MESSAGES=zam with LANG=C.

In other words, this looks like a POSIX violation to me.
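
As a rough sketch of the POSIX rule being argued about here (not the library's actual code): a category takes its name from LC_ALL if that is set and non-empty, otherwise from its own variable, otherwise from LANG. The helper names below are made up for illustration, and the final fallback to "C" is an assumption, since POSIX leaves that default implementation-defined.

```c
#include <stdlib.h>
#include <string.h>

/* Returns s if it is a non-empty string, otherwise NULL. */
static const char *nonempty(const char *s)
{
    return (s != NULL && s[0] != '\0') ? s : NULL;
}

/*
 * Resolve the locale name requested for one category, using the
 * POSIX precedence: LC_ALL, then the category's own variable,
 * then LANG. category_var is the variable name, e.g. "LC_MESSAGES".
 * The final "C" is an assumption (hypothetical default): POSIX
 * leaves that case implementation-defined.
 */
static const char *requested_locale(const char *category_var)
{
    const char *v;

    if ((v = nonempty(getenv("LC_ALL"))) != NULL)
        return v;
    if ((v = nonempty(getenv(category_var))) != NULL)
        return v;
    if ((v = nonempty(getenv("LANG"))) != NULL)
        return v;
    return "C";
}
```

With LC_MESSAGES=zam, LANG=C, and neither LC_ALL nor LC_TIME set, requested_locale("LC_MESSAGES") gives "zam" while requested_locale("LC_TIME") gives "C". That is the combination I would expect to work.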

> It is gross, but it is a consequence of your desire to use a language
> for which the locale does not exist or is not installed, and therefore
> to do in your program what normally the users do in their system. This is
> not typical. The normal case is that users set their preferences in a
> central location and these preferences get transmitted to the programs via
> environment variables.

The only part I need is installed: zam.mo

Since I never try to format time, the library shouldn't even try
to load the data for that. The missing stuff shouldn't affect
anything since I'm not attempting to use it. Supposing I did
try to format time though, that could fall back to some typical default.

Basically this isn't fail-safe. Some chunk of locale data goes
missing, and suddenly the whole thing dies.
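
For illustration, here is roughly what an application can do today to be more fail-safe on its own side: try the whole environment at once, and if that fails, set the categories one at a time. This is only a sketch with a made-up function name. It does not change the library behaviour complained about above, because setlocale still fails for any category whose locale data is missing.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Try to apply the environment's locale to every category at once;
 * if that fails, fall back to one category at a time so that one
 * unavailable category does not prevent the others from being set.
 * Returns the number of categories that were set from the environment.
 * (LC_MESSAGES is a POSIX category, not part of ISO C.)
 */
static int set_locale_best_effort(void)
{
    static const int categories[] = {
        LC_CTYPE, LC_COLLATE, LC_MESSAGES,
        LC_MONETARY, LC_NUMERIC, LC_TIME
    };
    const int n = (int)(sizeof categories / sizeof categories[0]);
    int ok = 0;

    if (setlocale(LC_ALL, "") != NULL)
        return n;

    for (int i = 0; i < n; i++) {
        /* On failure setlocale returns NULL and leaves the
           category at its previous value. */
        if (setlocale(categories[i], "") != NULL)
            ok++;
    }
    return ok;
}
```

A caller could treat any non-zero return as good enough and carry on, rather than treating one missing category as fatal.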

>> I'm depending on some random unrelated locale
>> just to get normal UTF-8 behavior.
>
> Yes, this is worrying. But nowadays, on most desktop systems, at least
> one user locale is installed, it uses UTF-8 encoding, and you can
> query it through setlocale(LC_ALL, "").
>
> The systems with only the "C" locale are small-memory devices like
> routers.

That was my system until I started debugging this problem,
and in fact an apt-get hook wipes out locales every time I
install packages.

This is because en_US.UTF-8 has defective collation order,
and because I don't normally need translations. If I were to
set either LANGUAGE or LC_MESSAGES alone though,
that ought to get me translations despite anything else.
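
The probe suggested in the quoted text, setlocale(LC_ALL, "") followed by a look at the resulting encoding, might be sketched like this. The function names are hypothetical, and accepting both "UTF-8" and "utf8" as spellings is an assumption about what nl_langinfo(CODESET) may report on different systems.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <strings.h>

/* True if a codeset name denotes UTF-8 ("UTF-8" or "UTF8", any case). */
static bool is_utf8_name(const char *codeset)
{
    return codeset != NULL &&
           (strcasecmp(codeset, "UTF-8") == 0 ||
            strcasecmp(codeset, "UTF8") == 0);
}

/*
 * Adopt the user's locale from the environment and report whether
 * its character encoding is UTF-8. Returns false if the locale
 * could not be set at all.
 */
static bool user_locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return false;
    return is_utf8_name(nl_langinfo(CODESET));
}
```

On a system with only the "C" locale installed this would be expected to report false, which is the situation described above.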

> Internationalization of a program consists of three parts:
>  1) Make use of the Unicode character set.
>  2) Provide translations for messages.
>  3) Do the following in a locale dependent way: display of time,
>     display of currency, computations with calendar, display of
>     Hanzi ideographs (Chinese vs. Japanese - same Unicode code
>     point, different glyphs), form for entering a postal address,
>     arrangement of GUI components (right-to-left), etc.

Well no, not unless the program needs it. OTOH, Tux Paint
localizes things you don't even handle: audio clips, fonts,
font size, font vertical position, and right-to-left text rendering.

In any case, part of a locale is better than none. Right now
you're essentially saying that incomplete localization isn't
allowed; it's all or nothing.

> With a "C" locale in UTF-8 encoding, you would get part 1). You would
> not get part 2), because gettext() must not use the translation message
> catalogs in the "C" locale. You would also not get part 3), because
> strftime etc. also must not use localized values in the "C" locale.
> That's because in POSIX, the "C" locale is the locale to be set when you
> want to know ahead of time the output format of "ls", "df", "date" etc.

Ah, but I asked for a different locale.

LANGUAGE: not set to "C"
LC_ALL: not set to "C"
LC_MESSAGES: not set to "C"
LANG: not set to "C"
setlocale's 2nd parameter: not set to "C"

That right there means I didn't want the "C" locale. Additionally,
at least one of those things is not blank/empty/missing, so you
certainly know which locale I want. I expect best-effort.
I even called bind_textdomain_codeset, so UTF-8 is explicit.
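
For reference, the sequence of calls in question looks roughly like the sketch below. The wrapper name init_messages is made up, and the domain and directory are whatever the caller passes in; nothing here is taken from the actual program.

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/*
 * Set up message translation for one text domain.
 * domain and localedir are supplied by the caller (hypothetical
 * values in any example call).
 * Returns 0 on success, -1 if any step failed.
 */
static int init_messages(const char *domain, const char *localedir)
{
    /* May fail when the requested locale is not installed; carry on
       anyway so the later steps are still attempted. */
    int rc = (setlocale(LC_ALL, "") != NULL) ? 0 : -1;

    if (bindtextdomain(domain, localedir) == NULL)
        rc = -1;
    /* Ask for translations to be returned as UTF-8 regardless of
       the locale's own encoding. */
    if (bind_textdomain_codeset(domain, "UTF-8") == NULL)
        rc = -1;
    if (textdomain(domain) == NULL)
        rc = -1;
    return rc;
}
```

After this, gettext("...") returns the translation if a catalog is found and used, and the original string otherwise.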

Had I set nothing, I still wouldn't be asking for "C". You could
give me a "generic.UTF-8" or "NULL.UTF-8" locale that works.

BTW, even the strings being passed to gettext() are UTF-8.
I have things like the ellipsis, so it's still UTF-8 even when the
translation is dumped on the floor.

> But I agree with you that it would be useful if more Linux distributors
> would install an en_US.UTF-8 locale always.

Debian seems to have chosen to add C.UTF-8. From my reading of
the code, it looks like that will fail. They'll patch it I'm sure.



