Re: iso-codes .pot msgid strings contain non-ASCII characters
From: Bruno Haible
Subject: Re: iso-codes .pot msgid strings contain non-ASCII characters
Date: Mon, 24 Apr 2006 23:02:04 +0200
User-agent: KMail/1.5

Paul Eggert wrote:
> the GNU gettext manual says:
>
> Note that the MSGID argument to `gettext' is not subject to
> character set conversion. Also, when `gettext' does not find a
> translation for MSGID, it returns MSGID unchanged - independently of
> the current output character set. It is therefore recommended that all
> MSGIDs be US-ASCII strings.
This recommendation is directed to the "normal" use of xgettext, i.e.
extraction of the msgids from source code. The other issue - not mentioned
in the GNU gettext manual, but quite important - is that source code should
be viewable in different encodings, and when you convert some source code
from ISO-8859-1 to UTF-8 (or vice versa), the behaviour of the program
should remain the same.
The situation for iso-codes is different, because
- It is not extracted from source code; the use of XML files for the
  list of country/location names greatly reduces the problems that could
  arise if these files were stored in a different encoding (thanks to
  the encoding declaration present in XML files).
- There are quite a number of languages/countries/locations in the world
  whose names cannot be written in ASCII, such as Norwegian Bokmål, Côte
  d'Ivoire, etc.
Therefore I think it's actually OK for iso-codes to use UTF-8 as encoding
of the msgids.
The only remaining problem is in the C code: A program running in, say, an
EUC-JP locale, needs to be a little careful when accessing the message
catalog: not just
  country_translation = dgettext ("iso-codes", country_english_utf8);
but
  country_translation = dgettext ("iso-codes", country_english_utf8);
  if (country_translation == country_english_utf8)
    {
      /* Not found in the message catalog. Use the English name, converted
         to the correct encoding. */
      country_translation =
        iconv_string (country_translation, "UTF-8", locale_charset ());
    }
You can find code that is a little better than this (it takes care of
transliteration, a non-canonicalized locale_charset() result, etc.) in
propername.c at
http://cvs.savannah.gnu.org/viewcvs/*checkout*/gettext/gettext-tools/lib/propername.c?content-type=text%2Fplain&rev=1.1&root=gettext
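For illustration only, the conversion step in the fallback branch could be
sketched with plain POSIX iconv roughly as follows. This is not the code from
propername.c; the function name convert_from_utf8 is made up for this example,
and the sketch assumes the target encoding is an ordinary NUL-terminated
multibyte encoding (as locale encodings are).

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated UTF-8 string UTF8 to the
   encoding TO_CODE.  Returns a freshly malloc'ed string that the caller
   must free, or NULL if the conversion is not possible.  */
char *
convert_from_utf8 (const char *utf8, const char *to_code)
{
  iconv_t cd = iconv_open (to_code, "UTF-8");
  if (cd == (iconv_t) -1)
    return NULL;                /* unknown or unsupported encoding */

  char *in_ptr = (char *) utf8; /* iconv never writes through this pointer */
  size_t in_left = strlen (utf8);
  size_t out_size = in_left + 16;
  size_t used = 0;
  char *out = malloc (out_size);

  while (out != NULL)
    {
      char *out_ptr = out + used;
      size_t out_left = out_size - used - 1;    /* keep room for the NUL */
      size_t rc = iconv (cd, &in_ptr, &in_left, &out_ptr, &out_left);
      if (rc != (size_t) -1)
        /* Input consumed; flush any pending shift sequence.  */
        rc = iconv (cd, NULL, NULL, &out_ptr, &out_left);
      used = (size_t) (out_ptr - out);
      if (rc != (size_t) -1)
        {
          out[used] = '\0';
          break;
        }
      if (errno != E2BIG)
        {
          /* Invalid input or a character with no representation.  */
          free (out);
          out = NULL;
          break;
        }
      /* Output buffer too small: grow it and continue where we stopped.  */
      out_size *= 2;
      char *bigger = realloc (out, out_size);
      if (bigger == NULL)
        free (out);
      out = bigger;
    }

  iconv_close (cd);
  return out;
}
```

A caller would pass the locale's encoding as TO_CODE, for example the result
of nl_langinfo (CODESET) after setlocale (LC_ALL, ""), or of gnulib's
locale_charset (). With glibc and GNU libiconv, appending "//TRANSLIT" to
TO_CODE asks for transliteration of characters that have no exact equivalent;
that suffix is an extension and not part of POSIX.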
In other words, UTF-8 is the current de-facto standard encoding. I would leave
the iso-codes PO files in that encoding, and keep support for other encodings
purely in the C code that uses the .mo files.
> Can the format of the XML country list be extended to contain two
> spellings, one in UTF-8, one ASCII-ized? Then the algorithm wouldn't
> need to transcode.
The transliteration in glibc and libiconv is good enough.
Bruno