
Re: GNU gettext 0.22 broke non-Unicode msgids


From: Bruno Haible
Subject: Re: GNU gettext 0.22 broke non-Unicode msgids
Date: Tue, 28 Nov 2023 01:42:07 +0100

Hello,

Robert Clausecker wrote:
> I am maintaining a project that uses msgids in ISO-8859-1 encoding.
> To build the message catalogues for the project

This has been discouraged for decades. Citing the GNU gettext documentation
<https://www.gnu.org/software/gettext/manual/html_node/Charset-conversion.html>:
  "Note that the msgid argument to gettext is not subject to character
   set conversion. Also, when gettext does not find a translation for
   msgid, it returns msgid unchanged – independently of the current output
   character set. It is therefore recommended that all msgids be US-ASCII
   strings."
This has been in the documentation since 2001.

Similarly, citing the xgettext documentation
<https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html#Input-file-interpretation>:
  "By default the input files are assumed to be in ASCII."

Programs like the SCHILY_utils that you mention thus print, when no
translation is performed,
  Calc release 1.27 2021/08/20 (x86_64-unknown-linux-gnu) Copyright (C) 1985, 89-91, 1996, 2000-2021 Jörg Schilling
in ISO-8859-1 locales, but
  Calc release 1.27 2021/08/20 (x86_64-unknown-linux-gnu) Copyright (C) 1985, 89-91, 1996, 2000-2021 J�rg Schilling
(with a Unicode REPLACEMENT CHARACTER) in UTF-8 locales. But UTF-8 locales
are the norm nowadays on Unix/Linux systems, from Linux/glibc over Linux/musl
to macOS and Solaris OpenIndiana/OmniOS.

So, what these programs do is to print Jörg Schilling's name in an unreadable
way.
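
To illustrate the effect, here is a minimal Python sketch. The helper name
render() is made up for this example, and decoding with errors="replace"
is only an approximation of what a typical UTF-8 terminal does with an
invalid byte:

```python
# The name as stored in an ISO-8859-1 encoded source file: byte 0xF6 is 'ö'.
RAW_NAME = b"J\xf6rg Schilling"


def render(raw: bytes, terminal_encoding: str) -> str:
    """Approximate what a terminal using terminal_encoding displays.

    Bytes that are invalid in that encoding become U+FFFD
    (REPLACEMENT CHARACTER).
    """
    return raw.decode(terminal_encoding, errors="replace")


if __name__ == "__main__":
    print(render(RAW_NAME, "iso-8859-1"))  # readable in an ISO-8859-1 locale
    print(render(RAW_NAME, "utf-8"))       # unreadable in a UTF-8 locale
```

The same untranslated bytes are readable only when the terminal happens to
use the encoding of the source file.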

Another problem with the practice of using literal ISO-8859-1 strings in source
code is that it does not work well for developers in a UTF-8 locale. For
example:
$ grep 'g Schilling"' `find . -name '*.c'`
/bin/grep: ./translit/translit.c: binary file matches
/bin/grep: ./p/p.c: binary file matches
./cdrecord/cdrecord.c:        _("Joerg Schilling"));
./scgcheck/scgcheck.c:        _("Joerg Schilling"));
./scgcheck/scgcheck.c:        _("Joerg Schilling"));
/bin/grep: ./sformat/fmt.c: binary file matches
/bin/grep: ./mdigest/mdigest.c: binary file matches
...
It requires an extra option to 'grep', namely '-a', in this situation.
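
The "binary file matches" messages are consistent with grep treating bytes
that are invalid in the locale's encoding as a sign of non-text data. A small
Python sketch of such a validity check follows; is_valid_utf8() is a
hypothetical helper for this example, not part of grep or gettext:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # A source line with the name written in ISO-8859-1 (0xF6 = 'ö').
    latin1_line = b'_("J\xf6rg Schilling"));'
    # The same line with an ASCII-only spelling.
    ascii_line = b'_("Joerg Schilling"));'
    print(is_valid_utf8(latin1_line))
    print(is_valid_utf8(ascii_line))
```

Running such a check over the source tree would list the files that need
conversion before they behave as text in a UTF-8 locale.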

The world migrated from ISO-8859-* locales to UTF-8 locales from ca.
2000 to 2013. For at least 10 years, more than 99% of the Linux/Unix
users have been either in the "C" locale or in a UTF-8 locale.

> I previously used
> 
>     ${SETENV} LC_ALL=de_DE.ISO8859-1 msgfmt -o ${WRKDIR}/SCHILY_utils.mo ${WRKSRC}/SCHILY_utils.po

Note: Setting LC_ALL=de_DE.ISO8859-1 for the invocation of msgfmt has an
effect only on the diagnostics produced by msgfmt. It does not have an
effect on the generated .mo file.

> However, since the update to gettext 0.22, this command fails to
> produce a correct message catalogue. As the msgids are transcoded
> to Unicode, gettext() no longer finds messages corresponding to our
> msgids.

It is a correct message catalogue. The change was made public in the
gettext-0.22 NEWS file:

* Portability:
  - On systems with musl libc, the *gettext() functions in libc now work
    with MO files generated from PO files with an encoding other than UTF-8.
    To this effect, the msgfmt program now converts the messages to UTF-8
    encoding before storing them in a MO file.  You can prevent this by
    using the msgfmt --no-convert option.
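
Why the lookup then fails for non-ASCII msgids can be shown with a toy model
in Python. This is only a sketch: a dict stands in for the .mo file,
build_catalog() and lookup() are hypothetical names, and the sample entry is
invented. The one property taken from the documentation is that the msgid
passed to gettext() is compared as-is, without character set conversion:

```python
def build_catalog(po_entries, po_charset, convert_to_utf8):
    """Toy stand-in for msgfmt: map msgid bytes to msgstr bytes.

    With convert_to_utf8=True the entries are stored in UTF-8 (modeling the
    0.22 default); otherwise they keep the PO file's charset (modeling
    --no-convert).
    """
    out_charset = "utf-8" if convert_to_utf8 else po_charset
    return {
        msgid.encode(out_charset): msgstr.encode(out_charset)
        for msgid, msgstr in po_entries
    }


def lookup(catalog, msgid_bytes):
    """Toy stand-in for gettext(): return msgid unchanged if not found."""
    return catalog.get(msgid_bytes, msgid_bytes)


if __name__ == "__main__":
    entries = [("Gr\u00f6\u00dfe", "Size")]  # invented sample entry
    # The program was compiled from ISO-8859-1 source, so it passes these bytes:
    key = "Gr\u00f6\u00dfe".encode("iso-8859-1")

    converted = build_catalog(entries, "iso-8859-1", convert_to_utf8=True)
    unconverted = build_catalog(entries, "iso-8859-1", convert_to_utf8=False)

    print(lookup(converted, key))    # no match: the key in the catalog is UTF-8
    print(lookup(unconverted, key))  # match: the bytes are identical
```

With US-ASCII msgids the two catalogs have identical keys, so the conversion
makes no difference to the lookup.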

> As a workaround, I have now applied the new --no-convert option, but
> this is not a good solution as your tool broke compatibility.

Nope, '--no-convert' is not a good workaround because

  * It does not fix the output in UTF-8 locales. Printing proper names
    correctly regardless of locale encoding requires specialized code,
    such as the Gnulib module 'propername', that makes use of iconv().

  * It is possible that in a few years, all .mo files will be UTF-8 encoded,
    i.e. the option '--no-convert' may go away. Similarly, it is possible
    that in a few years, xgettext will assume that input files are always
    in UTF-8 encoding, i.e. that the xgettext option --from-code will go away.
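
The idea behind the first point can be sketched in Python as follows. This is
not the Gnulib 'propername' code or its API; proper_name() is a hypothetical
helper that only shows the principle: keep an ASCII spelling in the source,
convert the Unicode spelling to the locale's encoding at run time, and fall
back to the ASCII spelling when the conversion is not possible.

```python
def proper_name(name_ascii: str, name_unicode: str, locale_encoding: str) -> bytes:
    """Return the bytes to print for a proper name in the given encoding.

    Falls back to the ASCII spelling if the Unicode spelling cannot be
    represented in locale_encoding or the encoding is unknown.
    """
    try:
        return name_unicode.encode(locale_encoding)
    except (UnicodeEncodeError, LookupError):
        return name_ascii.encode("ascii")


if __name__ == "__main__":
    for encoding in ("utf-8", "iso-8859-1", "ascii"):
        print(encoding, proper_name("Joerg Schilling", "J\u00f6rg Schilling", encoding))
```

In a real program the encoding would come from the current locale, for
example via locale.getpreferredencoding(False) in Python.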

> ()  ascii ribbon campaign - for an 8-bit clean world 

This is hopelessly outdated as well. The goal of an 8-bit clean world was
relevant from 1987 to 2000. Since 2000, the goal is to support multilingual
text through Unicode (for which the 8-bit clean world is a prerequisite).

Best regards,

      Bruno





