Re: GNU gettext 0.22 broke non-Unicode msgids


From: Bruno Haible
Subject: Re: GNU gettext 0.22 broke non-Unicode msgids
Date: Tue, 28 Nov 2023 03:34:51 +0100

[Please keep bug-gettext in CC.]

Hi Robert,

Robert Clausecker wrote:
> Hi Bruno,
> 
> Am Tue, Nov 28, 2023 at 01:42:07AM +0100 schrieb Bruno Haible:
> > Hello,
> > 
> > Robert Clausecker wrote:
> > > I am maintaining a project that uses msgids in ISO-8859-1 encoding.
> > > To build the message catalogues for the project
> > 
> > This has been discouraged for decades. Citing the GNU gettext documentation
> > <https://www.gnu.org/software/gettext/manual/html_node/Charset-conversion.html>:
> >   "Note that the msgid argument to gettext is not subject to character
> >    set conversion. Also, when gettext does not find a translation for
> >    msgid, it returns msgid unchanged – independently of the current output
> >    character set. It is therefore recommended that all msgids be US-ASCII
> >    strings."
> > This is in the documentation since 2001.
> 
> Thank you for your input.  It is unfortunate that you are hostile to people
> who wish to have their translation strings in foreign languages that cannot be
> represented in ASCII, but so be it.

Oh, I'm not hostile to such people :-) Simply, when this documentation was
written, in 2001, it was a sensible restriction because
  - It makes sense to go through English as the pivot language (rather than,
    say, Latin or Esperanto), i.e. provide all msgids in English. Even if
    the author of the program is German or Chinese. This is because English
    is the language for which one can find the most translators.
  - English can be mostly written with US-ASCII only.
  - There is much less charset-conversion complexity in the runtime and in the
    tools if we can assume that the msgid is US-ASCII only.

It would have been possible to design the gettext() function in such a way that
it does charset conversion on the msgid when no translation is found. But it
wasn't designed this way.

At some point in the future, we may assume a UTF-8-only world, where the
msgid and the gettext() result are both in UTF-8 in all cases. Then it will be
possible to use German or Chinese or whatever as the language of the msgids. But
this would still have the drawback of making it harder to find translators.

> We have carefully designed our
> translations to ensure that a translated string is always present, so this
> advisory does not apply to us.

How can you ensure this? The translation is looked up from a .mo file in a
location that encodes the locale. And it is impossible to enumerate all locales,
because
  - new ones are being added every year (from Igbo to Aztec),
  - the user is free to create their own locales, through POSIX 'localedef'.
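
For illustration, a minimal sketch of the lookup (the catalog directory and
domain name here are placeholders): the .mo file that gets consulted is
derived from whatever locale the user happens to run in, so "installing
catalogs for all locales" is an open-ended task.

  #include <libintl.h>
  #include <locale.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* The locale comes from the environment (LC_ALL, LC_MESSAGES, LANG).  */
    setlocale (LC_ALL, "");

    /* gettext() then looks for
         <dirname>/<locale>/LC_MESSAGES/<domain>.mo
       e.g. /usr/share/locale/de_DE/LC_MESSAGES/SCHILY_utils.mo,
       falling back through variants of the locale name.  */
    bindtextdomain ("SCHILY_utils", "/usr/share/locale");
    textdomain ("SCHILY_utils");

    puts (gettext ("Joerg Schilling"));
    return 0;
  }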

> > Similarly, citing the xgettext documentation
> > <https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html#Input-file-interpretation>:
> >   "By default the input files are assumed to be in ASCII."
> > 
> > Programs like the SCHILY_utils that you mention thus, when no
> > translation is performed, print
> >   Calc release 1.27 2021/08/20 (x86_64-unknown-linux-gnu) Copyright (C) 1985, 89-91, 1996, 2000-2021 Jörg Schilling
> > in ISO-8859-1 locales, but
> >   Calc release 1.27 2021/08/20 (x86_64-unknown-linux-gnu) Copyright (C) 1985, 89-91, 1996, 2000-2021 J�rg Schilling
> > (with a Unicode REPLACEMENT CHARACTER) in UTF-8 locales. But UTF-8 locales
> > are the norm nowadays on Unix/Linux systems, from Linux/glibc over Linux/musl
> > to macOS and Solaris OpenIndiana/OmniOS.
> 
> Fixing the display of this name is largely why a message catalogue is used in
> the first place.

A message catalog is not enough for this purpose, precisely because of the
charset encoding of the msgid.

> The other reason is to serve as a way to test if NLS support was implemented
> correctly in the tools.  We plan to roll out localised messages throughout
> all of the tools in the future.  The msgid strings will likely be in
> ISO-8859-1 for the future, too, as that is the character set used throughout
> the project.

In this case, to make the output work right in UTF-8 locales, you will need to
create your own variant of the gettext() function that performs an iconv()
conversion if gettext() has returned the original msgid untranslated.
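
Roughly like this, as a minimal sketch: it assumes the msgids are ISO-8859-1
and the current locale is UTF-8 (a real implementation would check
nl_langinfo (CODESET), cache the iconv_t descriptor, and size the buffer
properly), and it relies on the fact that GNU gettext() returns the msgid
pointer itself when no translation was found.

  #include <libintl.h>
  #include <iconv.h>
  #include <string.h>

  /* Like gettext(), but converts the ISO-8859-1 msgid to UTF-8 when no
     translation was found.  Uses a static buffer, hence not reentrant.  */
  static const char *
  my_gettext (const char *msgid)
  {
    const char *translation = gettext (msgid);
    if (translation != msgid)
      return translation;                /* found in the message catalog */

    static char buf[1024];
    char *inptr = (char *) msgid;
    size_t inleft = strlen (msgid);
    char *outptr = buf;
    size_t outleft = sizeof (buf) - 1;

    iconv_t cd = iconv_open ("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1)
      return msgid;
    if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
      {
        iconv_close (cd);
        return msgid;
      }
    iconv_close (cd);
    *outptr = '\0';
    return buf;
  }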

> > So, what these programs do is to print Jörg Schilling's name in an
> > unreadable way.
> 
> They only do so if no message catalogue is installed or if it is defective, as
> it is with gettext-0.22.  It is no surprise that a program designed to use NLS
> misbehaves if its NLS support is intentionally crippled.

I responded to this already in the previous mail.

> > Another problem of the practice of using literal ISO-8859-1 strings in
> > source code is that it does not work well for developers in a UTF-8 locale.
> > For example:
> > $ grep 'g Schilling"' `find . -name '*.c'`
> > /bin/grep: ./translit/translit.c: binary file matches
> > /bin/grep: ./p/p.c: binary file matches
> > ./cdrecord/cdrecord.c:                  _("Joerg Schilling"));
> > ./scgcheck/scgcheck.c:                  _("Joerg Schilling"));
> > ./scgcheck/scgcheck.c:                  _("Joerg Schilling"));
> > /bin/grep: ./sformat/fmt.c: binary file matches
> > /bin/grep: ./mdigest/mdigest.c: binary file matches
> > ...
> > It requires an extra option to 'grep', namely '-a', in this situation.
> 
> Yes, this project assumes you work in the ISO-8859-1 locale.  We do not plan
> to change that.  Likewise, projects whose source code is in UTF-8 are hard to
> work with for people in other locales.  We recognise this limitation but
> would like to stay with our choice.

Your choice. But new co-developers will tell you that they have a hard time
working in an ISO-8859-1 codebase, when their locale is UTF-8.

> > The world migrated from ISO-8859-* locales to UTF-8 locales from ca.
> > 2000 to 2013. For at least 10 years, more than 99% of Linux/Unix users
> > have been either in the "C" locale or in a UTF-8 locale.
> 
> Note that this depends on country.  E.g. CJK users are not entirely happy with
> Unicode as it butchers Han characters in various unpleasant ways.

Your information is outdated.
1. It was only the Japanese community which was upset about Han unification.
   (Chinese and Korean people were happy with it.)
2. Their issues were addressed in Unicode, through the addition of variation
   selectors (and probably also specialized fonts and/or tailoring in the
   rendering engines). The Japanese complaints have since died down.

> So in
> Taiwan, China, Japan, and Korea, you'll keep seeing non-Unicode users for the
> foreseeable future.

I was under the same impression, until I reviewed the use of GB18030 in Linux
distros recently.
<https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html>
Even distros specialized for the Chinese market put the user into a UTF-8
locale — not only by default, but *always*.

> We too have no plans to change from ISO-8859-1 as Unicode
> is incompatible with old systems supported by Schilytools.

I made the change in gettext 0.22 because ISO-8859-1 is incompatible with
modern systems such as musl libc.

> > > I previously used
> > > 
> > >     ${SETENV} LC_ALL=de_DE.ISO8859-1 msgfmt -o ${WRKDIR}/SCHILY_utils.mo ${WRKSRC}/SCHILY_utils.po
> > 
> > Note: Setting LC_ALL=de_DE.ISO8859-1 for the invocation of msgfmt has an
> > effect only on the diagnostics produced by msgfmt. It does not have an
> > effect on the generated .mo file.
> 
> Thank you for this information.  I'll remove this ${SETENV} invocation
> under the assumption that it does not affect output.  Other tools do
> sometimes have locale-defined behaviour, so I wanted to avoid any trouble
> here.

You're welcome.

> > > However, since the update to gettext 0.22, this command fails to
> > > produce a correct message catalogue. As the msgids are transcoded
> > > to Unicode, gettext() no longer finds messages corresponding to our
> > > msgids.
> > 
> > It is a correct message catalogue. The change was made public in the
> > gettext-0.22 NEWS file:
> > 
> > * Portability:
> >   - On systems with musl libc, the *gettext() functions in libc now work
> >     with MO files generated from PO files with an encoding other than UTF-8.
> >     To this effect, the msgfmt program now converts the messages to UTF-8
> >     encoding before storing them in a MO file.  You can prevent this by
> >     using the msgfmt --no-convert option.
> > 
> > > As a workaround, I have now applied the new --no-convert option, but
> > > this is not a good solution as your tool broke compatibility.
> > 
> > Nope, '--no-convert' is not a good workaround because
> > 
> >   * It does not fix the output in UTF-8 locales. Printing proper names
> >     correctly regardless of the locale encoding requires specialized code,
> >     such as the Gnulib module 'propername', which makes use of iconv().
> 
> Once again, this limitation is not relevant for us as we ensure that message
> catalogues for all locales are installed.

You can't enumerate all locales. See above.

> But it is also unfortunate that you effectively tell us that there is no
> workaround, save for migrating all msgids to ASCII or everything to UTF-8.
> Both of which are things we would rather avoid.

I've mentioned two other possible workarounds:
  - (in the previous mail) The Gnulib 'propername' module (see the sketch
    below).
  - (above) A wrapper around the gettext() function.
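
For the 'propername' approach, a rough sketch of what a call site could look
like (assuming the Gnulib module is imported into the project; the surrounding
text is just an example):

  #include "propername.h"   /* from Gnulib */
  #include <stdio.h>

  static void
  print_copyright (void)
  {
    /* proper_name_utf8() takes the name in ASCII and in UTF-8 and returns
       it converted to the locale encoding, falling back to the ASCII form
       when a conversion is not possible.  */
    printf ("Copyright (C) 1985-2021 %s\n",
            proper_name_utf8 ("Joerg Schilling", "J\303\266rg Schilling"));
  }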

> >   * It is possible that in a few years, all .mo files are UTF-8 encoded,
> >     i.e. the option '--no-convert' may go away. Similarly, it is possible
> >     that in a few years, xgettext assumes that input files are always
> >     in UTF-8 encoding, i.e. that the xgettext option --from-code goes away.
> 
> Thank you for notifying me of this upcoming breaking change.  It should be
> communicated more clearly as cross-encoding use of gettext was one of its
> major selling points over previous solutions, which required separate message
> catalogues for each encoding.  We will evaluate what our options are.  We
> would like to not migrate to Unicode.

"We would like to not migrate to Unicode."

You are aware that the mail you sent me was labelled as
  "Content-Type: text/plain; charset=utf-8" ?

You are aware that when you copy&paste text in an X11 GUI, it happens through
a property named UTF8_STRING, with UTF-8 encoded contents? And that this works
flawlessly? Before UTF8_STRING existed, another property was used, with
CompoundText encoded contents. With that property, copy&paste sometimes did
not work as expected, because CompoundText meant different things in different
places and/or different charset converters behaved differently.

The fewer charset conversions need to be done, the more reliable the programs
become, and the more maintainable the code can become.

> > > ()  ascii ribbon campaign - for an 8-bit clean world 
> > 
> > This is hopelessly outdated as well. The goal of an 8-bit clean world was
> > relevant from 1987 to 2000. Since 2000, the goal is to support multilingual
> > text through Unicode (for which the 8-bit clean world is a prerequisite).
> 
> Given that you seem to be taking a step back from an encoding-agnostic world
> to one that is Unicode-only, maybe it's indeed time to update it.

Yes, i18n through the "let's be encoding agnostic" approach was predominant
until ca. 2000 or 2001, because people feared that Unicode would not last and
would be replaced by something else within a few years.
But then, people (me included :-)) started popularizing the "i18n through
Unicode" approach, and it is successful because
  - it is simpler in the code — no charset conversion in many places,
  - the Unicode consortium does a good job of responding to complaints and
    new feature requests (from Han unification mitigation to emoji).

Bruno





