Re: [PATCHes] Add basic multibyte charset handling to makeinfo


From: Miloslav Trmac
Subject: Re: [PATCHes] Add basic multibyte charset handling to makeinfo
Date: Fri, 08 Dec 2006 17:12:39 +0100
User-agent: Thunderbird 1.5.0.8 (X11/20061107)

Eli Zaretskii wrote:
>> Date: Tue, 05 Dec 2006 12:51:29 +0100
>> From: Miloslav Trmac <address@hidden>
>> CC: Karl Berry <address@hidden>, address@hidden
>> - character set names are not portable across operating systems
> Sorry, I don't think I understand.  Can you provide a few examples of
> such non-portability?
First, there is simply no standard for the names, so what we have can at
best be considered empirical evidence.

I don't have access to the systems, but searching the Internet suggests
e.g. HP-UX 10.x uses "iso88591", AIX uses "ISO8859-1" and "8859-15".
How is generic code supposed to form a locale name with a specific charset?
What if LANG=ja_JP and @documentencoding is iso-8859-1?

>> - even if you know that "iso-8859-1" is an acceptable character set
>>   name, that doesn't mean a locale using that character set exists.
>>   $current_locale.iso-8859-1 most likely doesn't exist.
> There should be no problem to have a data base of valid locales.
That would actually be a nightmare - glibc alone supports 644 locales,
and the set of locales is version-dependent.  Some distributions allow
users to install only a subset of the locales, so the database would not
necessarily be valid.

Anyway, the question is not "does locale $foo exist" - one can simply
use setlocale() and find out - but "does any locale with $encoding exist
on the current system, and if so, what is it called?".
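
To make the question concrete, here is a rough sketch - not code from the
patches, and the candidate spellings are invented for the example - of the
guesswork a program would have to do with setlocale():

#include <locale.h>
#include <stdio.h>

int
main (void)
{
  /* Spellings of "Japanese locale using ISO-8859-1" that various systems
     might or might not accept; there is no standard list to consult.  */
  const char *candidates[] = {
    "ja_JP.iso-8859-1", "ja_JP.iso88591", "ja_JP.ISO8859-1"
  };
  const char *found = NULL;
  size_t i;

  for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
    if (setlocale (LC_CTYPE, candidates[i]) != NULL)
      {
        found = candidates[i];
        break;
      }

  if (found != NULL)
    printf ("usable locale: %s\n", found);
  else
    printf ("no installed locale combines ja_JP with ISO-8859-1\n");
  return 0;
}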

> Finally, we could even redesign @documentencoding to require a valid
> locale name, not just an encoding name, if that would make the
> difference.
That would make the documents unportable.

>> So, if we want @documentencoding, we can't use system locales, and we
>> need a replacement that does at minimum the equivalents of mbtowc () and
>> wcwidth ().  It is completely unreasonable to implement this directly
>> inside texinfo sources, and I don't think it is really practical to make
>> texinfo dependent on some other library that provides this functionality
>> (ICU, maybe?).
> I think, given the above and what Karl said, you could simply switch
> locales dynamically, to support both the @documentencoding locale and
> the current locale for diagnostic messages from makeinfo.
The same problem again - what is the locale called?
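
For reference, the mbtowc()/wcwidth() functionality mentioned above is
roughly the following - only a sketch, assuming the document encoding
matches the locale selected with setlocale(), and not the actual patch
code:

#define _XOPEN_SOURCE 700   /* for wcwidth() */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Display width, in columns, of the multibyte string S, interpreted in
   the encoding of the current locale.  */
static int
display_width (const char *s)
{
  size_t remaining = strlen (s);
  int width = 0;

  mbtowc (NULL, NULL, 0);          /* reset any shift state */
  while (remaining > 0)
    {
      wchar_t wc;
      int len = mbtowc (&wc, s, remaining);
      if (len <= 0)
        {
          /* Invalid or incomplete sequence: count one byte as one column
             and try to resynchronize.  */
          width++;
          s++;
          remaining--;
          mbtowc (NULL, NULL, 0);
          continue;
        }
      int w = wcwidth (wc);
      width += (w > 0 ? w : 0);
      s += len;
      remaining -= len;
    }
  return width;
}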

>> The UNIX world basically assumes a single system-wide character set (a
>> single character set must be used for the names in the filesystem, at
>> least);  while technically possible, adding character set indication to
>> every text file format and character set conversion to every program
>> using the file format is not practical: it is too much work, it adds
>> confusing failure modes and it breaks the traditional text manipulation
>> tool usage.
> 
> We are not talking about such a large change (although Emacs and the
> modern Unicode-based editors already allow you to manipulate
> multilanguage texts).
It is a rather large change.  Emacs is so large that any such feature is
comparatively small, but that's not really the case with texinfo.
Besides, the information should still be centralized in libc and not
maintained separately in each application.


We could somewhat work around the problem of locale names by using
iconv() to convert the input document to the character set specified by
nl_langinfo (CODESET).  This has three problems (a sketch of the
conversion follows the list):
- iconv character set names are, again, not standardized, so the
  @documentencoding parameter would probably not be portable.
  We can sort of solve that by mandating the use of names supported by
  glibc and GNU libiconv - although this would still mean that makeinfo
  would use its own locale-specific data instead of using the libc data.
- this can lead to information loss: with e.g.
    (LC_ALL=en_US.iso-8859-1 makeinfo ...)
  we can't preserve the non-ISO-8859-1 characters.
- this still keeps makeinfo output dependent on the locale it is started
  in.
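
A minimal sketch of that work-around - hypothetical code, not part of the
patches; real code would have to grow the output buffer on E2BIG and decide
what to do about unconvertible characters - could look like this:

#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>

/* Convert IN (IN_LEFT bytes, encoded in FROM_CHARSET, e.g. the value of
   @documentencoding) into OUT (OUT_LEFT bytes), using the charset of the
   current locale; the caller must have called setlocale (LC_ALL, "")
   already.  Returns 0 on success, -1 on failure.  */
static int
convert_to_locale_charset (const char *from_charset,
                           char *in, size_t in_left,
                           char *out, size_t out_left)
{
  iconv_t cd = iconv_open (nl_langinfo (CODESET), from_charset);
  if (cd == (iconv_t) -1)
    return -1;                     /* this iconv does not know the name */

  if (iconv (cd, &in, &in_left, &out, &out_left) == (size_t) -1)
    {
      iconv_close (cd);
      return -1;                   /* unconvertible character, or output full */
    }
  iconv_close (cd);

  if (out_left == 0)
    return -1;                     /* no room for the terminating NUL */
  *out = '\0';
  return 0;
}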

> More generally, I'm deeply disturbed to hear in the year 2006
> arguments that in effect say that m17n and i18n are not needed, since
> l10n is good enough.  That was the idea 10 years ago; a lot of water
> went under the bridge since then, and I thought we moved on...
Yes, we did move on - from a world of hundreds of separate character
sets to Unicode.  It is simply the most practical technical solution to
the problem.

IMHO "i18n" means "the program can be adapted to run in any local
environment", not "the program includes a layer that supports some fixed
set of local environments".


The patches assume a single-character-set system, and degrade gracefully
when this does not hold.  Thus, e.g.:
* ISO-8859-x users
  - can format and view info files in ISO-8859-x
  - if they format info files using UTF-8, the result will be readable,
    but badly formatted, as it always was
  - if they view info files using UTF-8, they see the same garbage they
    have always seen
* UTF-8 users
  - can format and view info files in UTF-8; the formatted result will
    be correctly formatted, which it wasn't before
  - if they format info files using ISO-8859-x, the result will be
    readable and probably correctly formatted, as it always was
  - if they view info files using ISO-8859-x, they see the same garbage
    they have always seen.


The patch is a net win for UTF-8 users and should be neutral for other
users (or positive, if they use a non-stateful multibyte encoding).  I
don't think there is any real reason to add substantial support for
non-UTF-8 character sets nowadays if the support wasn't necessary before.
        Mirek



