Re: Emacs Lisp's future

From: Stephen J. Turnbull
Subject: Re: Emacs Lisp's future
Date: Wed, 15 Oct 2014 12:07:39 +0900

Paul Eggert writes:
 > On 10/14/2014 12:03 AM, Stephen J. Turnbull wrote:
 > > in the Emacs tree "grep -r"
 > > is probably just a bug.
 > Although "grep -r" doesn't conform to POSIX, it is a handy GNU extension, 

It's not a question of conformance, it's a question of GIGO.  As you
yourself know:

 > grep works reasonably well even with text files in the "wrong" encoding, 
 > and even with non-text files.  I don't expect grep to match UTF-8 
 > patterns to the corresponding EUC-JP text, because I know it doesn't 
 > translate.

Oh, so you intentionally chose an example where you know it works, and
published that on a public mailing list, without warning the kids not
to try it at home?  Do you realize that although all Japanese computer
users occasionally experience mojibake, only a few understand the
mechanism and its implications for "simple" operations like grep?  I
suppose that goes in spades for the Chinese.  Consider searching for
元気 to find HELLO, "knowing" that Emacs uses the UTF-8 encoding!
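A minimal Python sketch of the mechanism, assuming a UTF-8 search pattern and a hypothetical line of text stored once as UTF-8 and once as EUC-JP. It models grep as a plain byte-substring search with no transcoding, which is a simplification:

```python
# Model grep as a byte-level substring search: the pattern bytes are
# compared against the file bytes with no transcoding in between.
pattern = "元気"
line = "お元気ですか"          # hypothetical file content

pattern_utf8 = pattern.encode("utf-8")   # what a UTF-8 locale hands to grep
line_utf8 = line.encode("utf-8")         # file saved as UTF-8
line_eucjp = line.encode("euc_jp")       # same text saved as EUC-JP

found_in_utf8_file = pattern_utf8 in line_utf8
found_in_eucjp_file = pattern_utf8 in line_eucjp
found_after_transcoding = pattern.encode("euc_jp") in line_eucjp

print(found_in_utf8_file, found_in_eucjp_file, found_after_transcoding)
```

The search only succeeds when pattern and file agree on the encoding; against the EUC-JP bytes the UTF-8 pattern silently finds nothing, with no error to hint that anything went wrong.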

 > Emacs's M-x grep command supports this usage well, and I don't see how 
 > it would be an improvement to call this usage a "bug" or for the Emacs 
 > (or grep) default to insist on strict coding correctness here.

Ah, so you've never lived anywhere but Kansas, Dorothy?  There are 1.5
billion[1] Asians who disagree that "grep -r しまった" is well-
supported by Emacs or grep in an environment with multiple encodings,
which is most of them (except where they've consciously instituted a
program of converting legacy documents to a common encoding).  That's
why the "Japanese patch" is also "the patch that would not die".

But that patch is not in any mainline program that I know of, because
accurate auto-detection requires knowledge of the target language so
it doesn't generalize (the "Japanese patch" assumes that the language
is Japanese, so it must be facing ISO-2022-JP, Shift-JIS, or EUC-JP,
and relatively recent versions added UTF-8 and BOM detection to that).
The patch is not able to distinguish EUC-JP from EUC-CN, for example,
in typical use where the designation of character sets to registers
is implicit.  (Distinguishing Shift-JIS from Big5 is highly but not
100% reliable, and of course distinguishing the language variants of
ISO-2022-7 is trivial because the control sequences specify character
sets to be installed in the GL register.)
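A small Python sketch of why the EUC case is ambiguous while the ISO-2022 case is not. It assumes Python's "gb2312" codec as a stand-in for EUC-CN, and the sample string is only an illustration:

```python
text = "日本"

# EUC-JP and EUC-CN share the same two-byte structure, and nothing in the
# byte stream says which character set is meant. Bytes written as EUC-JP
# therefore also decode without error as EUC-CN, just to other characters.
euc_bytes = text.encode("euc_jp")
as_japanese = euc_bytes.decode("euc_jp")
as_chinese = euc_bytes.decode("gb2312")   # no exception is raised

# ISO-2022-JP states the character set explicitly with an escape sequence.
iso_bytes = text.encode("iso2022_jp")
announces_jisx0208 = iso_bytes.startswith(b"\x1b$B")  # ESC $ B = JIS X 0208

print(as_japanese == text, as_chinese == text, announces_jisx0208)
```

Both EUC decodes succeed, so a detector cannot choose between them from validity alone; it needs outside knowledge of the language, whereas the ISO-2022 stream carries that information itself.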

 > Eli is correct that UTF-8 is the encoding typically used for text
 > in the Emacs source code.  For more about this, please see "Source
 > file encoding" in admin/notes/unicode.

XEmacs made that decision in 1998 (only using ISO-2022-JP).  I know
how this works.  The only difference between us is that I live in
Tsukuba, and I've spoken to Handa and Tomita inter alia about these
issues over beers (in Japanese as well as in English), and I've read
the extremist anti-Unicode tirades (in Japanese).  I don't know *why*
Dr. Handa sides with those maniacs (they claim that JIS incorporates a
mystic Yamato-damashii = "authentic Japanese spirit") although I
believe it's out of a genuine desire to support multiculturalism (via
his specialty of developing multilingual software).

However, like the Japanese patch, detecting culture and choosing font
for the same repertoire via encoding is a limited technique.  It only
works well for Han-using languages.  For example, the northern
European countries have different notions about positioning of
accents, which is apparently noticeable to non-native speakers with
umlauts.  I suspect (though I haven't asked and don't have time to
search the library for worldwide newspapers) that the various
English-speaking cultures, the French, the Spanish, the Italians, and
the Germans have different notions of what constitutes readable or
beautiful typography -- it's definitely the case that the ASCII
characters in Japanese fonts "look Japanese" (to me, anyway).  But
good luck choosing fonts based on distinguishing ISO-8859-1 from
ISO-8859-1! :-)

Dr. Handa's approach to multiculturalism, then, is fundamentally
different from that of the engineers and scholars who have evolved
Unicode (more precisely, universal coded character sets and the
related encoding mechanisms) over the last 30 years or so, not to
forget the W3C which has concluded that (as long as conventional
glyphs are available for the character repertoire) font choice is
purely a presentation issue, and should be handled by markup.  Unicode
has even deprecated the use of "language tag" characters.  They do
remain in the repertoire, so they could be used to deal with the issues
we are discussing.


Note that the language tags are isomorphic to control sequences as
used in ISO 2022 (except that being encoded in a block disjoint from
graphic characters, they're harder to screw up), so they introduce no
text handling issues for Emacs not already present in encodings using
ISO 2022 extension techniques.
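A sketch in Python of what such a tag looks like, assuming the Unicode encoding of language tags as U+E0001 LANGUAGE TAG followed by tag characters in U+E0020..U+E007E that mirror ASCII. The helper names are hypothetical and no validation of the tag syntax is attempted:

```python
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000          # tag characters U+E0020..U+E007E mirror ASCII


def make_language_tag(lang):
    """Build a language tag prefix such as the one for 'ja' (hypothetical helper)."""
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def split_language_tag(s):
    """Return (lang, rest) if s starts with a language tag, else (None, s)."""
    if not s.startswith(LANGUAGE_TAG):
        return None, s
    i = 1
    while i < len(s) and 0xE0020 <= ord(s[i]) <= 0xE007E:
        i += 1
    lang = "".join(chr(ord(c) - TAG_OFFSET) for c in s[1:i])
    return lang, s[i:]


tagged = make_language_tag("ja") + "元気"
print(split_language_tag(tagged))
```

Because the tag characters live in their own block, stripping or reading them never touches the graphic characters, which is the sense in which they are harder to get wrong than an ISO 2022 escape sequence.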

So there you have it.  There is *no barrier* to converting *all* files
to conformant UTF-8, except a couple hours' hacking to make
`help-with-tutorial' and `view-hello-file' recognize language tags.[2]
It might be preferable to use a different approach, more conformant to
the Unicode/W3C party line, though.

Thank you for your persistence.  This discussion will greatly inform
my future work in XEmacs.  (I'm done discussing the issue for Emacs,
because I don't expect Dr. Handa -- who is more expert than I -- to
change his approach after all these years.  This is all just IMHO FWIW
YMMV -- and I suspect Dr. Handa counts his "mileage" in kilometers. ;-)

[1]  I don't know about Indic languages.  I'm under the impression
that these days they almost universally use Unicode in preference to
ISCII and such-like, so they may not have the issue.  If that is
incorrect, then you can make that 2.5 billion Asians.

[2]  Note that the limitation of the hack to those functions only is
consistent with the Unicode-recommended usage of language tags.
