Re: I18N/M17N?

From: Jim Blandy
Subject: Re: I18N/M17N?
Date: 09 May 2001 16:47:33 -0500

Masao Uebayashi <address@hidden> writes:
> Hello, and sorry for tardiness.
> > > Just a curiosity, but is there any plan for Guile to be I18N'ed, or
> > > better M17N'ed? Or importing MULE features from Emacs?
> > 
> > I was supposed to work on this, but I don't spend time on Guile any
> > more (I need to resign as a maintainer), so I encourage anyone who's
> > interested to pick up the ball here.  Read guile-core/doc/mbapi.texi.
> > There's even some code, on a branch called jimb_mb_branch_1.
> I'm not familiar with this area either, but this looks nice to
> me. At least, I feel a bit of happiness in seeing that Guile is not
> destined for UTF-8 the way Perl 6 and Python 2 are.

The last time I spoke with Handa-san (the fellow who designed and
maintains the multilingual support in GNU Emacs), he felt it was
likely that GNU Emacs would convert to UTF-8 internally, and use some
application-specific range of the 32-bit character space to
distinguish Chinese and Japanese characters.

Unicode is (or was) controversial in Japan.  For the benefit of other
readers, I'll summarize my understanding of the conflict.  Since I
don't actually use these character sets and encodings myself, I've
almost certainly misunderstood some things, and probably even have
some of the basic facts wrong, but I think I can get the basic idea
across.

The encodings currently in widespread use in Japan are, to my eyes,
rather complicated.  The encodings use three- or four-byte escape
sequences to switch between a single-byte-per-character encoding,
basically ASCII, and various two-byte-per-character encodings.  In the
two-byte-per-character stretches, the bytes in each pair must have
values between 33 and 126, so one does not find random control
characters in valid sequences of two-byte characters.  Each pair can
select one from (expt (1+ (- 126 33)) 2) => 8836 possible characters.
I think the two-byte stretches aren't allowed to cross line
boundaries, so tools like `grep' won't shred your files by separating
start escapes from their matching end escapes.  But it's still a
stateful encoding: in order to know how to interpret a given byte, you
need to scan from the beginning of the line --- a typically small (but
unbounded) distance.

This probably seems really hairy to most American and European
programmers, accustomed to the luxurious simplicity of single-byte
character sets.  But the Japanese programmers I have spoken to (a
dozen at most, between 1994 and 1999) are perfectly familiar and
comfortable with the encodings described above, having used them all
their lives.

One nice side effect of that encoding is that one can use different
escape sequences to select different character sets for stretches of
two-byte-per-character text.  For example, "\033$B" selects the
JIS-X-0208-1983 character set (most commonly-used Japanese
characters), while "\033$(A" selects GB 2312-80, a Chinese character
set.  The whole arrangement is inherently multilingual --- you can
drop in any characters you like simply by inventing new escape
sequences.
So, essentially, this means that all Japanese programmers are
accustomed to having text indicate not only the characters, but also
the *language* those characters represent.  In particular, they feel
it is important that the encoding distinguish between Chinese text and
Japanese text.  Now, they all agree that Chinese and Japanese use the
same characters.  When speaking in English, Japanese programmers refer
to the characters they use in their own names and in everyday writing
as "Chinese characters".  (In fact, I think the Japanese word "Kanji"
actually means "Chinese characters" --- but I am very unreliable on
questions like that.)  A friend of mine in Kyoto compared the Japanese
vs. Chinese situation to the French vs. English situation: certainly
the English word "car" and the French word "car" ("because") are
different words, but everyone agrees they're the same three letters.

Unicode, of course, does not preserve this distinction.  If you
transliterate a sequence of Japanese text encoded as described above
that uses both JIS-0208 and GB 2312-80 into Unicode, and then
translate it back, you'll lose information about which stretches used
which encoding.  You can imagine that it might be difficult to walk up
to someone who already has a complete set of tools which handles
this stuff correctly and persuade him to abandon them for a completely
different set of tools that doesn't.
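That information loss is easy to demonstrate with Python's standard
codecs, taking a character that both JIS X 0208 and GB 2312-80 can
represent:

```python
jis = "中".encode("iso2022_jp")   # JIS X 0208 byte sequence
gb  = "中".encode("gb2312")       # GB 2312-80 byte sequence

assert jis != gb                  # the encoded forms differ, but...
assert jis.decode("iso2022_jp") == gb.decode("gb2312") == "中"
# ...both decode to the single code point U+4E2D, so re-encoding
# cannot recover which character set the text originally used.
```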

The encoding currently used in GNU Emacs preserves the language
distinctions.  Each character encodes a character set, and a point
within that character set.  They use a nicer encoding than the one
described above (for example, it's stateless), but it's basically just
a better representation for the same information.
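As a toy illustration of the character-set-and-character idea (this
is not Emacs's actual MULE byte layout, and the code values below are
made up for the example), one can pack a charset tag and an in-set
code into a single integer, so that the "same" character from two
character sets compares unequal:

```python
CHARSETS = {"ascii": 0, "jisx0208": 1, "gb2312": 2}

def make_char(charset: str, code: int) -> int:
    # charset id in the high bits, code within the set in the low bits
    return (CHARSETS[charset] << 16) | code

# illustrative in-set codes, not real JIS/GB values
jis_char = make_char("jisx0208", 0x2170)
gb_char  = make_char("gb2312",  0x2170)

assert jis_char != gb_char            # distinguished, unlike in Unicode
assert jis_char & 0xFFFF == 0x2170    # the in-set code is recoverable
assert jis_char >> 16 == CHARSETS["jisx0208"]
```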

However, Handa-san certainly acknowledges that the rest of the world
has settled on Unicode, and feels that Emacs should tend in the same
direction.  Emacs's character-set-and-character approach has
limitations (e.g., it distinguishes characters which it shouldn't)
that Unicode would help solve.  However, since there is such a strong
sentiment in Japan that Chinese and Japanese text should be
distinguished, he wants Emacs to use a variant of UTF-8 internally
that does make this distinction.
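One way to picture such a variant (a sketch of the general idea only;
this email doesn't specify Handa-san's actual scheme) is to leave
ordinary code points alone and shift characters that must be marked
as Chinese into an otherwise unused region of the 32-bit character
space, above Unicode's 0x10FFFF maximum:

```python
CHINESE_BASE = 0x40_0000   # hypothetical application-specific offset

def tag_chinese(cp: int) -> int:
    """Mark a Han code point as Chinese rather than Japanese."""
    return CHINESE_BASE + cp

def untag(cp: int) -> int:
    """Strip the tag when converting back out to stock Unicode."""
    return cp - CHINESE_BASE if cp >= CHINESE_BASE else cp

zh = tag_chinese(0x4E2D)      # U+4E2D tagged as Chinese
assert zh != 0x4E2D           # internally distinct from the Japanese use...
assert untag(zh) == 0x4E2D    # ...but convertible back for files and networks
```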

So it's *almost* Unicode.  Code that properly handles multilingual
text will work properly with his almost-Unicode.  And Emacs is always
ready and willing to convert text coming in and going out, so it
shouldn't affect what appears on people's disks or goes over networks.
What it does hurt is the ability for C code linked into Emacs to use
stock tools for processing multilingual text in memory.

So, the plan was to use the Emacs / MULE encoding in Guile until Emacs
itself switches to this UTF-8-that-distinguishes-Chinese-and-Japanese,
at which point Guile would switch too.
