Re: Case mapping of sharp s

From: Stephen J. Turnbull
Subject: Re: Case mapping of sharp s
Date: Sun, 22 Nov 2009 02:40:09 +0900

David Kastrup writes:
 > Richard Stallman <address@hidden> writes:
 > > I don't think the design of MULE was an error in the 1990s.

Of course it was, at least as applied to the ISO 8859 family of
scripts.  In fact the ISO 8859 standard makes plain that characters
with the same name are identical across the ISO 8859 family.
Distinguishing (make-char 'latin-iso8859-1 32) from (make-char
'latin-iso8859-15 32) was a mistake, and it caused a lot of pain for
users and developers.

I agree that in Japan the design was plausible in the early 90s.  In
hindsight, I think it was an unfortunate choice, though.  It would
have been better for the Mule Lab (which has a fair amount of prestige
in this country) to lead the way toward open, universal standards by
working out the difficulties of dealing with multilingual text written
in a Unihan script (i.e., Unicode).  In the end, internationalized
encodings based on ISO 2022 extension techniques (such as TRON code
and Mule code) are all dead (except for ISO-2022-JP, still commonly
used in email), but Shift JIS remains in wide use, with only Unicode
gaining share.

 > I think that the design of utf-8 that makes character starts
 > immediately recognizable without the need for rescanning or
 > synchronization has been an excellent idea.  MULE coding lacks this
 > feature.

It does not lack that feature: C0 and GL codes are ASCII (one byte
characters), C1 codes are leading bytes, and GR codes are trailing
bytes.  I.e., all bytes less than 160 are character starters.  AFAIK,
Mule code developed this feature at about the same time that FSS-UTF
was invented (Mule development started in mid-1991, and the earliest
reference I can find to FSS-UTF is Ken Thompson's fss-utf.c dated
1992).  You'd have to ask Ken'ichi Handa for the exact date and
whether he was aware of FSS-UTF and such techniques when the Mule
encoding was designed.
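The self-synchronization property described above can be sketched in a few lines.  This is an illustrative model, not the actual Emacs source: it assumes only the byte ranges given in the paragraph (C0 = 0x00-0x1F and GL = 0x20-0x7F are one-byte ASCII characters, C1 = 0x80-0x9F are leading bytes, GR = 0xA0-0xFF are trailing bytes), and the function names are made up for the example.

```python
def mule_starts_char(byte: int) -> bool:
    """True if this byte begins a character in (old-style) Mule internal code.

    C0/GL bytes are one-byte ASCII characters and C1 bytes lead a
    multibyte character; only GR bytes (0xA0 and up) trail.  So a byte
    starts a character exactly when it is below 160 (0xA0).
    """
    return byte < 0xA0


def resync(buf: bytes, pos: int) -> int:
    """Find the first character boundary at or after pos.

    Because trailing bytes are recognizable in isolation, a reader
    dropped into the middle of a buffer recovers by scanning forward
    past GR bytes -- no rescanning from the start of the buffer.
    """
    while pos < len(buf) and not mule_starts_char(buf[pos]):
        pos += 1
    return pos
```

A reader landing mid-character (e.g. on a trailing byte) simply skips forward to the next starter, which is exactly the property UTF-8 also provides via its distinct continuation-byte range.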

UTF-8 doesn't really have any algorithmic string-processing advantages
over Mule code.  Even the fact that you can compute the length of a
character algorithmically from a UTF-8 leading byte is unimportant,
since it's much more efficient to use a table lookup for that.  The
big advantage of UTF-8 is that it's based on Unicode, so characters
that never should have been distinguished in the first place don't
have to be reidentified in Lisp.  Not to mention all of the useful
character data and the bidi algorithm, etc.
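The point about length computation can be made concrete.  Both approaches below are standard UTF-8 decoding techniques (not drawn from the Emacs source); the algorithmic version reads the sequence length off the leading byte's bit pattern, and the table simply precomputes the same answer for all 256 byte values so the hot path is a single indexed load.

```python
def utf8_len_algorithmic(lead: int) -> int:
    """Sequence length from the leading byte's high bits.

    0xxxxxxx -> 1 byte (ASCII), 110xxxxx -> 2, 1110xxxx -> 3,
    11110xxx -> 4.  A 10xxxxxx byte is a continuation byte, not a
    starter, so we report 0 for it.
    """
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0   # continuation byte
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

# The table-lookup variant: trade 256 bytes of memory for a branch-free
# single indexed load per character.
UTF8_LEN_TABLE = [utf8_len_algorithmic(b) for b in range(256)]
```

In a tight decoding loop the table wins because it replaces a comparison cascade with one load, which is the efficiency point made above.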
