Re: MML charset tag regression

From: Stephen J. Turnbull
Subject: Re: MML charset tag regression
Date: Tue, 29 Apr 2003 16:12:12 +0900
User-agent: Gnus/5.090016 (Oort Gnus v0.16) XEmacs/21.5 (cabbage)

>>>>> "Simon" == Simon Josefsson <address@hidden> writes:

    Simon> Emacs behaves different from xterm, gnome-terminal, gedit,
    Simon> etc though.

The X protocol is designed so that clients with different needs/wants
can negotiate the best available transfer.

    Simon> Is this a bug in that client?

Yes.  We lose.

    Simon> Or maybe emacs can detect that the TEXT request failed?  Is
    Simon> "?????"  some magic string emacs can test for?

No.  Heuristic, yes.  Standard or wide-spread practice, no.
Unfortunately, a failed request should return a failure indication,
and no data, not some bogus data.  Apparently these clients fail to do
that correctly.

The big problem with TEXT is that it gives the requestor no way to
negotiate content.  TEXT is simply whatever the selection owner
chooses to spew; you'd better be able to handle it.

Emacs should avoid asking for TEXT.  The algorithm should be

0. Ask for TARGETS.  A proper client will be able to tell you what it
   supports.  (We may be able to cache this information, and avoid
   X protocol round-trips.)  In steps 1-4 below, qualify with "unless
   known to be unavailable."
1. Ask for UTF8_STRING or COMPOUND_TEXT first.  Default to
   UTF8_STRING, but there should be a user option to start with
   COMPOUND_TEXT (the Unihan disambiguation problem).
2. Ask for the other universal encoding.
3. Ask for STRING (ISO 8859/1, if that is not known to be unacceptable).
4. Ask for Heaven's intercession, and TEXT.
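
A minimal sketch of steps 0-4, written in Python rather than Emacs Lisp
so it can be run standalone.  The `convert` callable is hypothetical: it
stands in for a selection request and is assumed to return None when the
owner refuses a target.  None of these names come from Emacs itself.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

UNIVERSAL = ("UTF8_STRING", "COMPOUND_TEXT")


def request_order(prefer: str = "UTF8_STRING",
                  advertised: Optional[Iterable[str]] = None) -> List[str]:
    """Return the text targets in the order steps 1-4 would try them."""
    if prefer not in UNIVERSAL:
        raise ValueError("prefer must be UTF8_STRING or COMPOUND_TEXT")
    other = UNIVERSAL[1] if prefer == UNIVERSAL[0] else UNIVERSAL[0]
    order = [prefer, other, "STRING", "TEXT"]
    if advertised is not None:
        # "Unless known to be unavailable": skip what TARGETS did not list.
        known = set(advertised)
        order = [t for t in order if t in known]
    return order


def fetch_text(convert: Callable[[str], Any],
               prefer: str = "UTF8_STRING") -> Optional[Tuple[str, Any]]:
    """Step 0, then steps 1-4; return (target, data) or None."""
    advertised = convert("TARGETS")  # None if the owner cannot answer
    for target in request_order(prefer, advertised):
        data = convert(target)
        if data is not None:
            return target, data
    return None


# Hypothetical selection owner modelled as a dict: target name -> reply.
owner = {
    "TARGETS": ["TARGETS", "COMPOUND_TEXT", "STRING"],
    "COMPOUND_TEXT": b"compound",
    "STRING": b"latin-1",
}
print(fetch_text(owner.get))
```

With an owner that does not answer TARGETS at all, the loop simply walks
the full list and ends at TEXT.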

(Now I see why UTF8_STRING is a good thing; even though the _sender_
can use COMPOUND_TEXT to send UTF-8 reliably, requesting COMPOUND_TEXT
doesn't restrict the sender to UTF-8.)

    Simon> Unless there is some well-agreed on non-controversial
    Simon> recommendation on how internationalized X11 cut'n'paste
    Simon> should work, all attempts to get a complete system working
    Simon> seems futile.

I don't see why the above should be controversial, except that there's
the Unihan political issue, and some Asian language users would want
the factory default to be COMPOUND_TEXT in Han-using locales.

To deal with broken clients, it might be best to have the above
algorithm implemented as a Lisp list containing targets in order of
desirability.  Then if a client is known to send junk when
COMPOUND_TEXT is requested, you can simply not request it.  This might
also allow the selection request function to be flexibly used.  (Eg, if the
selection contains an image, you could prepend (PIXMAP POSTSCRIPT) to
the list of text targets, where presumably the text targets would get
the ALT string from HTML or a tooltip from a toolbar button, etc.  To
get a file name, you could prepend (FILE) (the problem with the text
targets is that they might be interpreted as "send me the file
contents").  And so on.)
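
The list-driven variant could look roughly like the following, again as
a Python stand-in for a Lisp list.  The function name and its parameters
are made up for illustration; only the target names come from the text
above.

```python
from typing import Iterable, List

# Default text targets, most desirable first.
TEXT_TARGETS = ["UTF8_STRING", "COMPOUND_TEXT", "STRING", "TEXT"]


def build_request_list(base: Iterable[str] = TEXT_TARGETS,
                       prepend: Iterable[str] = (),
                       known_bad: Iterable[str] = ()) -> List[str]:
    """Combine extra targets with the base list, dropping known-bad ones.

    Order is preserved and duplicates are removed, so the result can be
    handed straight to whatever loop issues the requests.
    """
    bad = set(known_bad)
    result: List[str] = []
    for target in list(prepend) + list(base):
        if target not in bad and target not in result:
            result.append(target)
    return result


# A client known to send junk for COMPOUND_TEXT:
print(build_request_list(known_bad=["COMPOUND_TEXT"]))
# An image selection, falling back to text targets:
print(build_request_list(prepend=["PIXMAP", "POSTSCRIPT"]))
# Asking for a file name rather than file contents:
print(build_request_list(prepend=["FILE"]))
```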

By having a cache of windows we've gotten stuff from, we could (1)
avoid round-trips to get the TARGETS list, and (2) keep a record of
targets that give undesired results, etc.
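
One possible shape for such a cache, keyed by the owner's window ID.
This is only a sketch: the class and method names are invented here,
and since X window IDs can be reused after a window is destroyed, a
real cache would need some invalidation policy (`forget` below is the
hook for that).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class OwnerRecord:
    targets: Optional[Set[str]] = None          # None: TARGETS not known yet
    bad: Set[str] = field(default_factory=set)  # targets that returned junk


class SelectionCache:
    def __init__(self) -> None:
        self._by_window: Dict[int, OwnerRecord] = {}

    def _record(self, window: int) -> OwnerRecord:
        return self._by_window.setdefault(window, OwnerRecord())

    def remember_targets(self, window: int, targets: Iterable[str]) -> None:
        self._record(window).targets = set(targets)

    def mark_bad(self, window: int, target: str) -> None:
        self._record(window).bad.add(target)

    def needs_targets_query(self, window: int) -> bool:
        """True if a TARGETS round-trip is still required for this owner."""
        rec = self._by_window.get(window)
        return rec is None or rec.targets is None

    def candidates(self, window: int, order: Iterable[str]) -> List[str]:
        """Filter a preference list by what is known about this owner."""
        rec = self._by_window.get(window, OwnerRecord())
        return [t for t in order
                if t not in rec.bad
                and (rec.targets is None or t in rec.targets)]

    def forget(self, window: int) -> None:
        self._by_window.pop(window, None)


cache = SelectionCache()
win = 0x400001  # arbitrary example window ID
cache.remember_targets(win, ["TARGETS", "UTF8_STRING", "COMPOUND_TEXT", "STRING"])
cache.mark_bad(win, "COMPOUND_TEXT")
print(cache.candidates(win, ["UTF8_STRING", "COMPOUND_TEXT", "STRING", "TEXT"]))
```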

    Simon> Galeon uses GTK2 and obviously it doesn't produce a good

Depends on what you mean by "good."  This method guarantees that a
font capable of displaying the text is available in the standard X
distribution (ISTR that ISO 8859/5 fonts appeared well after Japanese
fonts in X, and I doubt that X distributes KOI8 fonts at all, although
they're easily available).

    >> The new encoding method using "Non-Standard Character Set
    >> Encodings" of COMPOUND_TEXT makes the cyrillic case much more
    >> complicated.  In some case (perhaps only in KOI8 locale), X
    >> clients recently start to encode cyrillic characters in "ESC %
    >> / 0 ...".  They don't consider the situation that the requester
    >> is running in a different locale.  :-(

I don't understand the problem.  As long as the extended segment is
properly formed, you know it's KOI8.  How is this different from TEXT?
The extended segment is much better than the alternative I've seen,
which is sending non-Latin-1 text as STRING!
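
To illustrate why a well-formed extended segment identifies its own
encoding, here is a small parser sketch.  The layout assumed is the one
given in the X Consortium Compound Text Encoding document as I recall
it: ESC % / F M L, then the encoding name, STX, and the data, where the
length (M-128)*128 + (L-128) covers name, STX and data.  Check the
specification before relying on the details; the function name is
hypothetical.

```python
from typing import Tuple


def parse_extended_segment(buf: bytes, pos: int = 0) -> Tuple[str, int, bytes, int]:
    """Parse one extended segment starting at buf[pos].

    Returns (encoding_name, octets_per_char, data, next_pos).
    octets_per_char is 0 when the segment declares a variable width.
    """
    if len(buf) < pos + 6 or buf[pos:pos + 3] != b"\x1b%/":
        raise ValueError("not an extended segment")
    f = buf[pos + 3]
    if not 0x30 <= f <= 0x34:
        raise ValueError("unexpected final byte")
    m, l = buf[pos + 4], buf[pos + 5]
    if m < 0x80 or l < 0x80:
        raise ValueError("length octets must have the high bit set")
    length = (m - 128) * 128 + (l - 128)
    start, end = pos + 6, pos + 6 + length
    if end > len(buf):
        raise ValueError("segment is truncated")
    # The encoding name is separated from the data by STX (0x02).
    name, sep, data = buf[start:end].partition(b"\x02")
    if not sep:
        raise ValueError("missing STX after encoding name")
    return name.decode("latin-1"), f - 0x30, data, end


# Build a sample segment carrying KOI8-R text and parse it back.
payload = "Привет".encode("koi8_r")
body = b"koi8-r\x02" + payload
segment = b"\x1b%/1" + bytes([0x80 + len(body) // 128, 0x80 + len(body) % 128]) + body
name, width, data, nxt = parse_extended_segment(segment)
print(name, width, data.decode("koi8_r"))
```

Because the name travels with the data, the requester does not need to
guess the sender's locale, which is exactly what TEXT cannot offer.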

    Simon> Do you mean the client sends data in a locale-specific
    Simon> charset via COMPOUND_TEXT?  Ouch.

COMPOUND_TEXT _is_ basically locale-specific.  It's a modal ISO 2022
encoding.  The only semantic difference between the usual escape
sequence and the extended segment used for UTF-8 and KOI8 is that
extended segments can be used for not-yet-standardized encodings that
don't have an ISO-registered final byte.  The method is actually
better than that for the standard encodings since it includes a length

    >> Perhaps, we should make Emacs to request UTF8_STRING at first
    >> if the locale is UTF8, and if that request fails, request

    Simon> This sounds like a good idea to me.

Locales are just plain broken for this purpose.  As Handa-san points
out, you have no idea what locale the partner is running in.  Our own
locale is the best heuristic for Emacs if the partner is unwilling to
talk about it, but really we need clients that implement a proper
negotiation protocol.  I'm regularly running clients in three separate
locales simultaneously on the same host (POSIX, ja_JP.eucJP,
en_US.utf8).  I imagine many Europeans are in a similar situation.
(And I haven't even started to talk about my development/testing

I think that we should start by being "selfish", ie, think about what
form of text Emacs is best prepared to use, and request that.  I would
say _always_ request UTF8_STRING unless we have reason to believe the
sender can't do it (eg, previously failed) or our user would prefer
COMPOUND_TEXT (eg, that fraction of Han users).  (I'm thinking in
terms of emacs-unicode, obviously.)

Also, a related topic, I think that we should think carefully about
canonicalizing variant codes (such as "full-width" Latin or Cyrillic
characters).  For example, I'm pretty careful about the aesthetics of
half-width and full-width characters in my Japanese mail, but my
colleagues no longer are (in fact, I once received a mail in which the
4 digits of the year were in three different encodings! JIS X 0201,
JIS X 0208, and ASCII).  When I investigated this curiosity, what I
found is that on most Windows and Mac systems the full- and half-width
variants are visually hard to distinguish, and the JIS Roman and ASCII
characters are the identical glyph with different indices in the Cmap
going to the same CID.

Of course such canonicalization needs to be user-controllable, but I
doubt most users will even notice if we default to canonicalization.
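
A user-controllable canonicalization could be as small as the sketch
below.  It works on Unicode text after decoding and folds only the
full-width ASCII variants (U+FF01 through U+FF5E) to their ASCII
counterparts; that scope, and the `enabled` switch standing in for a
user option, are assumptions made for the example.  Half-width katakana
and other variants are left alone.

```python
# Full-width ASCII variants sit at a fixed offset of 0xFEE0 from ASCII.
_FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def canonicalize_width(text: str, enabled: bool = True) -> str:
    """Fold full-width Latin letters, digits and punctuation to ASCII."""
    if not enabled:
        return text
    return text.translate(_FULLWIDTH_TO_ASCII)


mixed_year = "２0０３年"  # full-width and ASCII digits mixed in one year
print(canonicalize_width(mixed_year))
print(canonicalize_width(mixed_year, enabled=False))
```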

Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
