bug-gnu-emacs

Re: Cut from xterm (iso-8859-{2,15}) and paste into buffer


From: Kenichi Handa
Subject: Re: Cut from xterm (iso-8859-{2,15}) and paste into buffer
Date: Mon, 19 Nov 2001 10:27:43 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.1.30 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

Karl Eichwalder <keichwa@gmx.net> writes:
> Also "compound-text", but it does not help to say:

>     C-x RET x iso-8859-15 RET

>>  Do you happen to know what exactly does Emacs get as the raw string
>>  from the X selection, before it decodes it?

> I set it to "raw-text" and Emacs sees:

>     %/1€Œiso8859-15...

This appears to be the format for "Non-Standard Character Set
Encodings" (the relevant section is attached at the end).  I
heard that XFree86 has started to use it for several charsets
(e.g. Big5).

Unfortunately, Emacs's ctext decoder doesn't handle that
format.  To implement it, we need a consensus about which
string (in the part "iso8859-15" above) corresponds to which
encoding (or charset).  From the above example, it seems
that the CHARSET_REGISTRY-CHARSET_ENCODING part of the X
font name is used.  Is that generally correct?

---
Ken'ichi HANDA
handa@etl.go.jp

----------------------------------------------------------------------
6.  Non-Standard Character Set Encodings

Character set encodings that are not in the list of approved
standard encodings can be included using ``extended
segments''.  An extended segment begins with one of the
following sequences:

     01/11 02/05 02/15 03/00 M L   variable number of octets per character
     01/11 02/05 02/15 03/01 M L   1 octet per character
     01/11 02/05 02/15 03/02 M L   2 octets per character
     01/11 02/05 02/15 03/03 M L   3 octets per character
     01/11 02/05 02/15 03/04 M L   4 octets per character

[This uses the ``other coding system'' of ISO 2022, using
private Final characters.]

The ``M'' and ``L'' octets represent a 14-bit unsigned value
giving the number of octets that appear in the remainder of
the segment.  The number is computed as ((M - 128) * 128) +
(L - 128).  The most significant bits of M and L are always set
to one.  The remainder of the segment consists of two parts,
the name of the character set encoding and the actual text.
The name of the encoding comes first and is separated from
the text by the octet 00/02 (STX, START OF TEXT).  Note that
the length defined by M and L includes the encoding name and
separator.
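
The layout described above can be sketched as a small parser.  This
is only an illustration, not Emacs's ctext decoder; the function name
parse_extended_segment and the table OCTETS_PER_CHAR are made up for
the example.

```python
# Fourth octet of the introducer (03/00 .. 03/04) mapped to the number
# of octets per character; None stands for "variable".
OCTETS_PER_CHAR = {0x30: None, 0x31: 1, 0x32: 2, 0x33: 3, 0x34: 4}


def parse_extended_segment(data, pos=0):
    """Parse one extended segment starting at data[pos].

    Returns (encoding_name, octets_per_char, text_bytes, next_pos).
    Hypothetical helper, written only from the description above.
    """
    header = data[pos:pos + 6]
    # 01/11 02/05 02/15 is ESC % /
    if len(header) < 6 or header[:3] != b"\x1b%/":
        raise ValueError("not an extended segment introducer")
    if header[3] not in OCTETS_PER_CHAR:
        raise ValueError("unknown octets-per-character octet")
    m, l = header[4], header[5]
    if m < 0x80 or l < 0x80:
        raise ValueError("most significant bit of M and L must be set")
    # 14-bit length: covers the name, the STX separator and the text.
    length = (m - 128) * 128 + (l - 128)
    start = pos + 6
    body = data[start:start + length]
    if len(body) < length:
        raise ValueError("segment is truncated")
    # The name comes first and ends at the first 00/02 (STX).
    name, sep, text = body.partition(b"\x02")
    if not sep:
        raise ValueError("missing STX separator")
    return name.decode("latin-1"), OCTETS_PER_CHAR[header[3]], text, start + length
```

For instance, if the two odd characters after "%/1" in the raw string
quoted earlier stand for the octets 0x80 and 0x8C (an assumption about
how the mail was displayed), the length is (0 * 128) + 12 = 12, which
is exactly "iso8859-15" (10 octets) plus STX plus one octet of text.
Calling parse_extended_segment(b"\x1b%/1\x80\x8ciso8859-15\x02\xa4")
with a hypothetical text octet 0xA4 returns the name "iso8859-15", a
width of 1, and the single text octet.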

[The encoding of the length is chosen to avoid having zero
octets in Compound Text when possible, because embedded NUL
values are problematic in many C language routines.  The use
of zero octets cannot be ruled out entirely however, since
some octets in the actual text of the extended segment may
have to be zero.]
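
The effect of that choice can be seen in a short sketch of the
encoding direction.  The names encode_length and
build_extended_segment are hypothetical, chosen for this example
only.

```python
def encode_length(n):
    """Encode a 14-bit length as the two octets M and L.

    Both octets get their most significant bit set, so neither can
    ever be a zero octet, whatever the length is.
    """
    if not 0 <= n < (1 << 14):
        raise ValueError("length does not fit in 14 bits")
    return bytes([0x80 | (n >> 7), 0x80 | (n & 0x7F)])


def build_extended_segment(name, text, octets_per_char=None):
    """Build an extended segment for the given encoding name and text.

    octets_per_char is None for "variable", or 1 to 4.
    """
    final = {None: 0x30, 1: 0x31, 2: 0x32, 3: 0x33, 4: 0x34}[octets_per_char]
    # The length covers the encoding name, the STX separator and the text.
    body = name.encode("latin-1") + b"\x02" + text
    return b"\x1b%/" + bytes([final]) + encode_length(len(body)) + body
```

Even a length of zero comes out as the octets 0x80 0x80, so the
header itself never contains NUL; only the text part of the segment
can.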

The name of the encoding should be registered with the X
Consortium to avoid conflicts and should when appropriate
match the CharSet Registry and Encoding registration used in
the X Logical Font Description.  The name itself should be
encoded using ISO 8859-1 (Latin 1), should not use question
mark (03/15) or asterisk (02/10), and should use hyphen
(02/13) only in accordance with the X Logical Font
Description.
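
As a rough illustration of what matching the font description could
mean, the sketch below takes the last two fields of a full XLFD font
name (CHARSET_REGISTRY and CHARSET_ENCODING) and joins them with a
hyphen, and checks a candidate name against the restrictions listed
above.  Both helper names are hypothetical, and whether senders really
derive the name this way is an assumption.

```python
def xlfd_charset_name(font_name):
    """Return "REGISTRY-ENCODING" from a full XLFD font name."""
    fields = font_name.split("-")
    # A full XLFD name starts with a hyphen and has 14 fields, so the
    # split yields 15 parts, the first of them empty.
    if len(fields) != 15 or fields[0] != "":
        raise ValueError("not a full XLFD font name")
    return fields[13] + "-" + fields[14]


def is_plausible_encoding_name(name):
    """Check the restrictions on an encoding name described above."""
    try:
        name.encode("latin-1")  # must be expressible in ISO 8859-1
    except UnicodeEncodeError:
        return False
    # Question mark (03/15) and asterisk (02/10) are not allowed.
    return name != "" and "?" not in name and "*" not in name
```

With the font name
"-misc-fixed-medium-r-normal--13-120-75-75-c-70-iso8859-15" this
gives "iso8859-15", the same string that appears in the raw selection
data quoted earlier.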

Extended segments are not to be used for any character set
encoding that can be constructed from a GL/GR pair of
approved standard encodings. For example, it is incorrect to
use an extended segment for any of the ISO 8859 family of
encodings.

It should be noted that the contents of an extended segment
are arbitrary; for example, they may contain octets in the
C0 and C1 ranges, including 00/00, and octets comprising a
given character may differ in their most significant bit.

[ISO-registered ``other coding systems'' are not used in
Compound Text; extended segments are the only mechanism for
non-2022 encodings.]


