
[bug-libunistring] roundtrippable encoding support


From: David Kastrup
Subject: [bug-libunistring] roundtrippable encoding support
Date: Thu, 09 Oct 2014 18:04:02 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Hi,

I'm probably going to get a lot of heat for this, but...  At the moment
there is a discussion raging on (or by now raging off) regarding string
and transfer encoding strategies, partly based on differing philosophies
between Emacs and GUILE.

Now GUILE internally employs libunistring, so its design decisions
obviously strongly favor possibilities available in libunistring.

Looking at the libunistring documentation, I find that the available
options actually do not make Emacs-like behavior feasible in an
efficient manner.

Now the essence of the situation boils down to the most common encodings
of Emacs being round-trippable: if I load a file declared to be "utf-8",
I can save it back as "utf-8", and even if the file came straight from
/dev/random, this reproduces the original byte sequence regardless of
whether it constitutes valid utf-8 or not.

The way this is done in multibyte strings is that any stray byte not
part of a valid minimal utf-8 sequence (such bytes necessarily have the
values 128-255, as ASCII bytes are always valid on their own) is encoded
as a 2-byte overlong representation of the character codes 0-127, namely
0xc0 0x80 to 0xc1 0xbf.  The growth factor of a file is therefore
bounded (at most a factor of 2).

When interpreted as character codes, these patterns representing single
bytes are outside of the range covered by Unicode (starting at 0x3fff80,
actually).  Emacs does actually support character codes in that range:
0x3fff00 is still encodable and represented with the byte pattern

0xf8 0x8f 0xbf 0xbc 0x80

namely a 5-byte sequence in the basic UTF-8 encoding scheme.  I think
that the Emacs character set ends with the last 128 characters encoded
as 2-byte sequences.  Emacs uses the extended character ranges beyond
Unicode to represent various Asian character sets that are, according to
users of those character sets, not adequately represented in Unicode.

Now that extended character range is quite specific to Emacs.

What I am actually more interested in is having libunistring offer
"roundtrippable" encodings as a fallback for decoding errors.
Basically, I want a decoding option where libunistring announces "what
you have here is not valid utf-8, but I know how to deal with it",
including reencoding, and delivering unique "character codes" and
consistent string length calculations.  The application would either
keep track of having received "dirty utf-8" and reencode when putting
out utf-8 (where reencoding "internal utf-8" to "external utf-8" means
replacing each 2-byte sequence representing a wild byte by the original
byte), or it would reencode into "external" utf-8 when writing anyway,
which would not change anything for originally valid utf-8.

The basic point would be to be able to process any input assuming a
specified locale with graceful degradation where the locale assumption
is violated.  For example, a regular expression replacement of text from
a mixed text/binary file (like PostScript often is) without affecting
the binary passages.

Not requiring a latin-1 interpretation of the input in order to use the
internal UTF-8-based string processing for lossless handling of input
from a file, a terminal, a network connection, or other sources provides
additional flexibility for an application using libunistring.

The support would basically come in 3 parts:

a) decoding and encoding strategies that allow "escape code"
representation of raw bytes not fitting into regular UTF-8.

b) a unique character code returned when such an escape sequence is
converted into a character code

c) guarantees about the processing of those sequences; these are most
likely already met, since the sequences fit into the normal patterns of
UTF-8 encoding reasonably well.

Character ranges in regular expressions, upper- and lowercasing and some
other operations strongly related to character code points would likely
require checking and possibly changes.

Now I cannot vouch for the actual interest of the GUILE developers in a
roundtripping coding system and/or conversions.  I suspect this is also
a chicken-and-egg situation where availability in libunistring would
change the perception of desirability.

Independent of the potential use in GUILE, I would think that this sort
of functionality is desirable whenever one is integrating libunistring
not into a basic text processing application but rather into a
programming platform.  In that case, an internal representation that can
reflect arbitrary input accurately, even when basically interpreted as
utf-8, seems like a definite advantage to me.

Thoughts?

-- 
David Kastrup


