Re: [coreutils] Re: [PATCH] join: support multi-byte character encodings
From: Eric Blake
Subject: Re: [coreutils] Re: [PATCH] join: support multi-byte character encodings
Date: Mon, 20 Sep 2010 09:06:10 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3
On 09/19/2010 07:13 AM, Bruno Haible wrote:
> Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
> in 2001. It is this requirement which forbids converting the input to a wchar_t
> stream, doing processing with wchar_t objects, and producing a stream of wchar_t
> objects that are finally converted to multibyte representation again.
This is especially true on platforms like Cygwin, where sizeof(wchar_t) is 2,
so you already have the complication of dealing with surrogate pairs to
represent all possible Unicode characters (that is, Cygwin disobeys the
rule that there is a 1-to-1 mapping between characters and wchar_t,
since some characters require 2 wchar_t objects).
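To make the complication concrete (a minimal sketch, not from the original
exchange): a code point outside the Basic Multilingual Plane does not fit in a
16-bit wchar_t and must be split into two, so any code assuming one wchar_t per
character silently breaks.

  #include <stdio.h>
  #include <stdint.h>

  int
  main (void)
  {
    /* A character outside the BMP; the choice is arbitrary.  */
    uint32_t cp = 0x1D11E;                  /* MUSICAL SYMBOL G CLEF */
    uint32_t v = cp - 0x10000;
    uint16_t hi = 0xD800 + (v >> 10);       /* high (lead) surrogate */
    uint16_t lo = 0xDC00 + (v & 0x3FF);     /* low (trail) surrogate */
    printf ("U+%05X -> 0x%04X 0x%04X\n", cp, hi, lo);
    return 0;
  }

This prints "U+1D11E -> 0xD834 0xDD1E": one character, two wchar_t values.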
> It is this requirement which also forbids converting the input to UTF-8, doing
> processing with Unicode characters, and converting the Unicode character stream
> to multibyte representation at the end. This approach is acceptable for a word
> processor that can refuse to open a file, or for more general applications.
> But for coreutils, where classical behaviour is to get reasonable processing in
> the "C" locale of files encoded in UTF-8, EUC-JP, or ISO-8859-2, this approach
> cannot be taken.
Ah, but Cygwin's approach is to convert invalid byte sequences into the
second half of a Unicode surrogate pair. This is still recognizable in
UTF-8 processing as an invalid character, but has the advantage that it
can still be handled like any other valid UTF-8 encoding for determining
how many bytes form each processing unit, and can be mapped 1-to-1 back
to the original invalid byte sequence. Thus, any byte sequence from any
locale can be converted into this extended UTF-8 scheme, operations
performed in UTF-8, and the result finally mapped back to the original
locale, with the invalid byte sequences of the original locale left
untouched by the UTF-8 processing.
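A minimal sketch of such an escape scheme (the exact mapping Cygwin uses is an
assumption here; Python's PEP 383 "surrogateescape" uses the same idea): each
invalid input byte becomes a lone low surrogate, which valid text never
contains, so it survives the UTF-8 processing stage and converts back 1-to-1.

  #include <stdint.h>
  #include <assert.h>

  /* Map an invalid input byte (0x80..0xFF) to a lone low surrogate.  */
  static uint32_t
  escape_byte (uint8_t b)
  {
    return 0xDC00u + b;
  }

  /* True if a code point is one of our escape values.  */
  static int
  is_escaped_byte (uint32_t c)
  {
    return 0xDC80u <= c && c <= 0xDCFFu;
  }

  /* Recover the original byte for the final conversion back.  */
  static uint8_t
  unescape_byte (uint32_t c)
  {
    assert (is_escaped_byte (c));
    return (uint8_t) (c - 0xDC00u);
  }

  int
  main (void)
  {
    uint8_t bad = 0xC0;                 /* 0xC0 never starts valid UTF-8 */
    uint32_t c = escape_byte (bad);
    assert (is_escaped_byte (c));
    assert (unescape_byte (c) == bad);  /* round-trips 1-to-1 */
    return 0;
  }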
> For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile',
> which provide a "multibyte character" datatype that also accommodates invalid
> byte sequences.
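As a sketch of what that datatype buys you (assuming the macro names in
gnulib's mbiter.h; this example is mine, not from the original mail):
iteration yields one processing unit per character, and an invalid byte
sequence is just another unit rather than a fatal error.

  #include <stddef.h>
  #include "mbiter.h"   /* from gnulib's mbiter module */

  /* Count the multibyte "processing units" in BUF; an invalid byte
     sequence still counts as a unit instead of aborting the scan.  */
  static size_t
  count_mb_units (const char *buf, size_t len)
  {
    size_t n = 0;
    mbi_iterator_t iter;
    for (mbi_init (iter, buf, len); mbi_avail (iter); mbi_advance (iter))
      n++;
    return n;
  }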
> Emacs handles this requirement by extending UTF-8. But this approach is unique
> to Emacs: libunistring and other software support plain UTF-8, not extended
> UTF-8.
Does it make sense to add some extended UTF-8 support into libunistring,
then?
--
Eric Blake address@hidden +1-801-349-2682
Libvirt virtualization library http://libvirt.org