
From: Eric Blake
Subject: Re: [coreutils] Re: [PATCH] join: support multi-byte character encodings
Date: Mon, 20 Sep 2010 09:06:10 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3

On 09/19/2010 07:13 AM, Bruno Haible wrote:
> Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
> in 2001. It is this requirement which forbids converting the input to a wchar_t
> stream, doing processing with wchar_t objects, and producing a stream of wchar_t
> objects that are finally converted to multibyte representation again.

Particularly on platforms like Cygwin, where sizeof(wchar_t) is 2: there you already have the complication of dealing with surrogate pairs to represent all possible Unicode characters (that is, Cygwin disobeys the rule of a 1-to-1 mapping between characters and wchar_t, since some characters require 2 wchar_t units).


> It is this requirement which also forbids converting the input to UTF-8, doing
> the processing with Unicode characters, and converting the Unicode character
> stream back to multibyte representation at the end. This approach is acceptable
> for a word processor that can refuse to open a file, or for more general
> applications. But for coreutils, where classical behaviour is to get reasonable
> processing in the "C" locale of files encoded in UTF-8, EUC-JP, or ISO-8859-2,
> this approach is not an option.

Ah, but Cygwin's approach is to convert invalid byte sequences into the second half of a Unicode surrogate pair. This is still recognizable in UTF-8 processing as an invalid character, but has the advantage that it can be handled like any other valid UTF-8 encoding when determining how many bytes form each processing unit, and can be mapped 1-to-1 back to the original invalid byte sequence. Thus, any byte sequence from any locale can be converted into this extended UTF-8 scheme, the operations performed in UTF-8, and the result finally mapped back to the original locale, with the invalid byte sequences of the original locale passing through the UTF-8 processing untouched.

> For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile',
> which provide a "multibyte character" datatype that also accommodates invalid
> byte sequences.

> Emacs handles this requirement by extending UTF-8. But this approach is unique
> to Emacs: libunistring and other software support plain UTF-8, not extended
> UTF-8.

Does it make sense to add some extended UTF-8 support into libunistring, then?

--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


