
From: Eric Blake
Subject: Re: [coreutils] Re: [PATCH] join: support multi-byte character encodings
Date: Mon, 20 Sep 2010 09:06:10 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3

On 09/19/2010 07:13 AM, Bruno Haible wrote:
> Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
> in 2001. It is this requirement which forbids converting the input to a wchar_t
> stream, doing processing with wchar_t objects, and producing a stream of wchar_t
> objects that are finally converted to multibyte representation again.

Particularly on platforms like Cygwin, where sizeof(wchar_t) is 2: there you already have the complication of dealing with surrogate pairs to represent all possible Unicode characters (that is, Cygwin disobeys the rule of a 1-to-1 mapping between characters and wchar_t, since some characters require 2 wchar_t units).


> It is this requirement which also forbids converting the input to UTF-8, doing
> the processing with Unicode characters, and converting the Unicode character
> stream back to multibyte representation at the end. This approach is acceptable
> for a word processor that can refuse to open a file, or for more general
> applications. But for coreutils, where classical behaviour is to get reasonable
> processing in the "C" locale of files encoded in UTF-8, EUC-JP, or ISO-8859-2,
> this approach is not an option.

Ah, but Cygwin's approach is to convert invalid byte sequences into the second half of a Unicode surrogate pair. This is still recognizable in UTF-8 processing as an invalid character, but has the advantage that it can be handled like any other valid UTF-8 encoding when determining how many bytes form each processing unit, and can be mapped 1-to-1 back to the original invalid byte sequence. Thus, any byte sequence from any locale can be converted into this extended UTF-8 scheme, the operations performed in UTF-8, and the result finally mapped back to the original locale, with the invalid byte sequences of the original locale passing through the UTF-8 processing untouched.

> For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile',
> which provide a "multibyte character" datatype that also accommodates invalid
> byte sequences.

> Emacs handles this requirement by extending UTF-8. But this approach is unique
> to Emacs: libunistring and other software support plain UTF-8, not extended
> UTF-8.

Does it make sense to add some extended UTF-8 support into libunistring, then?

--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


