Re: Multibyte support (round 2)

From: Assaf Gordon
Subject: Re: Multibyte support (round 2)
Date: Sun, 4 Sep 2016 00:51:59 -0400


Attached is an updated multibyte patch.
'unorm' now works on Cygwin with UTF-16 surrogates.
'expand' is slightly improved.
A few other bugs are fixed.

Some more minutiae follow, either as a basis for discussion or, if there's no
disagreement, mainly to document what is supported at the moment.

General multibyte input parsing (in ./src/mbbuffer.{c,h})

1. Uses mbrtowc(3), can handle all locales, and does not assume any
specific implementation (e.g. UTF-8 input or wchar_t==UCS-4).

2. In UTF-8 locales, does *not* support "modified UTF-8" with
the null character embedded as '\xC0\x80'.
I haven't seen a libc implementation that supports it natively.
If this is required, we'll need a custom implementation or extra code.
Such input will be treated as an invalid multibyte sequence (twice,
once for each octet).
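
To illustrate the "twice, once for each octet" behaviour, here is a small
Python sketch. It uses Python's built-in strict UTF-8 decoder as a stand-in;
it is not the mbrtowc(3)-based code from the patch, and the helper name
decode_with_markers is made up for this example.

```python
# Illustration only: Python's strict UTF-8 decoder, not the patch's mbbuffer code.
MODIFIED_UTF8_NUL = b"\xc0\x80"   # "modified UTF-8" encoding of U+0000


def decode_with_markers(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each malformed subsequence."""
    return data.decode("utf-8", errors="replace")


decoded = decode_with_markers(MODIFIED_UTF8_NUL)
# 0xC0 can never start a valid UTF-8 sequence, and 0x80 is then a stray
# continuation octet, so the decoder reports two separate errors.
print([f"U+{ord(ch):04X}" for ch in decoded])
```

A decoder that accepted modified UTF-8 would instead yield a single U+0000
here.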

3. Support for Unicode code points in the UTF-16 surrogate range depends
on the libc implementation (NOTE: such codes should never appear in normal
input, but still...).
On glibc, this results in three consecutive "invalid multibyte sequence" errors:

    $ printf '\uD800\n' | ./src/mbbuffer-test -r
    ofs  line colB colC V wc(dec) wc(hex) Ch w n octets
    0    1    1    1    n       *       * *  * 1 0xed
    1    1    2    2    n       *       * *  * 1 0xa0
    2    1    3    3    n       *       * *  * 1 0x80
    3    1    4    4    y      10 0x0000a =  -1 1 0x0a

But on Cygwin, it is accepted as one valid wide character:

    $  printf '\uD800\n' | ./src/mbbuffer-test.exe -r
    ofs  line colB colC V wc(dec) wc(hex) Ch  W n octets
    0    1    1    1    y   55296 0x0d800 =  -1 3 0xed 0xa0 0x80
    3    1    4    2    y      10 0x0000a =  -1 1 0x0a
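
The same split between rejecting and accepting an encoded surrogate can be
reproduced outside libc. The sketch below is only an analogy, assuming
Python's codecs: the default strict decoder rejects the octets (as glibc
does), while the 'surrogatepass' error handler accepts them as one code point
(similar to what Cygwin shows above). The helper name try_strict is
hypothetical.

```python
# Analogy only: Python codecs, not glibc or Cygwin mbrtowc(3).
SURROGATE_OCTETS = b"\xed\xa0\x80"   # UTF-8-style encoding of U+D800


def try_strict(data: bytes):
    """Return the decoded text, or None if strict UTF-8 decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


strict_result = try_strict(SURROGATE_OCTETS)
lenient_result = SURROGATE_OCTETS.decode("utf-8", errors="surrogatepass")

print(strict_result)                 # rejected by the strict decoder
print(hex(ord(lenient_result)))      # accepted as a single code point
```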


1. I suspect the 'unicode normalization' feature will raise lots of
questions and bug reports, as the normalization rules can be quite
confusing (and inconsistent?).
One example (out of many):

U+FB01 LATIN SMALL LIGATURE FI (fi) is decomposed to 'f' and 'i':

  $ printf '\uFB01' | ./src/unorm -n nfkd \
                    | iconv -t UCS-2LE | od -An -tx2c
      0066    0069
     f  \0   i  \0

but U+00E6 LATIN SMALL LETTER AE (æ) was promoted to a 'real' character
and is never decomposed:

  $ printf '\u00e6' | ./src/unorm -n nfkd \
                    | iconv -t UCS-2LE | od -An -tx2c                           

There's not much we can do, except perhaps explain it or warn
that unicode-normalization is a complicated and delicate subject.
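
For readers who want to check the two cases without building 'unorm', the
same NFKD results can be seen with Python's unicodedata module. This is just
a cross-check against the Unicode data shipped with Python, not the code used
by 'unorm'; the helper name nfkd_codepoints is made up for this example.

```python
import unicodedata


def nfkd_codepoints(text: str) -> list:
    """Return the code points of the NFKD form of text, as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFKD", text)]


fi_result = nfkd_codepoints("\ufb01")   # LATIN SMALL LIGATURE FI
ae_result = nfkd_codepoints("\u00e6")   # LATIN SMALL LETTER AE

# The ligature carries a compatibility decomposition; the letter AE has none.
print(fi_result, repr(unicodedata.decomposition("\ufb01")))
print(ae_result, repr(unicodedata.decomposition("\u00e6")))
```

The difference comes straight from the character database: U+FB01 has a
<compat> decomposition mapping and U+00E6 has no decomposition mapping at all.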

2. 'unorm' will convert *every* octet of an invalid/incomplete
multibyte sequence into a U+FFFD 'REPLACEMENT CHARACTER'.

    # a 4-octet sequence:
    $ printf '\U0001F466' | od -An -to1
     360 237 221 246

    # without the last octet:
    $ printf '\360\237\221' | ./src/unorm

The page "UTF-8 decoder capability and stress test"
mentions in section 3.3 that:

>All bytes of an incomplete sequence should be signalled as a single malformed 

'unorm' clearly does not do this: it displays one marker for each invalid octet.
However, almost every implementation I encountered also displays three markers.
I think keeping it this way is more reasonable, as it closely corresponds to
'mbrtowc' reporting 'invalid sequence' three times.
It is also closer to the expectation of 'unorm' users who want to know all
invalid positions in their file (if we report only the first, they will need
additional work to find out how many octets are invalid).
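
The two conventions can be compared side by side in Python. This is a sketch
under assumptions: Python's built-in 'replace' handler happens to follow the
single-marker convention, and the custom handler below (registered under the
made-up name "replace_each_octet") emulates the per-octet reporting described
above. It is not how 'unorm' is implemented.

```python
import codecs


def replace_each_octet(err):
    """Decoding error handler: emit one U+FFFD per malformed octet."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # err.start/err.end delimit the malformed octets; resume after them.
    return ("\ufffd" * (err.end - err.start), err.end)


codecs.register_error("replace_each_octet", replace_each_octet)

TRUNCATED = b"\xf0\x9f\x91"   # U+1F466 without its last octet

one_marker = TRUNCATED.decode("utf-8", errors="replace")
per_octet = TRUNCATED.decode("utf-8", errors="replace_each_octet")

print(len(one_marker), len(per_octet))
```

With per-octet markers, the number of U+FFFD characters equals the number of
bad octets, so offsets in the output still line up with offsets in the input.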


1. multibyte-expand is based on 'wcwidth(3)'.
This simplifies the implementation, and works well for most cases (e.g.
non-spacing combining characters).
However, it fails for some newer characters.

Some examples:
The characters
   BOY (U+1F466)
   EMOJI MODIFIER FITZPATRICK TYPE-1-2 (U+1F3FB)
When displayed separately, each takes 1 space (wcwidth returns '1'):
   $ printf '\U0001F466 \U0001F3FB\n'                                           
   👦 🏻    
But when they follow one another, some terminals (and other gui programs)
can merge them into one:

   $ printf '\U0001F466\U0001F3FB\n'                                            

NOTE: whether the above is shown as a single character or as two characters
depends on your email client and operating system. It is rendered correctly at
least on the Mac OS X terminal.

Using 'wcswidth' on both characters together does not help, it still returns 

Another messy case: 'COMBINING ENCLOSING KEYCAP' (U+20E3) :

  $ printf 'a\u20E3aa\tb\naaaa\tb\n'
  a⃣aa     b
  aaaa    b

wcwidth returns '0' for this character, but on the terminal
(at least on Mac OS X) it is not rendered on the previous character
and thus consumes a column.

   $ printf 'aa\u20E0aa\tb\naaaa\tb\n'
   aa⃠aa    b
   aaaa    b

(NOTE: Whether the above renders correctly or not depends on your email client. 
Try it on the terminal too to see different rendering.)
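
To make the column arithmetic concrete, here is a rough Python sketch of tab
expansion driven by a per-character width function. Python has no wcwidth(3)
in its standard library, so approx_width below is an assumption: it
approximates the width from Unicode properties and may disagree with any
given libc. Both function names are made up, and this is not the patch's
implementation.

```python
import unicodedata


def approx_width(ch: str) -> int:
    """Rough stand-in for wcwidth(3); real libc tables may differ."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0                      # combining/enclosing marks, format chars
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide and fullwidth characters
    return 1                          # everything else (controls not handled)


def expand_tabs(line: str, tabsize: int = 8) -> str:
    """Replace each tab with spaces up to the next multiple of tabsize."""
    out, col = [], 0
    for ch in line:
        if ch == "\t":
            pad = tabsize - (col % tabsize)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += approx_width(ch)
    return "".join(out)


keycap_line = expand_tabs("a\u20e3aa\tb")   # U+20E3 counted as zero columns
plain_line = expand_tabs("aaaa\tb")

print(repr(keycap_line))
print(repr(plain_line))
```

Because U+20E3 is counted as zero columns, the first line gets five spaces of
padding and the second gets four. If the terminal draws the keycap in its own
cell, the first 'b' ends up one column to the right of the second, which is
the misalignment described above.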

In summary, I think that as long as multibyte-expand relies on 'wcwidth',
there's a limit to how well the expansion can work. If we switch to a more
complicated implementation based on Unicode properties, perhaps
it will be possible to improve the output, but it will still depend on
the capabilities of the underlying operating system, and on whether the output
is viewed on a terminal or in a GUI program.

3. I have not yet tested multibyte-expand with CJKV characters, where 'wcwidth' 
can return 2.

Comments are very welcome,

 - assaf

Attachment: multibyte-2016-09-03.patch.xz
Description: Binary data
