[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistrin

bug-libunistring

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistrin

From:	Bruno Haible
Subject:	[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
Date:	Sat, 13 Nov 2010 18:53:27 +0100
User-agent:	KMail/1.9.9

Hello Ben,

> But it has only the u8_prev() function for iterating backward.
> That function has the pitfall that it only operates on
> well-formed UTF-8 sequences

Indeed. I'm adding a note about it to the manual.

> (the manual also implies that it only 
> works on null-terminated UTF-8 strings, but in fact it doesn't).

Indeed. But I wanted to have u8_prev documented near u8_next, and
u8_next _does_ assume a NUL-terminated UTF-8 string.

> Consider how u8_mbtouc() treats ill-formed sequences.
> There are three cases (the examples show byte sequences and the
> code points for the immediately preceding bytes in parentheses):
> 
>   (a) For an incomplete sequence, it reports the whole incomplete
>       sequence as a single code point U+FFFD.
> 
>         e0 a0 (U+FFFD)
> 
>           (Only 2 bytes in 3-byte sequence.)
> 
>   (b) For a sequence that is invalid in UTF-8, but that would be
>       valid if overlong sequences or invalid Unicode code points
>       were allowed, it reports the whole invalid sequence as a
>       single code point U+FFFD.
> 
>         e0 80 (U+FFFD)
> 
>           (Not a prefix of any valid UTF-8 sequence, because the
>           second byte must be at 0xa0 when the first byte is 0xe0.)
> 
>         f5 80 80 (U+FFFD)
> 
>           (This would be greater than the maximum code point U+10FFFF
>           if it was allowed.)
> 
>   (c) For an invalid (but complete) sequence, it reports each
>       byte as a separate code point U+FFFD.
> 
>         c0 (U+FFFD) 41 (U+0041)
> 
>           (c0 never appears in UTF-8)
> 
>         e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)
> 
>           (This would be a UTF-16 surrogate if it was allowed.)
> 
>         e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)
> 
>           (e1 starts a 3-byte sequence but 00 is invalid as the
>           third byte.)

Part (c) is actually a bug. Now I'm looking at Markus Kuhn's recommendations
how to parse UTF-8
   <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>
   <http://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html>
and am finding that u8_mbtouc needs to be corrected so that in case (c)
the results are:

          c0 (U+FFFD) 41 (U+0041)
          e1 e1 80 (U+FFFD)
          e0 a0 (U+FFFD) 00 (U+0000)

> I came up with two requirements:
> 
>   1. Forward and backward iteration of the same bytes report the
>      same code points (in opposite order, of course).
> 
>   2. Backward iteration does not require any more context than
>      forward iteration does.
> ...
> I believe that a function for backward
> iteration can satisfy either property 1 or property 2 above, but
> not both

Thanks for investigating this in so much depth. I probably guessed
that it was this way when I wrote the u8-prev.c code, but I'm glad
you have thought through it!

> Given these three cases, I believe that a function for backward
> iteration can satisfy either property 1 or property 2 above, but
> not both:
> 
>   * Clearly it is possible to satisfy property 1, ignoring
>     efficiency: iterate forward through the entire UTF-8 byte
>     sequence, recording the code points and their positions as
>     they are generated, and then return them during backward
>     iteration.  But such an implementation would use extra state
>     not allowed by property 2.
> 
>   * If the backward iteration function only uses the same amount
>     of state as u8_mbtouc(), that is, a pointer to the beginning
>     of the string and the number of bytes in it, then this
>     satisfies property 2.  But some of the cases above cannot be
>     distinguished.  For example, consider the sequence e0 a0 00:
> 
>       - In forward iteration, this is case (c) for u8_mbtouc(),
>         and each byte will be treated as a separate code point.
> 
>       - In backward iteration, after returning the final byte 00
>         as code point U+0000, the reverse iteration code cannot
>         tell whether the remaining bytes e0 a0 should be treated
>         as case (a), a single U+FFFD code point, or case (c), two
>         separate U+FFFD code points, because it does not know
>         whether following bytes were trimmed off earlier.
>         Distinguishing these cases requires additional state,
>         which would violate property 2.

After correcting the forward iteration, the answer for e0 a0 00
is clear: e0 a0 must be grouped together, for a single U+FFFD.

Are there other cases where the forward iteration behaviour does
not allow an equivalent O(1) backward iteration?

> So, I'd like to propose the following:
> 
>   1. Change libunistring functions that detect ill-formed UTF-8
>      sequences to return each byte of an ill-formed sequence as a
>      separate U+FFFD code point (and for case (b) to return these
>      even when, e.g. e0 80 is seen but a third byte isn't
>      available).  (This actually simplifies code.)

No, according to the guidelines set out by Markus Kuhn and republished
by the W3C it is better to return a single U+FFFD for a sequence like
e0 80 or e0 80 bf.

But it is well possible that with this changed behaviour of u8_mbtouc
you can write an u8_prev function that also works for invalid input
and yet satisfies both of your requirements. No?

>   2. Add libunistring functions to get the last code point out of
>      a UTF-8 string.  Tentatively I was planning to add a "r"
>      prefix (e.g. u8_rmbtouc(), u8_rmbtoucr()) but other
>      conventions are fine too.

'r' is already used as a suffix, therefore I would not use that.
Maybe u8_mb_prev_uc is a better name for such a function?

> Bruno, does this sound like a worthwhile project, and would you
> accept this kind of contribution, if it was written following the
> existing libunistring conventions, etc.?

Yes, if it satisfies your two requirements and is consistent with the
new forward iteration behaviour (modified as of today).

Bruno

[Prev in Thread]

Current Thread

[Next in Thread]

[bug-libunistring] UTF-8 backward iteration proposal for libunistring, Ben Pfaff, 2010/11/13
- Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring, Paolo Bonzini, 2010/11/13
  - Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring, Ben Pfaff, 2010/11/13
    - Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring, Bruno Haible, 2010/11/13
- [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Bruno Haible <=
  - [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Ben Pfaff, 2010/11/13
    - [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Bruno Haible, 2010/11/13
    - [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Ben Pfaff, 2010/11/13

Prev by Date: Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring
Next by Date: Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring
Previous by thread: Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring
Next by thread: [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
Index(es):
- Date
- Thread