[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistrin
From: |
Bruno Haible |
Subject: |
[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring |
Date: |
Sat, 13 Nov 2010 18:53:27 +0100 |
User-agent: |
KMail/1.9.9 |
Hello Ben,
> But it has only the u8_prev() function for iterating backward.
> That function has the pitfall that it only operates on
> well-formed UTF-8 sequences
Indeed. I'm adding a note about it to the manual.
> (the manual also implies that it only
> works on null-terminated UTF-8 strings, but in fact it doesn't).
Indeed. But I wanted to have u8_prev documented near u8_next, and
u8_next _does_ assume a NUL-terminated UTF-8 string.
> Consider how u8_mbtouc() treats ill-formed sequences.
> There are three cases (the examples show byte sequences and the
> code points for the immediately preceding bytes in parentheses):
>
> (a) For an incomplete sequence, it reports the whole incomplete
> sequence as a single code point U+FFFD.
>
> e0 a0 (U+FFFD)
>
> (Only 2 bytes in 3-byte sequence.)
>
> (b) For a sequence that is invalid in UTF-8, but that would be
> valid if overlong sequences or invalid Unicode code points
> were allowed, it reports the whole invalid sequence as a
> single code point U+FFFD.
>
> e0 80 (U+FFFD)
>
> (Not a prefix of any valid UTF-8 sequence, because the
> second byte must be at 0xa0 when the first byte is 0xe0.)
>
> f5 80 80 (U+FFFD)
>
> (This would be greater than the maximum code point U+10FFFF
> if it was allowed.)
>
> (c) For an invalid (but complete) sequence, it reports each
> byte as a separate code point U+FFFD.
>
> c0 (U+FFFD) 41 (U+0041)
>
> (c0 never appears in UTF-8)
>
> e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)
>
> (This would be a UTF-16 surrogate if it was allowed.)
>
> e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)
>
> (e1 starts a 3-byte sequence but 00 is invalid as the
> third byte.)
Part (c) is actually a bug. Now I'm looking at Markus Kuhn's recommendations
how to parse UTF-8
<http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>
<http://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html>
and am finding that u8_mbtouc needs to be corrected so that in case (c)
the results are:
c0 (U+FFFD) 41 (U+0041)
e1 e1 80 (U+FFFD)
e0 a0 (U+FFFD) 00 (U+0000)
> I came up with two requirements:
>
> 1. Forward and backward iteration of the same bytes report the
> same code points (in opposite order, of course).
>
> 2. Backward iteration does not require any more context than
> forward iteration does.
> ...
> I believe that a function for backward
> iteration can satisfy either property 1 or property 2 above, but
> not both
Thanks for investigating this in so much depth. I probably guessed
that it was this way when I wrote the u8-prev.c code, but I'm glad
you have thought through it!
> Given these three cases, I believe that a function for backward
> iteration can satisfy either property 1 or property 2 above, but
> not both:
>
> * Clearly it is possible to satisfy property 1, ignoring
> efficiency: iterate forward through the entire UTF-8 byte
> sequence, recording the code points and their positions as
> they are generated, and then return them during backward
> iteration. But such an implementation would use extra state
> not allowed by property 2.
>
> * If the backward iteration function only uses the same amount
> of state as u8_mbtouc(), that is, a pointer to the beginning
> of the string and the number of bytes in it, then this
> satisfies property 2. But some of the cases above cannot be
> distinguished. For example, consider the sequence e0 a0 00:
>
> - In forward iteration, this is case (c) for u8_mbtouc(),
> and each byte will be treated as a separate code point.
>
> - In backward iteration, after returning the final byte 00
> as code point U+0000, the reverse iteration code cannot
> tell whether the remaining bytes e0 a0 should be treated
> as case (a), a single U+FFFD code point, or case (c), two
> separate U+FFFD code points, because it does not know
> whether following bytes were trimmed off earlier.
> Distinguishing these cases requires additional state,
> which would violate property 2.
After correcting the forward iteration, the answer for e0 a0 00
is clear: e0 a0 must be grouped together, for a single U+FFFD.
Are there other cases where the forward iteration behaviour does
not allow an equivalent O(1) backward iteration?
> So, I'd like to propose the following:
>
> 1. Change libunistring functions that detect ill-formed UTF-8
> sequences to return each byte of an ill-formed sequence as a
> separate U+FFFD code point (and for case (b) to return these
> even when, e.g. e0 80 is seen but a third byte isn't
> available). (This actually simplifies code.)
No, according to the guidelines set out by Markus Kuhn and republished
by the W3C it is better to return a single U+FFFD for a sequence like
e0 80 or e0 80 bf.
But it is well possible that with this changed behaviour of u8_mbtouc
you can write an u8_prev function that also works for invalid input
and yet satisfies both of your requirements. No?
> 2. Add libunistring functions to get the last code point out of
> a UTF-8 string. Tentatively I was planning to add a "r"
> prefix (e.g. u8_rmbtouc(), u8_rmbtoucr()) but other
> conventions are fine too.
'r' is already used as a suffix, therefore I would not use that.
Maybe u8_mb_prev_uc is a better name for such a function?
> Bruno, does this sound like a worthwhile project, and would you
> accept this kind of contribution, if it was written following the
> existing libunistring conventions, etc.?
Yes, if it satisfies your two requirements and is consistent with the
new forward iteration behaviour (modified as of today).
Bruno