Re: guarantees of u8_mbtouc/u8_strmbtouc

From: Paolo Bonzini
Subject: Re: guarantees of u8_mbtouc/u8_strmbtouc
Date: Sat, 31 Jul 2010 22:29:07 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100621 Fedora/3.0.5-1.fc13 Lightning/1.0b2pre Thunderbird/3.0.5

On 07/31/2010 10:24 PM, Bruno Haible wrote:
> Hi Paolo,
>
>> Still, without safety u8_strmbtouc(puc, s) uses the same code as
>> u8_mbtouc(puc, s, SIZE_MAX), which makes pretty much my point.  I think
>> it is safe and actually very useful to document u8_mbtouc/u16_mbtouc as
>> looking only one byte (resp. one short) beyond the first complete character.
>
> I find it better to have clear specifications that the programmer can easily
> remember. The libunistring manual [1] states:
>    "Argument pairs (s, n) denote a string s[0..n-1] with exactly n units."
>
> If we were to document "u8_mbtouc accesses only as many bytes as the first
> Unicode character makes up", the question immediately comes up: what about
> invalid and incomplete Unicode characters? Like
>     { 0xC3 }, n = 1
> or { 0xE4, 0x30 }, n = 2.
> You see how such a definition quickly gets ambiguous. Such ambiguities later
> lead to bugs in the programs.

"u8_mbtouc will never access more than N bytes. However, as an additional guarantee, u8_mbtouc only accesses as many bytes as necessary to decode the first Unicode character, or to ascertain that S does not begin with a valid UTF-8 sequence."

>> This is exactly what the code does.
>
> The code may be changed in the future. If a guarantee is not documented AND
> checked by the test suite, you cannot rely on it.

Of course, that's why I'm suggesting a modification to the specification.
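
To make the two problem inputs from the quoted text concrete, here is a minimal sketch of a decoder that reports how many bytes it actually read. It is not libunistring's code: sketch_mbtouc and its extra "examined" parameter are hypothetical names introduced only for illustration, and the return values chosen for invalid input (skip the bytes before the offending one) are an assumption, not the documented behaviour of u8_mbtouc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT libunistring's u8_mbtouc.
   Decodes the first character of s[0..n-1] into *puc (U+FFFD on error),
   stores in *examined the number of bytes it read, and returns the number
   of bytes the caller should skip.  Requires n > 0.  */
static int
sketch_mbtouc (uint32_t *puc, const uint8_t *s, size_t n, size_t *examined)
{
  assert (n > 0);

  uint8_t c = s[0];
  uint8_t lo = 0x80, hi = 0xBF;   /* allowed range for the second byte */
  size_t need;                    /* length announced by the lead byte */
  uint32_t uc;

  *examined = 1;
  if (c < 0x80)
    {
      *puc = c;                   /* ASCII: one byte, nothing else read */
      return 1;
    }
  else if (c >= 0xC2 && c <= 0xDF)
    { need = 2; uc = c & 0x1F; }
  else if (c >= 0xE0 && c <= 0xEF)
    {
      need = 3; uc = c & 0x0F;
      if (c == 0xE0) lo = 0xA0;   /* excludes overlong forms */
      if (c == 0xED) hi = 0x9F;   /* excludes surrogates */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      need = 4; uc = c & 0x07;
      if (c == 0xF0) lo = 0x90;   /* excludes overlong forms */
      if (c == 0xF4) hi = 0x8F;   /* excludes values above U+10FFFF */
    }
  else
    {
      *puc = 0xFFFD;              /* invalid lead byte */
      return 1;
    }

  for (size_t i = 1; i < need; i++)
    {
      if (i == n)
        {
          *puc = 0xFFFD;          /* incomplete: cut off by n, s[n] not read */
          return (int) n;
        }
      uint8_t t = s[i];
      *examined = i + 1;
      if (t < lo || t > hi)
        {
          *puc = 0xFFFD;          /* invalid: stop at the offending byte */
          return (int) i;
        }
      uc = (uc << 6) | (t & 0x3F);
      lo = 0x80; hi = 0xBF;       /* later bytes use the plain range */
    }

  *puc = uc;
  return (int) need;
}
```

Under these assumptions, { 0xC3 }, n = 1 reads one byte and stops because n is exhausted, while { 0xE4, 0x30 }, n = 2 reads two bytes and stops at 0x30, which cannot be a continuation byte. In both cases the decoder reads no more than n bytes and no more than it needs to reach a verdict, which is the behaviour the proposed wording tries to pin down.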