[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Using libunistring for string comparisons et al

From: Mike Gran
Subject: Re: Using libunistring for string comparisons et al
Date: Tue, 15 Mar 2011 13:39:02 -0700 (PDT)

> From:Mark H Weaver <address@hidden>
> Mike Gran <address@hidden> writes:
> > We do, in a matter of speaking, have a single string representation:
> > UTF-32.  The 'narrow' encoding is UTF-32 with the initial 3 bytes 
> of
> > zero removed.
> Despite the similarity of these two representations, they are
> sufficiently different that they cannot be handled by the same machine
> code.  That means you must either implement multiple inner loops, one
> for each combination of string parameter representations, or else you
> must dispatch on the string representation within the inner loop.  On
> modern architectures, wrongly predicted conditional branches are very
> expensive.

Keep in mind that the UTF-8 forward iterator operation has conditional
branches.  Merely the act of advancing from one character to another
could take one of four paths, or more if you include the possibility
of invalid UTF-8 sequences.

Also, you need to wrap what libunistring does into your calculations
of what is expensive.

> > I actually at one point had a nearly complete version of Guile 1.8
> > that used UTF-8 and another that used UTF-32.  There are some
> > other reasons why UTF-8 is bad, which I could bore you with
> > ad naseum.
> Can you please tell me why UTF-8 is bad, or point me to something that
> explains it?  Everything I have found suggests that UTF-8 is very good.


Well, we covered O(1) vs O(n).  To make UTF-8 O(1), you need to store
additional indexing information of some sort.  There are various schemes,
but, depending the the scheme, you lose some of memory advantage of UTF-8
vs UTF-32.  You can likely to better than UTF-32, though.

Next, you have the question of string character validation.  For example,
in Latin-1, all 8-bit values are valid characters.  In UTF-32, not all
32-bit values are valid codepoints: anything over 10FFFF or in the surrogate
region D800 to D8FF is invalid.  Likewise, in UTF-8, you can have invalid
sequences.  You can also have valid UTF-8 sequences that point to the
invalid codepoints in the surrogate region.  In our current scheme, we
test the validity the codepoints in UTF-32 strings when the strings
are created.

So, one of the supposed benefits of UTF-8 strings for Guile is that you
can more easily pass information back and forth between Guile and external
libraries or programs.  But in that case, when UTF-8 strings are received
by a UTF-8-enabled Guile from an external source,
you have to deal with both its O(1)-ness, the validity of its UTF-8
sequences, and the validity of the codepoints that the UTF-8 sequences
decode to (if the string is passed back from program or library that you
don't completely trust.)

So the question becomes *when* you'd deal with those questions.  The best
policy would be to deal with them at the point a UTF-8 string enters your
application.  When a string enters your application, you should scrub it
for invalid sequences and codepoints, throwing errors or replacing
bad characters with the Unicode replacement character and whatnot.

If you don't deal with O(1) indexing and invalid sequences at the point
a UTF-8 string enters the system, you will have a couple of
interesting effects.  First, UTF-8 iterators that have to deal with the
possibility of invalid sequences are less efficient that those that can
trust the validity of a sequence, because of the extra code involved.

And then what do you do when you find an invalid sequence?  If you throw
an error, will there the possibility of non-local exits at any time there
is a string iteration?  (You can't replace bad sequences with the
the Unicode replacement character on the fly as bad sequences are discovered
during iteration, because the string could be read-only or be being used
by another thread.)

And then there is a code-correctness argument against UTF-8.  It is just
too easy to confuse a string's codepoint count with its memory size, and
it is just too easy to confuse iteration over bytes with iteration over
characters.  I know that the Guile maintainers are geniuses, but, they
are still human.  (I don't have any real data for this, but anecdotally
it seems that languages that have gone the ISO-8859-n to 8-bit clean to
UTF-8 upgrade route seem to have taken longer to get up to speed.)  But
this argument, for Guile, is now largely moot, since much of the string
access is through encoding-agnostic accessor functions.

None of this is insurmountable, of course; all of this could be overcome,
if the costs outweighed the benefits.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]