Using libunistring for string comparisons et al
From: Mark H Weaver
Subject: Using libunistring for string comparisons et al
Date: Fri, 11 Mar 2011 17:33:47 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
Mike Gran <address@hidden> writes:
> [...] But doing the upper->lower operation picks
> up a few more of the corner cases, like U+03C2 GREEK
> SMALL LETTER FINAL SIGMA and U+03C3 GREEK SMALL LETTER SIGMA
> which are the same letter with different representations,
> or U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU
> which are supposed to have the same sort ordering.
Ah, okay. Makes sense.
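
As a rough illustration of the corner cases Mike describes, here is a small Python sketch (not Guile or libunistring code). It assumes Python's built-in `str.upper()` and `str.lower()` behave like a per-character upper->lower round trip for these particular code points; Python applies full case mappings, so it is only a stand-in for `uc_tolower (uc_toupper (x))`. The helper name `upper_then_lower` is hypothetical.

```python
def upper_then_lower(ch: str) -> str:
    """Map a single character to upper case, then back to lower case."""
    return ch.upper().lower()

FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA
MICRO_SIGN = "\u00b5"   # MICRO SIGN
SMALL_MU = "\u03bc"     # GREEK SMALL LETTER MU

# Lowercasing alone leaves each pair distinct, because all four
# characters are already lower case.
sigma_equal_lower_only = FINAL_SIGMA.lower() == SIGMA.lower()
mu_equal_lower_only = MICRO_SIGN.lower() == SMALL_MU.lower()

# Going through upper case first collapses each pair onto one character.
sigma_equal_round_trip = upper_then_lower(FINAL_SIGMA) == upper_then_lower(SIGMA)
mu_equal_round_trip = upper_then_lower(MICRO_SIGN) == upper_then_lower(SMALL_MU)
```
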
> Now that we've pulled in all of libunistring, it might
> be a good idea to see if it has a complete implementation
> of unicode case folding, because upper->lower is also not
> completely correct.
I looked into this. Indeed, the libunistring documentation mentions
that in some languages (e.g. German), the to_upper and to_lower
conversions cannot be done properly on a per-character basis, because
the number of characters can change. These operations must be done on an
entire string. For example:
<http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html>
(string-upcase "Straße") => "STRASSE"
(string-foldcase "Straße") => "strasse"
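
The same effect can be reproduced outside Scheme. The following is a minimal Python sketch, assuming Python's `str.upper()` and `str.casefold()` as stand-ins for the R6RS procedures above; it is not Guile's implementation. It shows that the result has a different number of characters than the input, which is why a one-character-in, one-character-out mapping cannot express it.

```python
word = "Stra\u00dfe"  # "Straße", 6 characters

upcased = word.upper()      # string-level upper-casing
folded = word.casefold()    # string-level case folding

# The sharp s expands to two characters, so the lengths differ
# from the original string.
original_length = len(word)
upcased_length = len(upcased)
folded_length = len(folded)
```
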
libunistring contains all the necessary functions, including
case-insensitive string comparisons. However, the only string
representations supported by these operations are: UTF-8, UTF-16,
UTF-32, or locale-encoded strings, and for comparisons both strings must
be in the same encoding.
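
To make the same-encoding constraint concrete, here is a hedged Python sketch. It does not call libunistring; the function names `caseless_key` and `caseless_equal` are hypothetical, and the normalize-fold-normalize recipe is one common way to do a caseless match, assumed here for illustration. The point is that two byte buffers in different encodings must first be brought to a common representation before a case-insensitive comparison is meaningful.

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Build a comparison key: normalize, case-fold, normalize again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

def caseless_equal(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Compare two encoded strings case-insensitively.

    Both inputs are decoded to a common representation first; comparing
    the raw bytes of differently encoded strings would be meaningless.
    """
    return caseless_key(a.decode(a_encoding)) == caseless_key(b.decode(b_encoding))

utf8_bytes = "Stra\u00dfe".encode("utf-8")
utf16_bytes = "STRASSE".encode("utf-16-le")

raw_bytes_equal = utf8_bytes == utf16_bytes
same_after_decoding = caseless_equal(utf8_bytes, "utf-8", utf16_bytes, "utf-16-le")
```
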
I'm aware that this proposal will be very controversial, but starting in
Guile 2.2, I think we ought to consider storing strings internally in
UTF-8, as is done in Gauche. This would of course make string-ref and
string-set! into O(n) operations. However, I claim that any code that
depends on string-ref and string-set! could be better written
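
For readers unfamiliar with why UTF-8 storage makes string-ref an O(n) operation, the following Python sketch may help. It is an illustration only, not Guile's or Gauche's implementation, and `utf8_char_at` is a hypothetical name. Because characters occupy between one and four bytes, finding the k-th character means walking over every character before it.

```python
def utf8_char_at(buf: bytes, index: int) -> str:
    """Return the character at position `index` in a UTF-8 buffer.

    The scan starts at the first byte and skips one character at a time,
    so the cost grows linearly with `index`.
    """
    if index < 0:
        raise IndexError("negative index")
    pos = 0
    remaining = index
    while pos < len(buf):
        lead = buf[pos]
        # The lead byte tells us how many bytes this character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if remaining == 0:
            return buf[pos:pos + width].decode("utf-8")
        remaining -= 1
        pos += width
    raise IndexError("index out of range")

data = "Stra\u00dfe".encode("utf-8")  # 6 characters stored in 7 bytes
fifth_char = utf8_char_at(data, 4)
sixth_char = utf8_char_at(data, 5)
```
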