

Re: [Chicken-users] ditching syntax-case modules for the utf8 egg

From: Tobia Conforto
Subject: Re: [Chicken-users] ditching syntax-case modules for the utf8 egg
Date: Tue, 18 Mar 2008 21:21:38 +0100

John Cowan wrote:
If we all lived in a UTF-8/LF world exclusively, then that would be fine. As it is, many of us are not in that world at all, and few of us are in it exclusively. So in practice it is necessary to convert between internal and external encodings anyhow, which involves copying in the general case.

Let's see... ASCII is valid UTF-8, so all ASCII external representations wouldn't need any encoding or decoding work. Most recent formats and protocols require or strongly recommend UTF-8 (see XML etc.), so those wouldn't need any encoding/decoding either. As far as internal representations covering all of Unicode go, UTF-8 looks like the one incurring the least overhead in the general case. Not to mention the least work on the developer side, as we already have the utf8 egg!
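To make the first point concrete, here is a quick sketch (in Python rather than Chicken Scheme, purely because the property is language-independent) showing that ASCII text is byte-for-byte identical under UTF-8, so no transcoding or copying is needed:

```python
# ASCII text encoded as UTF-8 produces exactly the same bytes as
# encoding it as ASCII: the identity transformation, no copying needed.
ascii_text = "hello, world"
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
assert utf8_bytes == ascii_bytes
print(utf8_bytes)  # b'hello, world'
```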

at the expense of changing the meaning of the string API for all existing applications.

Not the *meaning* of it, just the big-O.

The meaning too. Right now eggs are using string-length to mean character-length in some cases and byte-length in others. With the proposed change they would use string-length for the former and byte-string-length for the latter.
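The character-length vs. byte-length distinction being discussed can be illustrated in a couple of lines (Python here, just for illustration; the proposed Chicken names string-length and byte-string-length would split these two roles):

```python
# One logical string, two different "lengths" once UTF-8 is involved.
s = "naïve"               # 5 characters; 'ï' needs 2 bytes in UTF-8
b = s.encode("utf-8")
print(len(s))  # 5 -- character count (the proposed string-length)
print(len(b))  # 6 -- byte count (the proposed byte-string-length)
```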

"A Lisp programmer is someone who knows the value of everything and the cost of nothing."


I don't know... I personally think UTF-8 is quite efficient, both space- (of course) and time-wise. Random Wikipedia Author seems to concur:

While a fixed number of bytes per code point may seem convenient at first, it isn't really used that much. It makes truncation slightly easier, but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases, since even with a “fixed width” font there may be more than one code point per character position (combining marks) or indeed more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as one unit for editing.
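Both phenomena the Wikipedia passage mentions are easy to demonstrate (sketched in Python via its standard unicodedata module, just to make the point observable):

```python
import unicodedata

# More than one code point per character position: "é" built from a
# base letter plus a combining acute accent.
s = "e\u0301"
print(len(s))  # 2 code points, but one character position on screen
print(unicodedata.combining("\u0301"))  # non-zero: a combining mark

# More than one character position per code point: a CJK ideograph
# is one code point but is typically rendered two columns wide.
print(unicodedata.east_asian_width("\u4e00"))  # 'W' (wide)
```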

The dynamic nature of Scheme suggests that it will all work seamlessly, until someone tries to call a (now Unicode-aware) string-length on a string whose UTF-8 structure had been corrupted with byte-level operations. At which point a runtime error will kindly signal the situation.

In the reverse case, though, the contents of the string will be silently corrupted, as when I change the nth character of an ASCII string to #\U4E00 and the n+1th and n+2th characters are destroyed.

Not really.

Unicode/UTF-8-aware string operations will perform a correct replacement and insert the two extra bytes, if the source string really is plain ASCII. If the source string (or just the part near the change) is not correct UTF-8 or ASCII to begin with, they will raise an error.
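A sketch of both halves of that argument (again in Python for illustration, since the encoding behaviour is the same regardless of language): a character-level replacement with #\U4E00 simply grows the byte representation, while decoding bytes that are not valid UTF-8 signals an error instead of silently corrupting anything:

```python
# Character-level replacement: swap the 2nd character of an ASCII
# string for U+4E00. Neighbouring characters are untouched; only the
# byte representation grows.
s = "abc"
s2 = s[:1] + "\u4e00" + s[2:]
print(len(s2))                  # still 3 characters
print(len(s2.encode("utf-8")))  # 5 bytes: 'a' + 3-byte ideograph + 'c'

# Byte-level damage: invalid UTF-8 raises instead of being silently
# misread as characters.
try:
    b"\xff\xfeabc".decode("utf-8")
except UnicodeDecodeError as e:
    print("raised:", e.reason)
```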


