

Re: [Chicken-users] ditching syntax-case modules for the utf8 egg

From: Tobia Conforto
Subject: Re: [Chicken-users] ditching syntax-case modules for the utf8 egg
Date: Tue, 18 Mar 2008 21:21:38 +0100

John Cowan wrote:
If we all lived in a UTF-8/LF world exclusively, then that would be fine. As it is, many of us are not in that world at all, and few of us are in it exclusively. So in practice it is necessary to convert between internal and external encodings anyhow, which involves copying in the general case.

Let's see... ASCII is valid UTF-8, so all ASCII external representations wouldn't need any encoding or decoding work. Most recent formats and protocols require or strongly recommend UTF-8 (see XML etc.), so those wouldn't need any encoding/decoding either. As far as internal representations covering all of Unicode go, UTF-8 looks like the one incurring the least overhead in the general case. Not to mention the least work on the developer side, as we already have the utf8 egg!
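To make the first point concrete, here is a quick sketch (in Python rather than Chicken Scheme, purely because the property is language-independent) showing that ASCII text is byte-for-byte identical under UTF-8, so no transcoding or copying is needed:

```python
# ASCII text encoded as UTF-8 produces exactly the same bytes as
# encoding it as ASCII: the identity transformation, no copying needed.
ascii_text = "hello, world"
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
assert utf8_bytes == ascii_bytes
print(utf8_bytes)  # b'hello, world'
```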

at the expense of changing the meaning of the string API for all existing applications.

Not the *meaning* of it, just the big-O.

The meaning too. Right now eggs are using string-length to mean character-length in some cases and byte-length in others. With the proposed change they would use string-length for the former and byte-string-length for the latter.
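The character-length vs. byte-length distinction being discussed can be illustrated in a couple of lines (Python here, just for illustration; the proposed Chicken names string-length and byte-string-length would split these two roles):

```python
# One logical string, two different "lengths" once UTF-8 is involved.
s = "naïve"               # 5 characters; 'ï' needs 2 bytes in UTF-8
b = s.encode("utf-8")
print(len(s))  # 5 -- character count (the proposed string-length)
print(len(b))  # 6 -- byte count (the proposed byte-string-length)
```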

"A Lisp programmer is someone who knows the value of everything and the cost of nothing."


I don't know... I personally think UTF-8 is quite efficient, both space- (of course) and time-wise. Random Wikipedia Author seems to concur:

While a fixed number of bytes per code point may seem convenient at first, it isn't really used that much. It makes truncation slightly easier, but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases, since even with a “fixed width” font there may be more than one code point per character position (combining marks) or indeed more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as one unit for editing.
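Both phenomena the Wikipedia passage mentions are easy to demonstrate (sketched in Python via its standard unicodedata module, just to make the point observable):

```python
import unicodedata

# More than one code point per character position: "é" built from a
# base letter plus a combining acute accent.
s = "e\u0301"
print(len(s))  # 2 code points, but one character position on screen
print(unicodedata.combining("\u0301"))  # non-zero: a combining mark

# More than one character position per code point: a CJK ideograph
# is one code point but is typically rendered two columns wide.
print(unicodedata.east_asian_width("\u4e00"))  # 'W' (wide)
```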

The dynamic nature of Scheme suggests that it will all work seamlessly, until someone tries to call a (now Unicode-aware) string-length on a string whose UTF-8 structure had been corrupted with byte-level operations. At which point a runtime error will kindly signal the situation.

In the reverse case, though, the contents of the string will be silently corrupted, as when I change the nth character of an ASCII string to #\U4E00 and the n+1th and n+2th characters are destroyed.

Not really.

Unicode/UTF-8-aware string operations will perform a correct replacement and insert the two extra bytes, if the source string really is plain ASCII. If the source string (or just the part near the change) is not correct UTF-8 or ASCII to begin with, they will raise an error.
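A sketch of both halves of that argument (again in Python for illustration, since the encoding behaviour is the same regardless of language): a character-level replacement with #\U4E00 simply grows the byte representation, while decoding bytes that are not valid UTF-8 signals an error instead of silently corrupting anything:

```python
# Character-level replacement: swap the 2nd character of an ASCII
# string for U+4E00. Neighbouring characters are untouched; only the
# byte representation grows.
s = "abc"
s2 = s[:1] + "\u4e00" + s[2:]
print(len(s2))                  # still 3 characters
print(len(s2.encode("utf-8")))  # 5 bytes: 'a' + 3-byte ideograph + 'c'

# Byte-level damage: invalid UTF-8 raises instead of being silently
# misread as characters.
try:
    b"\xff\xfeabc".decode("utf-8")
except UnicodeDecodeError as e:
    print("raised:", e.reason)
```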


