Re: [Chicken-users] ditching syntax-case modules for the utf8 egg


From: Shawn Rutledge
Subject: Re: [Chicken-users] ditching syntax-case modules for the utf8 egg
Date: Tue, 18 Mar 2008 14:47:19 -0700

On Tue, Mar 18, 2008 at 1:53 PM, John Cowan <address@hidden> wrote:
>  > Let's see... ASCII is valid UTF-8, so all ASCII external
>  > representations wouldn't need any encoding or decoding work.

That is a huge advantage.  I think unless there are some
insurmountable gotchas, or it causes major efficiency problems, there
are some good arguments for using UTF-8 for strings in Chicken.
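
Just to make that concrete: every byte below #x80 means exactly the
same thing in ASCII and in UTF-8, so a pure-ASCII string is already
valid UTF-8 and needs no transcoding at all.  Checking for that case
is a single scan (a rough sketch, nothing Chicken-specific assumed;
it treats string-ref as returning raw bytes, which is what Chicken
does today):

(define (ascii-only? s)
  ;; #t if every byte is below #x80, i.e. the string is both valid
  ;; ASCII and valid UTF-8 without any conversion
  (let loop ((i 0))
    (or (= i (string-length s))
        (and (< (char->integer (string-ref s i)) #x80)
             (loop (+ i 1))))))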

>  True.  However, pure ASCII is less comment than people believe, as
>  indicated by the 59K Google hits for "8-bit ASCII".

Less common, you mean?  I think ASCII is still the most common
representation for everything.  The popularity of XML goes to show
what pains people are taking to make data human-readable.  (I disagree
with the need for that a lot of the time, but whatever.)  Source code
written by non-English speakers is usually ASCII nevertheless.  (It
must be harder to learn a programming language when you don't know
what the keywords mean in English.)  My favorite editor, SciTE, BTW
supports UTF-8 nicely... it preserves the BOM if it is there, assumes
ASCII if it is not, and can be told to switch to UTF-8 mode if the
file does not have a BOM but actually is UTF-8... then when you write
the file it prepends the BOM.  All exactly as it should be.

I am seeing fewer web pages in other 8-bit codepages (like KOI8-R,
CP1251, etc.) than there used to be, and/or modern browsers are doing
a better job of detecting the codepage and making it transparent
anyway.  On one hand it was nice to pick your language and still have
8-bit strings.  OTOH it was really messy having 4 or so code pages to
choose from for Cyrillic (2 of which were used a lot); and it's also
nice to be able to mix languages, insert the Euro symbol into any
string, etc.  ID3 tags in MP3s lagged for a long time... Russian MP3s
tend to have CP1251 (with no way to declare that that's what it is,
you just have to know), but now UTF-8 can be used there too.

>  > Most recent formats and protocols require or strongly recommend UTF-8
>  > (see XML etc.) so those wouldn't need any encoding/decoding either.
>
>  Well, there's an awful lot of content on the Internet and on local hard
>  disks that is neither true ASCII nor UTF-8.  In particular, UTF-16 is
>  the usual representation of Unicode on Windows, and various non-Unicode
>  character sets are the usual representation of text on Windows, and
>  consequently on the Web too.  UTF-8 is something of an oddity there.

I disagree.  Text and HTML files you may find lying about on hard
drives and web servers all over the world tend to be either ASCII or
UTF-8, as far as I've seen.  Windows programs may use UTF-16 for
string variables in memory, and maybe for serialization to "binary"
files, but not for files that are meant to be human-readable.

>  I'm fine with using UTF-8 as our internal representation.

Sounds good to me.

>  > Unicode/UTF8-aware string operations will perform a correct
>  > replacement and insert the two extra bytes, if the source string
>  > really is plain ASCII.

Insertion has a linear cost though, because the string is a contiguous
array, right?
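
For example (a rough sketch with a made-up helper, not anything the
utf8 egg actually provides): replacing an ASCII character at byte
offset k with the three-byte Euro sign grows the string, so everything
after the replacement point has to be copied into a fresh string:

(define (utf8-replace-ascii-char s k replacement)
  ;; replacement is an already-UTF-8-encoded string, e.g. the Euro
  ;; sign (bytes #xE2 #x82 #xAC); since the result is longer than the
  ;; original, it has to be a fresh copy -- O(n) in the string length
  (string-append (substring s 0 k)
                 replacement
                 (substring s (+ k 1) (string-length s))))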

This is probably the reason Java sidestepped the issue by specifying
that strings are immutable.  In a fairly pure functional language that
policy would make sense too (you can modify a string only by copying
it and throwing the old one away, so you see more clearly what the
operations you are doing actually cost), but we can't go breaking
existing programs, can we...

So char has to be 16 or 32 bits, right?  (Depending on how much of
Unicode we wish to support... 16 bits is almost always enough.)  When
you do string-ref on a UTF-8 string it will need to return the Unicode
character at that character index, not the byte at that byte offset,
right?  Then unfortunately you have to iterate over the string in
order to count characters; you can't just do an offset from the
beginning.  (This is where UTF-16 as an in-memory representation has
an advantage, at least if you ignore surrogate pairs.)
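
Something like this sketch is what I mean (plain Scheme, again
treating string-ref as returning raw bytes): to find the Nth character
you have to walk over the lead bytes from the start of the string, so
the lookup is O(n) in the character index:

(define (utf8-index->byte-offset s n)
  ;; returns the byte offset of the Nth character (0-based) in a
  ;; well-formed UTF-8 string by skipping over whole sequences
  (let loop ((byte 0) (chars 0))
    (cond ((= chars n) byte)
          ((>= byte (string-length s)) (error "index out of range" n))
          (else
           (let ((b (char->integer (string-ref s byte))))
             (loop (+ byte (cond ((< b #x80) 1)   ; ASCII
                                 ((< b #xE0) 2)   ; 110xxxxx lead byte
                                 ((< b #xF0) 3)   ; 1110xxxx lead byte
                                 (else 4)))       ; 11110xxx lead byte
                   (+ chars 1)))))))

A character-level string-ref would then decode the sequence starting
at that offset.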

For Display Scheme I was planning to assume all strings are UTF-8, so
this change will make things nice and consistent.  But I had to
convert on-the-fly to 16-bit Unicode to render characters with
FreeType.  (Not a big deal because I did the rendering 1 glyph at a
time anyway.)

http://dscm.svn.sourceforge.net/viewvc/dscm/src/g2d-fb16-impl.c?revision=67&view=markup
line 761  (sorry that code isn't very presentable yet and needs some
modularization)
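
In Scheme terms that conversion is just a decoder that takes a byte
offset, reads one UTF-8 sequence, and hands back the code point plus
the offset of the next character -- something along these lines (a
sketch, not the actual code at the URL above; 4-byte sequences are
left out since they encode code points above #xFFFF, which don't fit
in 16 bits anyway):

(define (utf8-decode-char s i)
  ;; decode the UTF-8 sequence starting at byte offset i; returns two
  ;; values: the code point and the offset of the following character
  (define (byte j) (char->integer (string-ref s j)))
  (let ((b0 (byte i)))
    (cond ((< b0 #x80)                        ; 0xxxxxxx -- plain ASCII
           (values b0 (+ i 1)))
          ((< b0 #xE0)                        ; 110xxxxx 10xxxxxx
           (values (+ (* (- b0 #xC0) 64)
                      (- (byte (+ i 1)) #x80))
                   (+ i 2)))
          (else                               ; 1110xxxx 10xxxxxx 10xxxxxx
           (values (+ (* (- b0 #xE0) 4096)
                      (* (- (byte (+ i 1)) #x80) 64)
                      (- (byte (+ i 2)) #x80))
                   (+ i 3))))))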

Alternative string representations could live in an egg (16-bit
Unicode, 32-bit Unicode, string-plus-codepage, EBCDIC, or whatever
:-).  When doing in-place modifications on strings that actually
contain non-ASCII characters, a fixed-width Unicode representation is
more efficient, so it would be nice to be able to switch to that
representation when it has advantages (like Windows does for string
variables).
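
For instance, a 32-bit representation could just be a vector of code
points: indexing and in-place replacement become O(1), at the cost of
four bytes per character.  A toy sketch (made-up names, not a proposal
for an actual egg API; utf8-decode-char is the decoder sketched above):

(define (ucs4-string-ref v i)     (vector-ref v i))
(define (ucs4-string-set! v i cp) (vector-set! v i cp))

(define (utf8->ucs4 s)
  ;; walk the UTF-8 bytes once, collecting code points into a vector
  (let loop ((i 0) (acc '()))
    (if (= i (string-length s))
        (list->vector (reverse acc))
        (call-with-values
            (lambda () (utf8-decode-char s i))
          (lambda (cp next) (loop next (cons cp acc)))))))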



