
Re: [Chicken-users] ditching syntax-case modules for the utf8 egg


From: John Cowan
Subject: Re: [Chicken-users] ditching syntax-case modules for the utf8 egg
Date: Tue, 18 Mar 2008 23:56:48 -0400
User-agent: Mutt/1.5.13 (2006-08-11)

Shawn Rutledge scripsit:

> But you would want the usual string operations to work with either
> kind of string, right?  

Indeed.

> It could follow from the general principle of separating metadata from
> data: Put the encoding in the extended attributes of the file, or
> resource fork if you've got one.  

Specifically, the 8-BOM interferes with the ability of ASCII-aware but
8-bit clean programs to treat UTF-8 the same as ASCII.  When they expect
to see something specific (like #!) at the beginning, they see the 8-BOM
instead and barf.
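
To make the failure concrete, here is a minimal sketch in Python of the kind of byte-level check an ASCII-aware, 8-bit-clean tool might perform. The helper name `starts_with_shebang` and the sample script text are made up for illustration; only the `utf-8-sig` codec and `codecs.BOM_UTF8` come from the standard library.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """Hypothetical byte-level check: does the file begin with '#!'?"""
    return data.startswith(b"#!")

# Illustrative script text containing one non-ASCII character.
script = '#!/usr/bin/csi -script\n(print "h\u00e9llo")\n'

plain = script.encode("utf-8")         # UTF-8 without a BOM
with_bom = script.encode("utf-8-sig")  # same text, preceded by EF BB BF

print(starts_with_shebang(plain))         # True
print(starts_with_shebang(with_bom))      # False
print(with_bom[:3] == codecs.BOM_UTF8)    # True
```

The BOM-less file passes the check even though it contains non-ASCII text further in, which is the "treat UTF-8 the same as ASCII" property; the three BOM bytes in front are what defeat it.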

I'm all in favor of the 16-BOM, where there are no such issues, and
it also serves to reliably flag UTF-16/UCS-2 and to allow for variable
endianness.  Same with the 32-BOM, if anyone bothers to use UTF-32 for
interchange.
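
As a rough sketch of how a reader could use those marks, the Python below sniffs a 16- or 32-bit BOM and picks the matching byte order. The names `sniff_bom` and `decode_with_bom` and the UTF-8 fallback are assumptions for illustration; the BOM constants are the ones in the standard `codecs` module.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it has to be tested before it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a recognised BOM and decode; otherwise assume `default`."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)

big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(sniff_bom(big))           # ('utf-16-be', 2)
print(decode_with_bom(little))  # hi
```

One caveat with this ordering: UTF-16 LE text whose first character after the BOM is U+0000 would be taken for UTF-32 LE. That is rare in practice, but a stricter reader would need out-of-band information to rule it out.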

> I thought it was still a reasonable assumption most of the time,

Except when it isn't.  ASCII is a reasonable assumption most of the time,
except when it isn't.

> Or have 4 types of strings: byte (restricted strings), UTF-8, and
> fixed-char-size 16- and 24-bit strings.  

Check out http://larceny.ccs.neu.edu/larceny-trac/wiki/StringRepresentations ,
then let's talk, if there's anything left to talk about.  :-)

-- 
We are lost, lost.  No name, no business, no Precious, nothing.  Only empty.
Only hungry: yes, we are hungry.  A few little fishes, nassty bony little
fishes, for a poor creature, and they say death.  So wise they are; so just,
so very just.  --Gollum        http://ccil.org/~cowan



