Wide strings status
Mon, 20 Apr 2009 19:11:48 -0700
OK. I've uploaded a "string-abstraction" branch so that you can see
what I've been doing over the last couple of months. Currently, I do
have a version of Guile that uses Unicode codepoints for characters.
The C representation of chars was changed to scm_t_uint32 throughout the
Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
strings or as "wide" UTF-32 strings. Strings are usually created as
narrow strings. Narrow strings get automatically widened to wide
strings if non-8-bit characters are set! or appended to them.
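To illustrate the narrow/wide idea, here is a minimal C sketch of a string that starts out 8-bit and widens itself when a character above 0xFF is stored. The demo_* names and the struct layout are made up for this example; they are assumptions, not the code in the branch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical string type for illustration only; not Guile's layout. */
typedef struct {
    size_t len;
    int is_wide;              /* 0: 8-bit ISO-8859-1, 1: UTF-32 */
    union {
        uint8_t  *narrow;
        uint32_t *wide;
    } buf;
} demo_string;

/* Create a narrow string of LEN characters, each set to FILL. */
demo_string *demo_make_string(size_t len, uint8_t fill)
{
    demo_string *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->len = len;
    s->is_wide = 0;
    s->buf.narrow = malloc(len ? len : 1);
    if (s->buf.narrow == NULL) {
        free(s);
        return NULL;
    }
    for (size_t i = 0; i < len; i++)
        s->buf.narrow[i] = fill;
    return s;
}

/* Copy the 8-bit buffer into a 32-bit one.  ISO-8859-1 bytes map
   directly onto the first 256 codepoints, so no table is needed.
   Returns 0 on success, -1 if allocation fails. */
int demo_widen(demo_string *s)
{
    assert(s != NULL);
    if (s->is_wide)
        return 0;
    uint32_t *w = malloc((s->len ? s->len : 1) * sizeof *w);
    if (w == NULL)
        return -1;
    for (size_t i = 0; i < s->len; i++)
        w[i] = s->buf.narrow[i];
    free(s->buf.narrow);
    s->buf.wide = w;
    s->is_wide = 1;
    return 0;
}

/* Store codepoint C at index I, widening first if C needs more than 8 bits. */
int demo_string_set(demo_string *s, size_t i, uint32_t c)
{
    assert(s != NULL && i < s->len);
    if (!s->is_wide && c > 0xFF && demo_widen(s) != 0)
        return -1;
    if (s->is_wide)
        s->buf.wide[i] = c;
    else
        s->buf.narrow[i] = (uint8_t) c;
    return 0;
}

/* Read the codepoint at index I without exposing the representation. */
uint32_t demo_string_ref(const demo_string *s, size_t i)
{
    assert(s != NULL && i < s->len);
    return s->is_wide ? s->buf.wide[i] : s->buf.narrow[i];
}

void demo_string_free(demo_string *s)
{
    if (s == NULL)
        return;
    if (s->is_wide)
        free(s->buf.wide);
    else
        free(s->buf.narrow);
    free(s);
}
```

In this sketch, callers only go through demo_string_set and demo_string_ref, so they never need to know which representation is in use.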
Outside of the core strings module and srfi-13, a set of methods is
used to access strings. I did my best to keep the internal
representation of strings isolated to those two modules. This means
that almost every instance of the pervasive scm_i_string_chars() was
The machine-readable "write" form of strings has been changed. Before,
non-printable characters were given as hex escapes, for example \xFF.
Now there are three levels of hex escape for 8-, 16-, and 24-bit
characters: \xFF, \uFFFF, \UFFFFFF. This is a pretty common convention.
But after I coded this, I noticed that R6RS has a different convention
and I'll probably go with that.
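The three-level escape scheme described above could be sketched in C roughly as follows. The function name demo_format_escape is hypothetical, and this covers only the \x, \u, \U convention from this message, not the R6RS one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write the escape for codepoint C into OUT
   (capacity N bytes, including the terminating NUL).  Picks the
   shortest of the three forms that can hold C.  Returns the number
   of characters the escape needs, as snprintf does. */
int demo_format_escape(char *out, size_t n, uint32_t c)
{
    assert(out != NULL);
    if (c <= 0xFF)
        return snprintf(out, n, "\\x%02X", (unsigned) c);   /* 8-bit  */
    if (c <= 0xFFFF)
        return snprintf(out, n, "\\u%04X", (unsigned) c);   /* 16-bit */
    return snprintf(out, n, "\\U%06X", (unsigned) c);       /* 24-bit */
}
```

Each form has a fixed number of hex digits, so a reader can tell where the escape ends without needing a terminator.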
The internal representation of strings seems to work already, but the
reader doesn't work yet. For now, one can make wide strings like this:
> (setlocale LC_ALL "")
> (define str (apply string (map integer->char '(100 200 300 400 500))))
> (write str)
This is all going to be slower than before because of the string
conversion operations, but I didn't want to do any premature
optimization. I wanted to get it working first; there is plenty
of room for optimization later.
Anyway, if, code-wise, it is agreed that I'm generally on the right
track, the next steps are these:
Write a plethora of unit tests on what has been accomplished so far.
Character sets need to be modified to have more than 256 entries.
Character encoding needs to be a property of ports, so that not all
string operations are done in the current locale. This is necessary so
that UTF-8-encoded source files are not interpreted differently based on
the current locale.
For programs that have been abusing strings to hold binary data,
some accommodation needs to be made. Maybe make a "binary" locale.
The VM and interpreter need to be updated to deal with wide chars, and
probably in other ways that are unclear to me now. Wide strings are
currently getting truncated to 8-bit somewhere in there.
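For the character-set step above, one possible approach (only a sketch, not a decision) is to replace a 256-entry table with a sorted list of inclusive codepoint ranges and test membership by binary search. The demo_range and demo_charset types are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: a charset is a sorted array of non-overlapping,
   inclusive codepoint ranges. */
typedef struct { uint32_t lo, hi; } demo_range;
typedef struct { const demo_range *ranges; size_t n; } demo_charset;

/* Return 1 if codepoint C falls inside one of the ranges, else 0.
   Binary search relies on the ranges being sorted and disjoint. */
int demo_charset_contains(const demo_charset *cs, uint32_t c)
{
    assert(cs != NULL);
    size_t lo = 0, hi = cs->n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < cs->ranges[mid].lo)
            hi = mid;            /* C lies before this range */
        else if (c > cs->ranges[mid].hi)
            lo = mid + 1;        /* C lies after this range */
        else
            return 1;            /* C is inside this range */
    }
    return 0;
}
```

A range list stays small for typical sets (digits, letters, whitespace) while still covering the full codepoint space, which a fixed 256-entry table cannot do.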
- Wide strings status, Mike Gran