Re: Wide strings status

guile-devel

From:	Ludovic Courtès
Subject:	Re: Wide strings status
Date:	Tue, 21 Apr 2009 23:37:35 +0200
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.0.90 (gnu/linux)

Hello!

Mike Gran <address@hidden> writes:

> Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
> strings or as "wide" UTF-32 strings.  Strings are usually created as
> narrow strings.  Narrow strings get automatically widened to wide
> strings if non-8-bit characters are set! or appended to them.

Great!
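
To make the widening rule concrete, here is a minimal sketch that models it from the Scheme side. `required-width` is a hypothetical helper, not part of Guile; it only reports which of the two representations a string's contents would need, assuming a Scheme whose characters cover all of Unicode.

```scheme
;; Hypothetical helper (not a Guile procedure): report whether a string's
;; contents fit the "narrow" 8-bit form or would need the "wide" form.
(define (required-width str)
  (let loop ((chars (string->list str)))
    (cond ((null? chars) 'narrow)                          ; nothing above 8 bits seen
          ((> (char->integer (car chars)) 255) 'wide)      ; one wide char is enough
          (else (loop (cdr chars))))))

;; U+0064 (d) and U+00C8 (È) both fit in 8 bits.
(define s1 (string #\d (integer->char 200)))

;; Appending U+012C (Ĭ) pushes the result past the 8-bit range.
(define s2 (string-append s1 (string (integer->char 300))))

(required-width s1)   ; narrow
(required-width s2)   ; wide
```

The real check happens inside the string implementation in C; this only mirrors the decision it has to make.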

> The machine-readable "write" form of strings has been changed.  Before,
> non-printable characters were given as hex escapes, for example \xFF.
> Now there are three levels of hex escape for 8, 16, and 24 bit
> characters: \xFF, \uFFFF, \UFFFFFF.  This is a pretty common convention.
> But after I coded this, I noticed that R6RS has a different convention
> and I'll probably go with that.

OK.  I think it's probably good to follow R6RS when it has something to
say.
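
For comparison, a rough sketch of the two escape conventions side by side. The procedure names are made up for illustration. The first follows the three-level scheme described above; the second follows the R6RS string syntax, which writes a hex scalar value of any length terminated by a semicolon.

```scheme
;; Left-pad a hex string with zeros up to WIDTH digits.
(define (pad-hex hex width)
  (string-append (make-string (max 0 (- width (string-length hex))) #\0)
                 hex))

;; Three-level convention from the quoted text: \xFF, \uFFFF, \UFFFFFF.
(define (escape-three-level c)
  (let* ((n (char->integer c))
         (hex (number->string n 16)))
    (cond ((< n #x100)   (string-append "\\x" (pad-hex hex 2)))
          ((< n #x10000) (string-append "\\u" (pad-hex hex 4)))
          (else          (string-append "\\U" (pad-hex hex 6))))))

;; R6RS convention: \x<hex scalar value>; with no fixed digit count.
(define (escape-r6rs c)
  (string-append "\\x" (number->string (char->integer c) 16) ";"))

;; For U+012C, display shows \u012c for the first and \x12c; for the second.
(display (escape-three-level (integer->char 300)))
(newline)
(display (escape-r6rs (integer->char 300)))
(newline)
```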

> The internal representation of strings seems to work already, but, the
> reader doesn't work yet.  For now, one can make wide strings like this:
>
>> (setlocale LC_ALL "")
> ==> "en_US.UTF-8"
>
>> (define str (apply string (map integer->char '(100 200 300 400 500))))
>
>> (write str)
> ==> "d\xc8\u012c\u0190\u01f4"
>
>> (display str)
> ==> dÈĬƐǴ

Eh eh, looks nice.  Looking forward to typing `(λ (x y) (+ x y))'.  ;-)

> This is all going to be slower than before because of the string
> conversion operations, but, I didn't want to do any premature
> optimization.  First, I wanted to get it working, but, there is plenty
> of room for optimization later.

Good.  Maybe it'd be nice to add simple micro-benchmarks for
`string-ref', `string-set!' et al. under `benchmarks'.
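
Something along these lines could serve as a starting point. It is a standalone sketch, not written against the existing benchmark harness; `time-thunk`, `ref-all` and `set-all` are hypothetical names, and the string length and iteration count are arbitrary.

```scheme
;; Run THUNK ITERATIONS times and return the elapsed wall-clock seconds.
(define (time-thunk thunk iterations)
  (let ((start (get-internal-real-time)))
    (do ((i 0 (+ i 1))) ((= i iterations)) (thunk))
    (/ (- (get-internal-real-time) start)
       (exact->inexact internal-time-units-per-second))))

;; One string that fits in 8 bits per character, one that does not.
(define narrow (make-string 1000 #\a))
(define wide   (make-string 1000 (integer->char 955)))   ; U+03BB, λ

;; Thunk that reads every character of STR once.
(define (ref-all str)
  (lambda ()
    (do ((i 0 (+ i 1))) ((= i (string-length str)))
      (string-ref str i))))

;; Thunk that overwrites every character of STR with C.
(define (set-all str c)
  (lambda ()
    (do ((i 0 (+ i 1))) ((= i (string-length str)))
      (string-set! str i c))))

(for-each
 (lambda (label thunk)
   (display label)
   (display ": ")
   (display (time-thunk thunk 1000))
   (display " s")
   (newline))
 '("string-ref narrow" "string-ref wide" "string-set! narrow" "string-set! wide")
 (list (ref-all narrow)
       (ref-all wide)
       (set-all narrow #\b)
       (set-all wide (integer->char 956))))
```

A further case worth timing would be a `string-set!' that forces a narrow string to widen, since that is where the conversion cost shows up.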

> Character encoding needs to be a property of ports, so that not all
> string operations are done in the current locale.  This is necessary so
> that UTF-8-encoded source files are not interpreted differently based on
> the current locale.

You seem to imply that `scm_getc ()' will now return a Unicode
codepoint, is that right?  What about `scm_c_{read,write} ()', and
`scm_{get,put}s ()'?
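
To illustrate why the encoding has to travel with the port, here is a small sketch that decodes the same bytes under two encodings. Both decoders are hypothetical helpers working on lists of integers, not Guile port procedures, and the UTF-8 one assumes well-formed input and does no validation.

```scheme
;; ISO-8859-1: every byte is its own code point.
(define (decode-latin1 bytes)
  (map integer->char bytes))

;; UTF-8: the lead byte determines how many continuation bytes follow.
(define (decode-utf8 bytes)
  (let loop ((bs bytes) (acc '()))
    (if (null? bs)
        (reverse acc)
        (let ((b (car bs)))
          (cond ((< b #x80)                       ; 1 byte, plain ASCII
                 (loop (cdr bs) (cons (integer->char b) acc)))
                ((< b #xE0)                       ; 2-byte sequence
                 (loop (cddr bs)
                       (cons (integer->char
                              (+ (* (- b #xC0) 64)
                                 (- (cadr bs) #x80)))
                             acc)))
                ((< b #xF0)                       ; 3-byte sequence
                 (loop (cdddr bs)
                       (cons (integer->char
                              (+ (* (- b #xE0) 4096)
                                 (* (- (cadr bs) #x80) 64)
                                 (- (caddr bs) #x80)))
                             acc)))
                (else                             ; 4-byte sequence
                 (loop (cddddr bs)
                       (cons (integer->char
                              (+ (* (- b #xF0) 262144)
                                 (* (- (cadr bs) #x80) 4096)
                                 (* (- (caddr bs) #x80) 64)
                                 (- (cadddr bs) #x80)))
                             acc))))))))

;; The two bytes a UTF-8 source file contains for λ.
(define lambda-bytes '(#xCE #xBB))

(map char->integer (decode-utf8 lambda-bytes))     ; (955), one code point
(map char->integer (decode-latin1 lambda-bytes))   ; (206 187), two characters
```

A reader that picked the second interpretation because of the user's locale would see two characters where the file's author typed one.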

> The VM and interpreter need to be updated to deal with wide chars and
> probably in other ways that are unclear to me now.  Wide strings are
> currently getting truncated to 8-bit somewhere in there.

The compiler could use bytevectors when dealing with bytecode.  Maybe
that would clarify things.
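
A tiny sketch of what that buys, assuming the R6RS bytevector API is available as `(rnrs bytevectors)'. The byte values below are made up and do not correspond to real instructions.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical instruction stream; the values are arbitrary octets.
(define code (u8-list->bytevector '(1 200 255 0)))

;; Elements come back as exact integers in the range 0..255, so no
;; character width or encoding is involved at any point.
(bytevector-length code)       ; 4
(bytevector-u8-ref code 1)     ; 200
(bytevector->u8-list code)     ; (1 200 255 0)
```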

Thanks,
Ludo'.
