guile-devel

Re: Wide strings


From: Clinton Ebadi
Subject: Re: Wide strings
Date: Wed, 28 Jan 2009 15:44:25 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)

Mike Gran <address@hidden> writes:

> Hi,
>
> Let's say that one possible goal is to add wide strings 
> * using Gnulib functions 
> * with minimal changes to the public Guile API 
> * where chars become 4-byte codepoints and strings are internally 
>  either UTF-32 or ISO-8859-1
>
> Since I need this functionality taken care of, and since I have some
> time to play with it, what's the procedure here? Should I mock
> something up and submit it as a patch?  If I did, it would likely be 
> a big patch.  Do we need to talk more about what needs to be
> accomplished?  Do we need a complete specification?  Do we need
> a vote on if it is a good idea?

You should take a look at Common Lisp strings[0] and streams[1]. The
gist is that a string is a uniform array of some subtype of
`character'[2], and character streams have an
:external-format--character data is converted to/from that format when
writing/reading the stream. Guile should be a bit more opaque and just
specify the string as being an ordered sequence of characters, and
provide conversion functions to/from uniform byte[^0] arrays in some
explicitly specified encoding.

The `scm_{to|from}_locale_string' functions provide enough abstraction
to make this doable without breaking anything that doesn't use
`scm_take_locale_string' (and even then Guile can detect when the locale
is not UCS-4, revert to `scm_from_locale_string' and `free' the taken
string immediately after conversion). This could be enhanced with
`scm_{to|from}_encoded_string ({char*|SCM} string, enum encoding)'
functions.
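
To make that concrete, the new entry points might look something like
this (purely illustrative--the enum and the names are made up, not
existing Guile API):

    #include <libguile.h>

    enum scm_encoding {
      SCM_ENC_ISO_8859_1,
      SCM_ENC_UTF_8,
      SCM_ENC_UCS_4
    };

    /* Convert a Scheme string to a freshly malloc'd C string in the
       requested encoding; the caller frees it.  */
    char *scm_to_encoded_string (SCM str, enum scm_encoding enc);

    /* Build a Scheme string from a C buffer known to be in ENC.  */
    SCM scm_from_encoded_string (const char *buf, size_t len,
                                 enum scm_encoding enc);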

> Pragmatically, I see that this can be broken up into three steps.
> (Not for public use.  Just as programming subtasks.)
>
> 1.  Convert the internal char and string representation to be 
> explicitly ISO 8859-1.  Add the to/from locale conversion functionality
> while still retaining 8-bit strings.  Replace C library funcs with 
> Gnulib string funcs where appropriate.

Initially, I would suggest just using UCS-4 internally and iconv[3] to
handle conversion to/from the locale-dependent encodings for
C. Converting to an external encoding within `scm_to_{}_string' has
minimal overhead really--the stringbuf has to be copied anyway
(likewise for `scm_from_{}_string'). If you are writing the externally
encoded string to a stream, it is even cheaper--no memory need be
allocated during conversion.
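
Roughly, the conversion path would look like this (a sketch only:
error handling elided, host assumed little-endian, and the helper name
is made up):

    #include <iconv.h>
    #include <langinfo.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Convert a UCS-4 stringbuf to the locale's encoding; returns a
       malloc'd buffer that the caller frees.  */
    static char *
    ucs4_to_locale (const uint32_t *src, size_t nchars)
    {
      /* "UCS-4LE" assumes a little-endian host; real code would pick
         the variant matching the host byte order.  */
      iconv_t cd = iconv_open (nl_langinfo (CODESET), "UCS-4LE");
      char *in = (char *) src;
      size_t inleft = nchars * 4;
      size_t outsize = nchars * 6 + 1;  /* generous worst case */
      char *out = malloc (outsize);
      char *outp = out;
      size_t outleft = outsize - 1;

      iconv (cd, &in, &inleft, &outp, &outleft);
      *outp = '\0';
      iconv_close (cd);
      return out;
    }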

I think it is acceptable to restrict the encoding of the string passed
to `scm_take_string'. If you are constructing strings that Guile can
take possession of, you probably have a bit of control over the
encoding; if you don't, generating a string and throwing it away more
or less immediately is still pretty cheap if malloc doesn't suck.
Adding a `scm_take_encoded_string' and removing the guarantee from
`scm_take_locale_string' that Guile will not copy the string seems to
be all that is needed to make taking strings work more or less
transparently.
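
In code, the take-string fallback could be as simple as this
(hypothetical names throughout; `scm_wrap_stringbuf' stands in for
whatever internally takes ownership of a buffer):

    SCM
    scm_take_encoded_string (char *str, size_t len,
                             enum scm_encoding enc)
    {
      if (enc == SCM_INTERNAL_ENCODING)
        /* Already in our internal encoding: take ownership.  */
        return scm_wrap_stringbuf (str, len);

      /* Wrong encoding: convert (which copies) and immediately free
         the original, as described above.  */
      SCM s = scm_from_encoded_string (str, len, enc);
      free (str);
      return s;
    }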

> 2.  Convert the internal representation of chars to 4-byte 
> codepoints, while still retaining 8-bit strings.
>
> 3.  Convert strings to be a union of 1 byte and 4 byte chars.

After getting a basic implementation done using a fixed-width internal
encoding, rather than doing something like this it seems better to
make the internal encoding flexible.

Basically `make-string' would be extended with an :internal-encoding
argument, or a new `make-string-with-internal-encoding' (with a better
name perhaps) introduced to explicitly specify the internal encoding
the application desires. 
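
At the C level that might surface as a single extra constructor
(hypothetical name; the keyword-argument spelling is just one way the
Scheme-level call could look):

    /* Scheme-level: (make-string 10 #\a #:internal-encoding 'utf-8) */
    SCM scm_make_string_with_encoding (SCM len, SCM fill, SCM encoding);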

An encoding would be implemented as a protocol of some sort providing
a few primitive operations: conversion to UCS-4[^1], length,
substring, concatenate, indexed ref, and indexed set! seem to be the
minimal set for an optimizable implementation. Indices would have an
unspecified type to allow for fancy internal encodings--e.g. some sort
of tree over UTF-8 chunks that supports fast substring and
concatenation. Allowing an internal encoding to not implement a
destructive set! opens up some interesting optimizations for purely
functional strings (e.g. for representing things like Emacs buffers
using fancy persistent trees that are efficiently updateable and can
maintain an edit history with nearly nil overhead).
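
One way to realize such a protocol at the C level would be a table of
operations per encoding (entirely illustrative, not existing Guile
API):

    #include <libguile.h>
    #include <stdint.h>

    /* Indices are opaque so that fancy encodings (e.g. ropes over
       UTF-8 chunks) can use tree positions rather than flat
       integers.  */
    typedef void *scm_str_idx;

    struct scm_string_encoding {
      /* The one required operation: convert the whole string to
         UCS-4.  */
      uint32_t *(*to_ucs4)   (SCM str, size_t *len);
      size_t    (*length)    (SCM str);
      SCM       (*substring) (SCM str, scm_str_idx from, scm_str_idx to);
      SCM       (*concat)    (SCM a, SCM b);
      uint32_t  (*ref)       (SCM str, scm_str_idx i);
      /* NULL for purely functional encodings; string-set! would then
         signal an error.  */
      void      (*set_x)     (SCM str, scm_str_idx i, uint32_t c);
    };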

Does this seem reasonable?

[0] http://www.lispworks.com/documentation/HyperSpec/Body/16_a.htm
[1] http://www.lispworks.com/documentation/HyperSpec/Body/21_a.htm
[2] http://www.lispworks.com/documentation/HyperSpec/Body/13_a.htm 
[3] http://www.gnu.org/software/libiconv/
[4] http://www.lispworks.com/documentation/HyperSpec/Body/f_by_by.htm
[5] http://www.lispworks.com/documentation/HyperSpec/Body/f_ldb.htm
[6] http://www.lispworks.com/documentation/HyperSpec/Body/f_dpb.htm#dpb

[^0] A `byte'[4] in CL parlance is an arbitrary-width sequence of
     bits; e.g. a /traditional/ byte would be `(byte 8 0)' and a
     32-bit machine word `(byte 32 0)' (`byte' takes a size and a
     position). Unrelatedly, you can do some neat things using these
     arbitrary-width bytes with `ldb'[5]/`dpb'[6].
[^1] Minimally; ideally an internal encoding would be handed any
     format iconv understands and, if possible, convert directly to
     it, falling back to UCS-4 and the default conversion function
     otherwise.

-- 
emacsen: "Like... windows are portals man...
emacsen: Dude... let's yank this shit out of the kill ring"



