Re: Unicode and Guile

From: Tom Lord
Subject: Re: Unicode and Guile
Date: Tue, 11 Nov 2003 17:27:41 -0800 (PST)

    > From: Marius Vollmer <address@hidden>

    > - But is there a fixed-width Unicode representation?  I.e., is UTF-32
    >   just like ASCII only with more bits or is there more to it?  Are
    >   there combining characters in UTF-32?  If there are, then there is
    >   no reason to go looking for a fixed-width, old-style text
    >   representation.

No offense, but you need to do more homework.  It's not easy homework,
either -- hence the "no offense".

Yes, there are combining characters in UTF-32.   UTF-32 is something
called a "character encoding form" and combining characters are
completely orthogonal to encoding forms.
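
To make the point concrete, here is a minimal sketch (Python is used purely
for illustration; nothing here is Guile API). It encodes "e" followed by
COMBINING ACUTE ACCENT in UTF-32: the encoding is fixed-width per code point,
yet one user-visible character still occupies two code points.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT (U+0301): one user-visible character.
decomposed = "e\u0301"
# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# UTF-32 spends four bytes per code point, so the width is fixed per
# code point, not per user-visible character.
decomposed_utf32 = decomposed.encode("utf-32-be")
composed_utf32 = composed.encode("utf-32-be")

print(len(decomposed_utf32) // 4)             # code points, decomposed form
print(len(composed_utf32) // 4)               # code points, composed form
print(unicodedata.combining("\u0301") != 0)   # is U+0301 a combining mark?
```

Both byte strings are valid UTF-32 for what a reader sees as the same
character, which is why the choice of encoding form settles nothing about
combining characters.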

If you can afford it, grab a copy of the Unicode standard.   Or check
on -- for all I know it's freely available these days.

Please read my proposal on this list and c.l.s.  The standard Scheme
character and string types are not sanely unicode-friendly unless
interpreted as rather low-level operations.  It makes more sense to
say that CHAR? values are octets than to say they are "unicode

    > - If we go with a variable width encoding, we can just as well use
    >   UTF-8 and replace strings/chars with something new, like Tom's
    >   texts/graphemes.

It's not quite "replace" but yeah -- where traditionally you'd teach a
newbie to use characters and strings, teach them instead to use the
(subtly different) graphemes and texts.

    > - What kind of data type are strings anyway?  Vectors or lists?
    >   Traditionally, they have been mutable vectors, but variable-width
    >   encoding of 'characters' might force us to rethink this, in general.
    >   People expect constant time accesses for vector-like things, but we
    >   will probably not want to guarantee them for a variable-width
    >   encoding (with integers as indices).

A "vector of octets" is so remarkably useful that Scheme should not
fail to provide it.  CHAR? and STRING? types are compatible with
"octet and vector of octets" but not with Unicode.  So: add TEXT? and
GRAPHEME? for "string processing" and let CHAR? and STRING? be
octet-based types.
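
A rough sketch of the two views, in Python for illustration only. The
`graphemes` helper is hypothetical and deliberately crude: it only attaches
combining marks to the preceding base character, whereas real grapheme
segmentation follows the Unicode text segmentation rules. It is not the
TEXT?/GRAPHEME? interface of the proposal.

```python
import unicodedata

def graphemes(text):
    """Approximate grapheme split (hypothetical helper): glue each
    combining mark onto the base character that precedes it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "resume" with acute accents written as combining marks.
word = "re\u0301sume\u0301"

octets = word.encode("utf-8")   # low-level view: a vector of octets
clusters = graphemes(word)      # high-level view: what a reader would count

print(len(octets), len(word), len(clusters))
```

The three lengths differ (octets, code points, graphemes), which is the
reason for keeping the octet-level types and the text-level types apart.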

I'm as surprised as you are.  I've spent many months assuming that I
wanted CHAR? to be able to hold an arbitrary unicode code point -- but
it simply does not work out.

    > - So the text/grapheme API should maybe be more abstract, and not be
    >   using integers to refer to graphemes contained in texts but some
    >   opaque 'iterator', 'subtext' or 'grapheme range' thing.

It can use integers just fine except that, in the face of mutations to
a "text", integer indexes don't behave well.  So, yeah, there's a need
for "markers" which are an example of "cursors".  I think it's ok,
though, to expose the integers that underlie markers.  They behave
comparably to (what you expect of) the integers in most cases.
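
A minimal sketch of the difference between a bare integer index and a
marker, in Python for illustration. The `Text` and `Marker` classes and
their methods are hypothetical names, not part of any proposal or of Guile.
The text keeps strong references to its markers, which a real
implementation would probably avoid.

```python
class Marker:
    """A position in a text that is adjusted when the text is edited."""
    def __init__(self, index):
        self.index = index   # the underlying integer stays visible

class Text:
    def __init__(self, items):
        self.items = list(items)
        self.markers = []

    def mark(self, index):
        m = Marker(index)
        self.markers.append(m)
        return m

    def insert(self, index, new_items):
        new_items = list(new_items)
        self.items[index:index] = new_items
        # Markers at or after the insertion point move right.
        for m in self.markers:
            if m.index >= index:
                m.index += len(new_items)

    def delete(self, start, end):
        del self.items[start:end]
        for m in self.markers:
            if m.index >= end:
                m.index -= end - start   # marker was after the deleted span
            elif m.index > start:
                m.index = start          # marker was inside it: collapse

t = Text("hello world")
plain = 6             # bare integer index of "w"
marker = t.mark(6)    # marker on the same "w"
t.insert(0, ">> ")

print(t.items[plain])          # the bare index now points elsewhere
print(t.items[marker.index])   # the marker still points at "w"
```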

    > - Shared subtexts or grapheme ranges are easy to do for read-only
    >   texts, but harder for mutable text.  So texts should maybe be
    >   unmutable by default.  Mutable texts and pointers into it might use
    >   a more expensive data structure, like a gap buffer.

I think that a tree structure is better than a gap buffer as the
default implementation option.  Shared subtexts should use markers to
represent their extents.
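
For readers unfamiliar with the idea, here is a sketch of one possible tree
representation (a simple unbalanced rope), in Python for illustration. It is
an assumption about what "a tree structure" could look like, not the design
referred to above; all names are hypothetical and a real implementation
would rebalance.

```python
class Leaf:
    def __init__(self, items):
        self.items = tuple(items)
        self.length = len(self.items)

    def at(self, i):
        return self.items[i]

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

    def at(self, i):
        # Descend by comparing the index with the left subtree's length.
        if i < self.left.length:
            return self.left.at(i)
        return self.right.at(i - self.left.length)

def split(tree, i):
    """Return (first i items, the rest), sharing untouched subtrees."""
    if isinstance(tree, Leaf):
        return Leaf(tree.items[:i]), Leaf(tree.items[i:])
    if i <= tree.left.length:
        a, b = split(tree.left, i)
        return a, Node(b, tree.right)
    a, b = split(tree.right, i - tree.left.length)
    return Node(tree.left, a), b

def insert(tree, i, items):
    """Build a new tree with items inserted at i; the old tree is unchanged."""
    left, right = split(tree, i)
    return Node(Node(left, Leaf(items)), right)

def to_string(tree):
    return "".join(tree.at(i) for i in range(tree.length))

original = Node(Leaf("Unicode "), Leaf("Guile"))
edited = insert(original, 8, "and ")

print(to_string(original))
print(to_string(edited))
```

An edit only rebuilds the nodes along the path to the insertion point, so
the old and new texts share the untouched subtrees. A gap buffer, by
contrast, has to move its gap to wherever the edit happens.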

    > - For Guile specifically, the problematic thing is the C API.  Right
    >   now, strings are pretty much fixed to be vectors of unsigned bytes.
    >   We can't do much about this without breaking code.  So from that
    >   point of view, a new API for Unicode stuff looks like a good thing
    >   as well, when we can convince ourselves that people are willing to
    >   move over to that new API.

The proposal preserves the view that a string is a vector of unsigned
bytes.   It also adds a higher level view.

    > - The representation of texts would be determined by what is most
    >   natural for existing C code.  I.e., I think that Gtk+ uses UTF-8 and
    >   when we find that most libraries that we want to access from Guile
    >   use UTF-8 as well, we should make our text representation UTF-8.

That's an internal implementation detail, not a detail that should be
reflected in interface specifications.

It'd be just fine if Guile initially provided a C API that supported
only UTF-8 well, but that shouldn't be imposed on the Scheme-level
interfaces.

    > - Old code can be supported by allowing string-*, char-*, etc. to work
    >   on UTF-8 encoded texts that uses only ASCII code points.  That will
    >   causes problems to the 8-bit users (like latin-1, etc.), tho.  C
    >   code must avoid storing non-ASCII characters into such strings, and
    >   I'm not sure right now whether we can keep it from doing that in a
    >   compatible way.

No, I think you're basically screwed in that area.   Sorry.

And that leaves you with a choice between forking Guile from R5RS or
breaking upward compatibility for usages from C.   I suspect that the
damage to "usages from C" can be minimized to such a degree that
that's the way to go.

In this area, by the way, I'd suggest an encoding type which is not
UTF-8 but which is ISO-* for users of 8-bit sets.   If someone pokes
an unrepresentable character into an ISO-* string, either signal an
error or mutate the string's encoding --- either way will save current
C usages.
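
A sketch of the two options, in Python for illustration. The
`EncodedString` class, its `append` method, and the `on_unrepresentable`
parameter are hypothetical names invented for this example; appending stands
in for "poking" a character to keep the code short.

```python
class EncodedString:
    """Octet string tagged with its encoding (hypothetical, for illustration)."""
    def __init__(self, text, encoding="iso-8859-1"):
        self.encoding = encoding
        self.octets = bytearray(text.encode(encoding))

    def append(self, ch, on_unrepresentable="error"):
        try:
            self.octets += ch.encode(self.encoding)
        except UnicodeEncodeError:
            if on_unrepresentable == "error":
                raise   # option 1: signal an error
            # Option 2: mutate the string's encoding. Re-encode what is
            # already stored as UTF-8, then add the new character.
            text = self.octets.decode(self.encoding)
            self.encoding = "utf-8"
            self.octets = bytearray((text + ch).encode("utf-8"))

s = EncodedString("caf\u00e9")   # fits in ISO-8859-1: one octet per character

try:
    s.append("\u20ac")           # EURO SIGN is not in ISO-8859-1
except UnicodeEncodeError:
    print("error signalled, string unchanged")

s.append("\u20ac", on_unrepresentable="reencode")
print(s.encoding, len(s.octets))
```

As long as only representable characters are stored, the octets stay exactly
what existing 8-bit code expects.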
