Re: Unicode and Guile

From: Tom Lord
Subject: Re: Unicode and Guile
Date: Mon, 3 Nov 2003 12:31:50 -0800 (PST)

    > From: Andy Wingo <address@hidden>

    > On Sat, 25 Oct 2003, Tom Lord wrote:

    > > What do the index arguments to STRING-REF and STRING-SET refer to?
    > > Byte positions or character positions?

    > From (r5rs)Strings,

    >   The _length_ of a string is the number of characters that it contains.
    >   This number is an exact, non-negative integer that is fixed when the
    >   string is created.  The "valid indexes" of a string are the exact
    >   non-negative integers less than the length of the string.  The first
    >   character of a string has index 0, the second has index 1, and so on.

    > Clearly, the intention is not to specify the underlying representation.
    > It would not be Correct to allow string-ref to "leak out" details about
    > the underlying representation, by referencing partial characters for
    > instance.

Part of the problem is that Unicode specifications are very careful to
_not_ define "character" (except ambiguously).

In different contexts related to my question, it might mean a unicode
code point, a code value, or something more complicated such as a
grapheme (which may be represented as a string of unicode code
points).
It's a nasty problem to try to unify unicode types with Scheme types.


* CHAR? is a code value in some encoding (say, UTF-8 or UTF-16)

  In other words, CHAR? is an 8- or 16-bit integer that happens to
  coincide with ASCII values in some ways.

  A string is then a homogeneous array of such values -- and that's
  simple enough.

  But now CHAR? can't represent all unicode code points.

  A variation on this says that CHAR? is a (subset of?) 21-bit values,
  and a string is (semantically) a homogeneous array of those -- but
  now either STRING-REF and friends change in their "expected"
  complexity, or the string representation has to become quite complex.

* CHAR? is a unicode code point -- a 21-bit value.

  This approach has the same problems with string efficiency or 
  complexity -- but it has the advantage that algorithms defined 
  in terms of unicode code points (e.g. collation) translate very
  directly into Guile Scheme.

* CHAR? is a "grapheme" -- the user's idea of a character.

  Ray Dillinger is currently exploring this (see recent threads on
  comp.lang.scheme).

  It too requires a very complicated STRING? representation and,
  worse, an infinitely large set of characters.   On the other hand,
  of the three possibilities, it goes farthest in hiding the details
  of representation from users.
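
The three candidate meanings of CHAR? give three different "lengths"
for the same text. A sketch in Python (for illustration only; the
grapheme count is stated by hand, since Python's standard library does
no grapheme segmentation):

```python
# Take "é" written as e + combining acute accent.
text = "e\u0301"

utf8_code_values = len(text.encode("utf-8"))  # 8-bit code values: 3
code_points = len(text)                       # unicode code points: 2
graphemes = 1                                 # one user-perceived character

print(utf8_code_values, code_points, graphemes)  # 3 2 1
```

Whichever of the three a Scheme picks as CHAR? determines what
STRING-LENGTH and STRING-REF mean, and what representation a
constant-time STRING-REF demands.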

    >> There's a need for a new type, `text', which acts like the text
    >> contents of an emacs buffer and has (yes I agree) pretty much the
    >> Emacs interface. It should all be designed so that, internally, people
    >> can write new ways to represent text objects and multiple text object
    >> representations can coexist in the same application (just like emacs).
    >> There's no good reason not to throw in attributes, overlays, and
    >> markers for text objects too (just like emacs).

    > Maybe. This issue is, in my opinion, orthogonal to simple strings.

But perhaps it's worth mentioning in this context because it suggests a
very straightforward approach for Guile:

CHAR? is 8 bits.  STRING? is a sequence of 8-bit chars.  And
everything unicode is orthogonal to that.   While there may be support
for manipulating unicode strings represented as STRING? and unicode
characters represented as CHAR?, fundamentally, CHAR? and STRING? are
kept butt-simple and the unicode support is something new.

A nice side effect of that simple-minded approach is that it works
well with foreign functions written to handle UTF-8.
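
The point can be sketched with Python's byte strings standing in for
an 8-bit STRING? (illustrative only): UTF-8 data passes through
byte-oriented code untouched, and a Unicode-aware layer can be built
on top where it is wanted.

```python
# If STRING? is just a sequence of 8-bit chars, UTF-8 encoded text
# passes through unchanged -- byte-oriented code need not understand it.
utf8 = "héllo".encode("utf-8")

# Byte-level operations work without any unicode awareness...
assert len(utf8) == 6         # 6 bytes, though only 5 code points
assert utf8[0] == ord("h")    # a STRING-REF analogue yields a byte

# ...and a unicode-aware layer can decode on demand.
assert utf8.decode("utf-8") == "héllo"
```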

    > Users of guile-gtk (the -gobject 2.0 branch) will just use
    > GtkTextBuffer (and its associated view, GtkTextView). Those that
    > pine after emacs won't be satisfied until you can read mail in
    > your text buffer ;)

There's a point of view that says "look, traditional strings have a
very simple and clear operational model that is fundamentally
different from what a `unicode string' is.   It would be a shame to
take away support for that simple, traditional string type as a
precondition for making unicode text processing simpler."

