Re: about strings, symbols and chars.

From: Jim Blandy
Subject: Re: about strings, symbols and chars.
Date: 13 Jan 2001 12:08:10 -0500

> If I understood Jim correctly, he thinks scm_mb_next (and its
> cousins like scm_mb_walk) will be faster than either SCM_CHAR_GET
> or scm_mb_get, because it operates on the char array directly.

No --- I must be doing a terrible job of explaining this.

Under any representation, it is impossible to conceal the
fundamentals of the string representation from C code without a
substantial loss of efficiency:

- fixed-width representations must either use 32 bits for all
  characters, wasting memory and (perhaps more importantly, nowadays)
  memory bandwidth, or support several alternate encodings.  To
  conceal this would require the use of conditionals at the heart of
  every character-processing loop.

- variable-width representations are very difficult to treat as
  arrays; hiding these at the C level is simply not feasible.

For these and other reasons, I believe we should expose our string
representation to C code.  It should be part of the documented
interface.  This will allow users to recognize when they don't need
the fully general primitives, and write the best code for their
particular situation.

Now, in that context, we can ask whether we should use a fixed-width
or a variable-width representation.

If you look at actual, real-world string processing code, it turns out
that people scan more often than they index randomly.  In scanning
applications, you can handle variable-width strings simply by using
byte indices instead of character indices, or byte pointers.  It's not
a problem.

Furthermore, it turns out that UTF-8 and the Emacs MULE encoding
facilitate all kinds of clever tricks such that it's really easy to
write code which operates on them.  (Dirk's example of case conversion
happens *not* to be such an example.)

And finally, Guile used to use exactly the representation that Dirk
proposed --- an indicator in the header of the width, and then an
array of 8-, 16-, or 32-bit characters as appropriate.  We've tried
it.  It was a total pain in the neck.  Nobody wanted to deal with it,
and as a result, there were many places in Guile which simply rejected
anything but 8-bit strings.

In contrast, writing code that operates on UTF-8 strings is usually
straightforward.  You're still dealing with char *'s --- one type,
instead of three.  Concatenation is unchanged.  Scanning for ASCII
characters is unchanged.  Comparison for equality is unchanged.
(Lexicographic comparison is complicated by Unicode hair, which
affects either representation equally.)  In short, UTF-8 is more
manageable from C than juggling three character widths.

And a non-analytical argument: note that Perl and Tcl use UTF-8.  It
hasn't killed them.

(In general, the advantages of UTF-8 also accrue to Emacs MULE --- you
just need different tables.)

The most serious disadvantage is that the Scheme primitives still
promote string indexing as the primary way to get at strings.  I think
the best solution to this has several parts:

- Provide better primitives.  This is what Perl and Tcl do, and I
  think it's requisite.  The primitives can be smart about the
  encoding.  Cheap substrings, for one.

- Make strings include index caches.  With the cache, scanning
  algorithms will remain linear-time.
