[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: about strings, symbols and chars.

From: Jim Blandy
Subject: Re: about strings, symbols and chars.
Date: 21 Dec 2000 11:29:58 -0500

Dirk Herrmann <address@hidden> writes:

> On 19 Dec 2000, Jim Blandy wrote:
> > - I don't think something like SCM_CHAR_GET will be fast enough.
> > 
> >   The whole point of mbapi.texi is to give the programmer enough
> >   information to directly hack on the bytes themselves.  People will
> >   be writing Boyer-Moore searches, regexp engines, conversions to
> >   other encodings, etc.  When you've got loops that touch every
> >   character you process, you've got to haul the conditionals up out of
> >   the central loop, so SCM_CHAR_GET isn't good enough.
> > 
> >   Certainly, scm_mb_get will be as slow as, or slower than,
> >   SCM_CHAR_GET.  It's there for the cases where people need simplicity
> >   over performance.
> This sounds as if the proposal offered a faster way to access characters
> than scm_mb_get?  How can that be?  If, in principle, every string may
> potentially contain multi-byte characters at any position, you _have_ to
> check every character.

The proposal is to allow client code direct access to the byte array,
and to guarantee enough properties of the encoding to make that
feasible.  This is even faster than your macro, because you don't end
up with conditionals in your inner loops, and it gives you license to
use heavily tweaked library routines like memchr and memcpy.

I think the source of the disconnect is this: you're assuming that the
fundamental operation on strings is indexing by character number, but
it's not.  Strings are overwhelmingly more often scanned sequentially.
For that kind of access, variable-width representations are not a
serious problem.  However, in order to take advantage of the access
pattern, you have to move some of the logic out of the accessor and
into the surrounding loop.  My argument is that, given the other
primitives described in mbapi.texi, this is little work, and is

A fixed-width representation makes random access efficient, but random
access is rare, and fixed-width representations:
- waste space and increase the memory bandwidth used, and
- require loops to be written out several times for efficiency, to
  hoist conditionals based on the character size outside the loop

In cases where random access is necessary, mbapi.texi provides
primitives (like the cached indexing functions) that should do pretty

reply via email to

[Prev in Thread] Current Thread [Next in Thread]