[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

O(1) accessors for UTF-8 backed strings

From: Mark H Weaver
Subject: O(1) accessors for UTF-8 backed strings
Date: Sat, 12 Mar 2011 23:05:27 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)

address@hidden (Ludovic Courtès) writes:
> I also think strings should remain what they currently are, with O(1)
> random access.

I just realized that it is possible to implement O(1) accessors for
UTF-8 backed strings.

The simplest solution is to break the stringbuf into chunks, where each
chunk holds exactly CHUNK_SIZE characters, except for the last chunk
which might have fewer.  To find a character at index i, we scan forward
(i % CHUNK_SIZE) characters from the beginning of chunk[i / CHUNK_SIZE].
Ideally, CHUNK_SIZE is a power of 2, which allows us to use bitwise
operations instead of division.  The number of characters to scan is
bounded by CHUNK_SIZE, and therefore takes O(1) time.

String-set! in general requires us to resize a chunk.  Although chunks
contain a constant number of characters, that of course maps to a
variable number of bytes.  There are various tricks we can do to reduce
the number of reallocations done, but even in the worse case, the time
spent is bounded by CHUNK_SIZE, and thus O(1).

Converting a string to UTF-8 now consists of concatenating its chunks.
For short strings (<= CHUNK_SIZE) we can simply return the address of
the first chunk.  In the general case, we'll have to recombine the
chunks into a new block, which of course takes O(n) time.

With a little added complexity, we can reduce this cost in common cases,
and even eliminate it completely when string-set! is not used.  The idea
is to initially store a stringbuf in a single piece.  Only after the
first string-set! would the stringbuf be broken up into chunks.

A flag in the stringbuf would indicate whether or not it is chunked.  In
a chunked string, each chunk[i] points to its own memory block.  In an
unchunked string, the entire string is in one block of UTF-8, pointed to
by chunk[0].  chunk[i] for i>0 points into the middle of the block, at
an offset of i*CHUNK_SIZE characters.  This allows string-ref to treat
chunked and unchunked stringbufs equivalently.

It is probably worthwhile to actually _convert_ a stringbuf back into
unchunked form whenever we convert a string to utf8.  That way, in the
common case where string-set! is only used to initialize the string,
later conversions to utf8 can be done in constant time (except the first
one, which can be charged to the initialization routine).

Now, it is my hope that string-set! will be seldom used, and that we can
encourage people to instead use string ports and/or string-unfold and
string-unfold-right.  These other functions can be implemented without
string-set!, and will not require the use of the chunked representation.

Therefore, it is my hope that the chunked representation can be avoided
altogether in most all cases, which enables fast constant-time
conversions to UTF-8, and thus the efficient use of libunistring and any
other library that understands UTF-8.

What do you think?


reply via email to

[Prev in Thread] Current Thread [Next in Thread]