utf8 and emacs text/string multibyte representation

From: Stephen J. Turnbull
Subject: utf8 and emacs text/string multibyte representation
Date: Thu, 30 Oct 2014 12:08:33 +0900

Camm Maguire writes:

 > Greetings!  I've recently been considering supporting unicode in gcl by
 > representing strings internally in utf8.  It appears that emacs does the
 > same or similar.  Apart from the obvious memory footprint benefits,

If you need to *edit* large strings at arbitrary positions with high
performance, the memory footprint benefits are reduced by the need to
cache char position vs. memory position.  If you're on a 64-bit
architecture, those cache entries chew up memory 16 bytes at a time.

I think Emacs does a much better job of handling the position cache
than XEmacs does, so you're asking in the right place.  Just be aware
that it's possible to do it poorly. :-)
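
To make the position-cache idea concrete, here is a minimal sketch in
Python. It is not how Emacs or XEmacs actually does it; the class name
IndexedUtf8, the stride of 64, and the method names are all made up for
illustration. It records the byte offset of every stride-th character,
so a lookup scans at most stride-1 characters instead of the whole
string. The cache must be rebuilt or patched after any edit that
changes byte lengths, which is where the real cost is.

```python
class IndexedUtf8:
    """UTF-8 text with a sparse char-position -> byte-position cache.

    Hypothetical illustration only; assumes the buffer holds valid UTF-8.
    """

    def __init__(self, text: str, stride: int = 64):
        self.buf = bytearray(text.encode("utf-8"))
        self.stride = stride
        self.offsets = []   # offsets[k] = byte offset of character k * stride
        self.length = 0     # length in characters
        self._rebuild()

    def _rebuild(self) -> None:
        """Recompute the cache from scratch (needed after an edit)."""
        self.offsets = []
        count = 0
        for pos, byte in enumerate(self.buf):
            if (byte & 0xC0) != 0x80:        # not a continuation byte
                if count % self.stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count

    def char_to_byte(self, index: int) -> int:
        """Byte offset of the character at `index`."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        pos = self.offsets[index // self.stride]   # nearest cached position
        for _ in range(index % self.stride):       # short forward scan
            pos += 1
            while (self.buf[pos] & 0xC0) == 0x80:
                pos += 1
        return pos

    def __getitem__(self, index: int) -> str:
        start = self.char_to_byte(index)
        end = start + 1
        while end < len(self.buf) and (self.buf[end] & 0xC0) == 0x80:
            end += 1
        return self.buf[start:end].decode("utf-8")
```

A smaller stride makes lookups faster but grows the cache, which is
the memory-vs-speed tradeoff mentioned above.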

 > Yet setting string elements can trigger reallocations/memmove
 > operations.  While these can be aggregated over the setting of
 > multiple elements, operations like nreverse look ridiculous if left
 > in terms of calls to aref and aset.

How many of those operations are there, though?  At worst, nreverse
requires a few bytes of temporary storage to be implemented
efficiently.  If there are only a few of them, just implement them as
primitives.
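
As a sketch of what such an operation could look like when it works on
the bytes directly, here is one well-known way to reverse UTF-8 text in
place, written in Python over a bytearray. The function name
nreverse_utf8 is hypothetical and the code assumes the buffer holds
valid UTF-8. It reverses the whole buffer bytewise, then restores the
byte order inside each multibyte character; the only temporary storage
is a slice of at most four bytes per character.

```python
def nreverse_utf8(buf: bytearray) -> None:
    """Reverse the characters of UTF-8 text in place (hypothetical helper)."""
    buf.reverse()   # characters are now in the right order, but each
                    # multibyte character has its bytes backwards
    n = len(buf)
    i = 0
    while i < n:
        j = i
        # After the bytewise reversal a multibyte character appears as its
        # continuation bytes (10xxxxxx) followed by its lead byte.
        while j < n and (buf[j] & 0xC0) == 0x80:
            j += 1
        if j > i:
            # Same-length slice assignment: no reallocation, no memmove
            # of the rest of the buffer.
            buf[i:j + 1] = buf[i:j + 1][::-1]
        i = j + 1


if __name__ == "__main__":
    data = bytearray("a\u00e9\u4e16\U0001F600".encode("utf-8"))
    nreverse_utf8(data)
    print(data.decode("utf-8"))
```

Compare that with a loop of aref/aset calls, where each aset may change
the byte length of one character and shift everything after it.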

Note that Python has chosen to use a "just big enough for the data"
fixed-width representation, and AFAIK the Python-licensed code is
This strategy has the advantage that manipulating strings internally
is always an array operation, so Python code can be efficient
(enough); you don't need to reimplement such operations as primitives,
and there are no gotchas for user code where the user code looks like
it's operating on an array (efficient) but is actually moving large
chunks of memory around all the time.
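
The "just big enough" representation can be observed from Python
itself. In CPython 3.3 and later (PEP 393) a string is stored with 1,
2, or 4 bytes per character depending on the widest code point it
contains. The sketch below estimates that width with sys.getsizeof;
the helper name bytes_per_char is made up for illustration, and the
result is a CPython implementation detail rather than a language
guarantee.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage per character for a string made only of `ch`."""
    # Taking the difference of two lengths cancels the fixed object header.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


if __name__ == "__main__":
    for ch in ("a", "\u4e16", "\U0001F600"):
        print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

On such a CPython this reports 1, 2, and 4 bytes respectively, so
indexing is plain array arithmetic at every width. The price is that
storing one wide character into a narrow string forces the whole
string to be widened.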
