Re: utf8 and emacs text/string multibyte representation

From: Camm Maguire
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Thu, 30 Oct 2014 10:13:20 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)


Eli Zaretskii <address@hidden> writes:

>> From: Camm Maguire <address@hidden>
>> Cc: address@hidden,  address@hidden
>> Date: Wed, 29 Oct 2014 11:55:13 -0400
>> Does every string access in emacs proceed through the utf8 decoder?
> If you need to look at the character, yes.  E.g., if you need some
> property of the character, you need to index the appropriate table by
> that character's codepoint.  But in most operations that is not
> needed.  You just need to recognize several specific characters, like
> the null character, the slash, etc., most of which are ASCII.

Do you allocate a fresh boxed character on each aref, or output an
integer referring to a fixed ~2^22 sized table?  Do you maintain such a
table in core?
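
To make the quoted point concrete, here is a minimal Python sketch (not
Emacs or GCL code; the names find_ascii_byte and decode_at are
hypothetical). It assumes valid UTF-8 input and shows that an ASCII
character such as the slash can be located bytewise with no decoding,
because bytes below 0x80 never occur inside a multibyte sequence, while
decoding to an integer codepoint is only needed when a property of the
character is wanted.

```python
def find_ascii_byte(buf: bytes, ch: str) -> int:
    """Return the byte offset of the first ASCII character `ch`, or -1.

    No decoding is needed: in valid UTF-8, bytes below 0x80 only ever
    stand for themselves.
    """
    target = ord(ch)
    if target >= 0x80:
        raise ValueError("only ASCII characters can be matched bytewise")
    return buf.find(bytes([target]))


def decode_at(buf: bytes, i: int):
    """Decode the UTF-8 sequence whose lead byte is at offset i.

    Returns (codepoint, length_in_bytes). Assumes valid UTF-8.
    """
    b0 = buf[i]
    if b0 < 0x80:                 # 0xxxxxxx: one-byte (ASCII) sequence
        return b0, 1
    if b0 >> 5 == 0b110:          # 110xxxxx: two-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:       # 1110xxxx: three-byte sequence
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:      # 11110xxx: four-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("offset is not on a lead byte")
    for b in buf[i + 1:i + n]:    # each continuation byte adds 6 bits
        cp = (cp << 6) | (b & 0x3F)
    return cp, n
```

The codepoint comes back as a plain integer, so nothing in this sketch
allocates a character object per access. Whether Emacs does the same is
exactly the open question above.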

>> >> A cached internal pointer storing the last referenced codepoint
>> >> offset makes access essentially O(1).
>> >
>> > We indeed maintain a cache for byte-to-character and character-to-byte
>> > conversions.
>> How big is this cache?
> Its size is dynamic, and depends on how frequently the conversion is
> needed in places that are far away.  The cache stores byte-to-char
> correspondence in places that are far away, and Emacs uses binary
> search in between them.

How far is 'far away'?
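
For reference, the kind of cache under discussion can be sketched in
Python as follows. This is not the Emacs implementation: the class name
CharByteCache and the stride parameter are hypothetical, and the sketch
uses binary search only to pick the nearest cached checkpoint, then
scans forward linearly from there. It assumes valid UTF-8 data.

```python
import bisect


class CharByteCache:
    """Character-index to byte-offset lookup with sparse checkpoints."""

    def __init__(self, data: bytes, stride: int = 4):
        self.data = data
        self.stride = stride   # hypothetical spacing between checkpoints
        self.chars = [0]       # sorted character indices of checkpoints
        self.offsets = [0]     # byte offset matching each checkpoint

    def char_to_byte(self, char_index: int) -> int:
        # Binary search for the last checkpoint at or before char_index.
        k = bisect.bisect_right(self.chars, char_index) - 1
        c, b = self.chars[k], self.offsets[k]
        # Scan forward one character at a time from that checkpoint.
        while c < char_index:
            if b >= len(self.data):
                raise IndexError("character index out of range")
            b += 1
            # Skip continuation bytes (10xxxxxx) of the current character.
            while b < len(self.data) and (self.data[b] & 0xC0) == 0x80:
                b += 1
            c += 1
            # Record a new checkpoint once we are `stride` characters
            # past the last one; appending keeps both lists sorted.
            if c - self.chars[-1] >= self.stride:
                self.chars.append(c)
                self.offsets.append(b)
        return b
```

In this sketch 'far away' is simply the fixed stride, which bounds the
scan after the first pass over a region; the quoted reply describes a
cache whose size adapts to the access pattern instead.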

If you had this to do all over again, would you still opt for the

While you have buffers to consider too, which probably relate to
strings, it seems to me that the dominant costs are always related to
memory allocation and gc. That makes the memory footprint important,
but not at the expense of allocating characters. The most frequent
operations are removals and pattern substitutions, which can proceed
bytewise with the same gc overhead.
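
As a small illustration of the bytewise claim, the Python sketch below
(the helper name substitute_bytewise is hypothetical) replaces a
pattern directly on the UTF-8 bytes. It assumes the text, pattern, and
replacement are all valid UTF-8; under that assumption the encoded
pattern can only match at character boundaries, since UTF-8 is
self-synchronizing, so the result agrees with a character-level
substitution.

```python
def substitute_bytewise(text: bytes, pattern: str, replacement: str) -> bytes:
    """Replace every occurrence of `pattern` in UTF-8 `text`, bytewise.

    An empty replacement performs a removal. No character is decoded.
    """
    needle = pattern.encode("utf-8")
    if not needle:
        # An empty needle would match between the bytes of a multibyte
        # character and could split it, so it is rejected here.
        raise ValueError("pattern must not be empty")
    return text.replace(needle, replacement.encode("utf-8"))
```

For example, substitute_bytewise("café".encode("utf-8"), "é", "e")
yields b"cafe" without ever computing a codepoint.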

GCL also supports regular expressions -- how is this modified for utf-8?

Take care,
Camm Maguire                                        address@hidden
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
