Re: utf8 and emacs text/string multibyte representation

From: Raymond Toy
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Wed, 29 Oct 2014 08:56:55 -0700
User-agent: Gnus/5.101 (Gnus v5.10.10) XEmacs/21.5-b34 (darwin)

>>>>> "Camm" == Camm Maguire <address@hidden> writes:

    Camm> Greetings!  I've recently been considering supporting unicode in gcl by
    Camm> representing strings internally in utf8.  It appears that emacs does the
    Camm> same or similar.  Apart from the obvious memory footprint benefits, I'd
    Camm> like to ask what other advantages/disadvantages have been discovered.
    Camm> Much of the utf8 literature emphasizes that most algorithms can proceed
    Camm> conventionally in byte-wise fashion, including lexicographical
    Camm> comparisons, given that almost all jobs are sequential, at least
    Camm> initially.  A cached internal pointer storing the last referenced
    Camm> codepoint offset makes access essentially O(1).  Yet setting string
    Camm> elements can trigger reallocations/memmove operations.  While these can
    Camm> be aggregated over the setting of multiple elements, operations like
    Camm> nreverse look ridiculous if left in terms of calls to aref and aset.

    Camm> Thoughts, advice and experiences most appreciated.
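
To make the trade-off in the quoted text concrete, here is a minimal sketch in Python (not gcl or emacs code) of a UTF-8 backed string with a one-entry cache mapping the last referenced code point index to its byte offset. The class name `Utf8String` and its methods are hypothetical, invented for illustration, and the sketch assumes the stored bytes are always valid UTF-8. Sequential reads advance the cursor by one character each, a backward access falls back to a rescan from the start, and a store whose encoded width differs from the old one changes the buffer length, which is where a C implementation would need a memmove or a reallocation.

```python
class Utf8String:
    """Hypothetical UTF-8 backed string with a cached (index, offset) cursor."""

    def __init__(self, text: str) -> None:
        self.data = text.encode("utf-8")
        self._idx = 0  # code point index of the cached position
        self._off = 0  # byte offset of that code point in self.data

    @staticmethod
    def _width(lead: int) -> int:
        # Length of a UTF-8 sequence, read from its lead byte.
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def _seek(self, index: int) -> int:
        # Move the cached cursor to `index` and return its byte offset.
        if index < 0:
            raise IndexError(index)
        if index < self._idx:
            # Backward access: this sketch simply rescans from the start.
            self._idx = self._off = 0
        while self._idx < index:
            if self._off >= len(self.data):
                raise IndexError(index)
            self._off += self._width(self.data[self._off])
            self._idx += 1
        if self._off >= len(self.data):
            raise IndexError(index)
        return self._off

    def char_at(self, index: int) -> str:
        off = self._seek(index)
        return self.data[off:off + self._width(self.data[off])].decode("utf-8")

    def set_char(self, index: int, ch: str) -> None:
        if len(ch) != 1:
            raise ValueError("expected exactly one code point")
        off = self._seek(index)
        old = self._width(self.data[off])
        # If the new encoding has a different width, every byte after it
        # shifts; in C this is the memmove (or realloc) mentioned above.
        self.data = self.data[:off] + ch.encode("utf-8") + self.data[off + old:]


def bytewise_sorted(words):
    # Sort by comparing the raw UTF-8 bytes only.
    return sorted(words, key=lambda w: w.encode("utf-8"))
```

The byte-wise comparison claim holds because UTF-8 preserves code point order: `bytewise_sorted` returns the same order as sorting the strings by code point. A reverse operation written as a loop over `char_at` and `set_char` would thrash this single-entry cache and rebuild the buffer repeatedly, which is the nreverse concern raised in the quoted text.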

Have you looked at what other Lisp implementations do? AFAIK, none use
utf-8. CCL and clisp use utf-32, cmucl and allegro use utf-16, sbcl
and ecl(?) have two string types: 8-bit base-string and 32-bit
character strings.

As a one-man operation (unfortunately), I'd go with the easiest one to
get right and follow either ccl or cmucl.  The rest of the support for
unicode can be added with libraries like cl-unicode and/or babel, if
need be.
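
For a rough feel of the storage side of this choice, here is a small Python sketch (Python only because it is easy to run; it says nothing about how any of the Lisps above lay out strings internally). The helper names `storage_bytes` and `utf16_units` are made up for this example. It counts the bytes a string needs in each encoding and the number of 16-bit code units in UTF-16, which exceeds the code point count once a character outside the Basic Multilingual Plane appears.

```python
def storage_bytes(text: str) -> dict:
    """Bytes needed to store `text` in each encoding, without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_units(text: str) -> int:
    """Number of 16-bit code units; surrogate pairs count as two."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    for sample in ["hello", "\u20ac", "a\U0001F600"]:
        print(repr(sample), len(sample), utf16_units(sample), storage_bytes(sample))
```

With utf-32 every code point is one fixed-width element, so indexing and setting elements need no extra bookkeeping, at the cost of four bytes per character. With utf-16 the elements are fixed width only if strings are indexed by code unit rather than by code point.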

