Re: utf8 and emacs text/string multibyte representation
From: Eli Zaretskii
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Thu, 30 Oct 2014 18:06:41 +0200
> From: Camm Maguire <address@hidden>
> Cc: address@hidden, address@hidden
> Date: Thu, 30 Oct 2014 10:13:20 -0400
>
> >> Does every string access in emacs proceed through the utf8 decoder?
> >
> > If you need to look at the character, yes. E.g., if you need some
> > property of the character, you need to index the appropriate table by
> > that character's codepoint. But in most operations that is not
> > needed. You just need to recognize several specific characters, like
> > the null character, the slash, etc., most of which are ASCII.
> >
>
> Do you allocate a fresh boxed character on each aref, or output an
> integer referring to a fixed ~2^22 sized table?
I'm not sure what you mean by a "boxed character". A character in
Emacs is just an int.
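
To illustrate the point that a character is just an integer, here is a minimal sketch in Python. It assumes plain, valid UTF-8 (the Emacs internal representation is a superset and is not reproduced here), and the names `decode_char_at` and `find_slash` are hypothetical helpers, not Emacs functions. Decoding yields a bare integer with nothing allocated per character, and looking for an ASCII character such as the slash needs no decoding at all, because bytes below 0x80 never occur inside a multibyte UTF-8 sequence.

```python
from typing import Tuple


def decode_char_at(data: bytes, i: int) -> Tuple[int, int]:
    """Decode the UTF-8 sequence starting at byte offset i.

    Returns (codepoint, length_in_bytes). Assumes data is valid UTF-8
    and that i points at the first byte of a sequence.
    """
    b0 = data[i]
    if b0 < 0x80:                 # ASCII: the byte is the codepoint
        return b0, 1
    if b0 >> 5 == 0b110:          # lead byte of a 2-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:       # lead byte of a 3-byte sequence
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:      # lead byte of a 4-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("offset does not point at a lead byte")
    for b in data[i + 1:i + n]:   # each trailing byte carries 6 payload bits
        cp = (cp << 6) | (b & 0x3F)
    return cp, n


def find_slash(data: bytes) -> int:
    """Find the first '/' by scanning bytes only; no decoding is needed."""
    return data.find(b"/")
```

For example, with `data = "é/€".encode("utf-8")`, `decode_char_at(data, 0)` returns `(0xE9, 2)` and `find_slash(data)` returns the byte offset `2`.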
> Do you maintain such a table in core?
We have a lot of tables indexed by characters. Their implementation
is memory efficient: it can store identical values for a range of
characters, and also store the default value with minimal overhead.
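
The idea of a character-indexed table that stores one value for a whole range and falls back to a default can be sketched as follows. This is only an illustration of the idea described above; it does not attempt to reproduce the actual Emacs data structure, and `RangeTable` with its methods is a hypothetical name chosen for this example.

```python
import bisect


class RangeTable:
    """Maps integer character codes to values, one stored entry per range."""

    def __init__(self, default=None):
        self.default = default   # returned for codes outside every range
        self._starts = []        # sorted first codes of the stored ranges
        self._ranges = []        # parallel list of (start, end, value)

    def set_range(self, start, end, value):
        """Assign value to every code in [start, end] (inclusive).

        For simplicity this sketch rejects overlapping ranges.
        """
        i = bisect.bisect_left(self._starts, start)
        if i > 0 and self._ranges[i - 1][1] >= start:
            raise ValueError("overlaps the preceding range")
        if i < len(self._starts) and self._starts[i] <= end:
            raise ValueError("overlaps the following range")
        self._starts.insert(i, start)
        self._ranges.insert(i, (start, end, value))

    def get(self, code):
        """Look up a code by binary search over the range starts."""
        i = bisect.bisect_right(self._starts, code) - 1
        if i >= 0:
            _start, end, value = self._ranges[i]
            if code <= end:
                return value
        return self.default
```

With this layout, a table covering the whole character space costs memory proportional to the number of distinct ranges rather than to the number of characters, and a lookup for an unlisted character costs nothing beyond the single stored default.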
> >> > We indeed maintain a cache for byte-to-character and character-to-byte
> >> > conversions.
> >>
> >> How big is this cache?
> >
> > Its size is dynamic, and depends on how frequently the conversion is
> > needed in places that are far away. The cache stores byte-to-char
> > correspondence in places that are far away, and Emacs uses binary
> > search in between them.
> >
>
> How far is 'far away'?
The current heuristic value is 5000 characters.
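
A rough sketch of such a position cache, under stated assumptions: the text is valid UTF-8, anchors are recorded lazily while scanning, and the spacing defaults to the 5000 characters mentioned above. `PositionCache` and its methods are hypothetical names. The sketch uses bisection to pick the nearest cached anchor and then a plain linear scan from there, which is a simplification and not necessarily what Emacs does between anchors.

```python
import bisect

ANCHOR_SPACING = 5000  # characters; the heuristic figure quoted above


class PositionCache:
    """Caches (character position, byte position) pairs for UTF-8 bytes."""

    def __init__(self, data: bytes, spacing: int = ANCHOR_SPACING):
        self.data = data
        self.spacing = spacing
        self.char_anchors = [0]  # sorted character positions
        self.byte_anchors = [0]  # byte position of each anchor

    @staticmethod
    def _char_len(lead: int) -> int:
        """Length in bytes of the sequence that starts with this lead byte."""
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        return 4

    def char_to_byte(self, charpos: int) -> int:
        """Convert a character position into a byte offset."""
        # Start from the closest cached anchor at or before charpos.
        i = bisect.bisect_right(self.char_anchors, charpos) - 1
        c, b = self.char_anchors[i], self.byte_anchors[i]
        while c < charpos:
            if b >= len(self.data):
                raise IndexError("character position past end of text")
            b += self._char_len(self.data[b])
            c += 1
            # Record a new anchor once we are `spacing` characters away
            # from the previous one, so later lookups start closer.
            if c - self.char_anchors[i] >= self.spacing:
                i += 1
                self.char_anchors.insert(i, c)
                self.byte_anchors.insert(i, b)
        return b
```

The cache grows only where conversions are actually requested, which matches the description of its size as dynamic: a text that is only ever accessed near its beginning keeps a single anchor.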
> If you had this to do all over again, would you still opt for the
> multibyte?
Yes, I think so. As far as I know, nobody has ever suggested switching.
> While you have buffers to consider too, which probably relate to
> strings, it seems to me that the dominant costs are always memory
> allocation/gc related, making the memory footprint important but not at
> the expense of allocating characters, and that the most frequent
> operations are removals/pattern substitutions, which can proceed
> bytewise with the same gc overhead.
We don't allocate characters; they are just integers.
As for strings, Emacs allocates small strings specially, to minimize
overhead. And of course, there's GC that takes care of freeing
memory.
> GCL also supports regular expressions -- how is this modified for utf-8?
We use GNU regexp, slightly modified for Emacs. I suggest taking a
look at the source.