guile-devel

Re: about strings, symbols and chars.


From: Jorgen 'forcer' Schaefer
Subject: Re: about strings, symbols and chars.
Date: 24 Dec 2000 06:50:22 +0100
User-agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7

Dirk Herrmann <address@hidden> writes:

> On 19 Dec 2000, Jim Blandy wrote:
> 
> >   Certainly, scm_mb_get will be as slow as, or slower than,
> >   SCM_CHAR_GET.  It's there for the cases where people need simplicity
> >   over performance.
> 
> This sounds as if the proposal offered a faster way to access characters
> than scm_mb_get?  How can that be?  If, in principle, every string may
> potentially contain multi-byte characters at any position, you _have_ to
> check every character.

If I understood Jim correctly, he thinks scm_mb_next (and its
cousins like scm_mb_walk) will be faster than either SCM_CHAR_GET
or scm_mb_get, because it operates on the char array directly.
SCM_CHAR_GET has to check the type (and thus the size of the
chars) of the string on each access.  On the other hand,
scm_mb_next has to check how long the next char is on each
access, so it is not much faster, if faster at all.
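
To make the comparison concrete, here is a minimal sketch of the two
per-access checks.  It is not the code from Jim's proposal: UTF-8 is
only assumed as a stand-in for the multi-byte encoding, and the names
mb_char_len, mb_count_chars, struct fw_string and fw_ref are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multi-byte case (UTF-8 assumed): the length of a character is
   decided by inspecting its first byte on every step. */
static size_t mb_char_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;  /* invalid lead byte: advance one byte so the walk ends */
}

/* Sequential walk: count the characters in nbytes bytes. */
static size_t mb_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t i = 0, n = 0;
    while (i < nbytes) {
        i += mb_char_len(s[i]);
        n++;
    }
    return n;
}

/* Fixed-width case: the string carries its element width, and every
   access has to switch on it. */
struct fw_string {
    unsigned width;    /* bytes per character: 1, 2 or 4 */
    size_t length;     /* number of characters */
    const void *data;
};

static uint32_t fw_ref(const struct fw_string *s, size_t i)
{
    assert(i < s->length);
    switch (s->width) {
    case 1:  return ((const uint8_t *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}
```

Both variants pay one branch per character; the difference is that
fw_ref can jump to any index directly, while the multi-byte walk has
to start from a known character boundary.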

[The following is kinda long, I guess you're aware of all of
this.  I have a short, not very useful conclusion at the end,
though]

I think that the whole problem of multi-byte vs. fixed-byte
encoding is not much of a performance issue.  Fixed-byte strings
are "simpler", and can be accessed randomly without performance
overhead (you could provide a macro which extracts the width of a
given string), but have problems regarding memory usage.  A
single non-latin-1 character in a 4k string would make the whole
string take up 8k (8bit to 16bit expansion), while in multi-byte
it requires 4k+1 bytes (long strings are rather uncommon in
usage, though).  Multi-byte strings have problems when it comes
to setting the value of characters -- you might have to copy the
rest of the string if its size differs from the previous
character size.  Fixed-width strings need only be copied if you
put in a character which needs a "bigger" encoding than you had
available before.
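
The 8k vs. 4k+1 figures can be checked with a small sketch.  The
assumptions here are mine, not the proposal's: fixed-width strings use
1, 2 or 4 bytes per character depending on the largest code point, and
the multi-byte encoding has UTF-8 sizes (so the 4k+1 figure holds when
all other characters are ASCII).  The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width: every character takes as many bytes as the largest one. */
static size_t fw_storage(const uint32_t *cps, size_t n)
{
    uint32_t max_cp = 0;
    for (size_t i = 0; i < n; i++)
        if (cps[i] > max_cp)
            max_cp = cps[i];
    size_t width = max_cp <= 0xFF ? 1 : max_cp <= 0xFFFF ? 2 : 4;
    return n * width;
}

/* Multi-byte (UTF-8 sizes assumed): each character is sized on its own. */
static size_t mb_storage(const uint32_t *cps, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] < 0x80)         bytes += 1;
        else if (cps[i] < 0x800)   bytes += 2;
        else if (cps[i] < 0x10000) bytes += 3;
        else                       bytes += 4;
    }
    return bytes;
}
```

For 4096 characters of which one is U+0100 and the rest are 'a',
fw_storage gives 8192 bytes and mb_storage gives 4097.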

The only real disadvantage of multi-byte strings seems to me that
it's more difficult to set characters at places which had a
different width before.  A more functional approach here would be
beneficial.

The disadvantages of fixed-width strings are that they can be
overly space-consuming and require similar copying to the
multi-byte version, but less often.  Also, they need to
differentiate between different "types" of strings.

> With a variable width encoding I see problems if threads are used:  A
> thread that does a string-set! can modify the byte positions of a large
> set of characters
> [...]
> Things are not really different with fixed-width encodings:  Doing a
> string-set! can require to switch a whole string from a single-byte
> representation to a two or four byte representation.  But the
> recalculation of a character's position is a fast operation.

With multi-byte strings it's "calculate size difference, copy
memory region", which is even more efficient than, say, copying n
1byte locations to n 2byte locations, since the former can be
done wordwise.  But the fixed-width string has to be copied only
once, while the multi-byte string has to be copied many times
over (assuming you're setting a range of chars to a different
size encoding).
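
As a rough sketch of the two copy patterns (hypothetical helper
names, UTF-8 bytes used only as sample data, no locking shown):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multi-byte string-set!: replace the old_len-byte character at byte
   offset pos with the new_len bytes in repl.  The tail is shifted with
   a single block move.  Returns 0 on success, -1 if buf is too small. */
static int mb_replace(unsigned char *buf, size_t *len, size_t cap,
                      size_t pos, size_t old_len,
                      const unsigned char *repl, size_t new_len)
{
    assert(pos + old_len <= *len);
    size_t new_total = *len - old_len + new_len;
    if (new_total > cap)
        return -1;
    /* "calculate size difference, copy memory region" */
    memmove(buf + pos + new_len, buf + pos + old_len,
            *len - pos - old_len);
    memcpy(buf + pos, repl, new_len);
    *len = new_total;
    return 0;
}

/* Fixed-width widening: each 1-byte element is converted to a 2-byte
   element, so this is an element-by-element loop, not a block copy. */
static void fw_widen_8_to_16(uint16_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

mb_replace runs once for every character whose size changes, whereas
fw_widen_8_to_16 runs once per string, the first time a wider
character is stored.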


Fixed-width strings are faster when setting different-width chars,
which isn't required often and can be avoided.

Fixed-width strings can be easily accessed randomly, though most
of the time, strings are accessed sequentially, which is as fast
as in the multi-byte case.

Fixed-width strings require different types, and a switch on the
string's type on each access, but multi-byte requires a switch on
the first byte of the next character on each access.

Fixed-width strings consume more memory, but this is not really
relevant since really long strings are rare, and memory isn't.

Concluding, there's not much difference between the two
representations.  I know this is a long mail just to say "hey,
it's not much of a difference", but I guess I had to write it.
Maybe someone can show me where I overlooked something?

Well, just my few cents...
        -- jorgen

-- 
((email . "address@hidden")       (www . "http://forcix.cx/")
 (irc   . "address@hidden (IRCnet)") (gpg .    "1024D/028AF63C"))


