Re: about strings, symbols and chars.
Thu, 21 Dec 2000 11:40:46 +0100 (MET)
On 19 Dec 2000, Jim Blandy wrote:
> - I don't think something like SCM_CHAR_GET will be fast enough.
> The whole point of mbapi.texi is to give the programmer enough
> information to directly hack on the bytes themselves. People will
> be writing Boyer-Moore searches, regexp engines, conversions to
> other encodings, etc. When you've got loops that touch every
> character you process, you've got to haul the conditionals up out of
> the central loop, so SCM_CHAR_GET isn't good enough.
> Certainly, scm_mb_get will be as slow as, or slower than,
> SCM_CHAR_GET. It's there for the cases where people need simplicity
> over performance.
This sounds as if the proposal offered a faster way to access characters
than scm_mb_get? How can that be? If, in principle, every string may
potentially contain multi-byte characters at any position, you _have_ to
check every character.
You may be able to do some checking in advance, like finding ASCII
substrings in a larger string and then restricting your algorithms to
working on these substrings. But this means you are actually providing
specialized code for the case of ASCII characters. You could do the same
with the solution I described by checking that you have a 1-byte encoded
string.
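
To make the "check in advance" idea concrete, here is a minimal sketch.
It assumes a UTF-8-like encoding (which may differ from the one in
mbapi.texi) in which every byte of a multi-byte character has its high
bit set. The names all_single_byte and char_to_byte_offset are
hypothetical and not part of any Guile API.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte is below 0x80, i.e. the
   buffer contains only single-byte characters.  Assumes that all bytes
   of a multi-byte character have the high bit set (true for UTF-8). */
static int all_single_byte(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return 0;
    return 1;
}

/* Byte offset of the n-th character.  The single_byte flag is the
   result of the check above, computed once outside any loop. */
static size_t char_to_byte_offset(const unsigned char *s, size_t len,
                                  size_t n, int single_byte)
{
    if (single_byte)                 /* specialized case: index == offset */
        return n < len ? n : len;

    size_t i = 0;                    /* general case: inspect every character */
    while (n > 0 && i < len) {
        i++;                         /* step over the lead byte */
        while (i < len && (s[i] & 0xC0) == 0x80)
            i++;                     /* step over UTF-8 continuation bytes */
        n--;
    }
    return i;
}
```

Either way there are two code paths. Whether the flag comes from a scan
like this or from a string's fixed-width tag, the general path still has
to be written.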
The consequences are the same in both situations (for example with respect
to your argument below): People could provide code that was, for example,
only able to deal with single-byte characters. It makes no difference if
they do so by checking whether a string uses a one byte fixed-length
encoding, or whether they call a function that verifies that a string
'coincidentally' only consists of single-byte characters: If they forget
to provide the code that deals with the other cases, they have a problem.
With a variable width encoding I see problems if threads are used: A
thread that does a string-set! can modify the byte positions of a large
set of characters, namely if a character is replaced by one that has a
different width. Assume that thread A steps through the characters of a
string, while thread B modifies single characters of the same
string. Either the code has to be designed, using mutexes, so that A
finishes all of its work before B starts (or the other way around), or A
has to continuously recalculate the actual starting byte position of the
character it is working on.
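
A small sketch of the effect described above, again assuming UTF-8 as a
stand-in for the variable-width encoding. The functions byte_offset and
replace_char are hypothetical illustrations, and no locking is shown.

```c
#include <stddef.h>
#include <string.h>

/* Byte offset of the n-th character in a UTF-8 buffer (assumed encoding). */
static size_t byte_offset(const unsigned char *s, size_t len, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < len) {
        i++;
        while (i < len && (s[i] & 0xC0) == 0x80)
            i++;
        n--;
    }
    return i;
}

/* Replace the character starting at byte offset `at` with the rep_len
   bytes in `rep`.  Returns the new length in bytes, or 0 if the result
   would not fit into `cap` bytes. */
static size_t replace_char(unsigned char *buf, size_t len, size_t cap,
                           size_t at, const unsigned char *rep,
                           size_t rep_len)
{
    size_t old = 1;                          /* width of the old character */
    while (at + old < len && (buf[at + old] & 0xC0) == 0x80)
        old++;
    if (len - old + rep_len > cap)
        return 0;
    /* Everything behind the replaced character moves. */
    memmove(buf + at + rep_len, buf + at + old, len - at - old);
    memcpy(buf + at, rep, rep_len);
    return len - old + rep_len;
}
```

If thread B replaces the one-byte second character of "abc" with a
two-byte character, the third character moves from byte offset 2 to byte
offset 3. A byte position that thread A cached before the replacement now
points into the middle of the new character.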
Things are not really different with fixed-width encodings: Doing a
string-set! can require switching a whole string from a single-byte
representation to a two- or four-byte representation. But the
recalculation of a character's position is a fast operation: it is just
the character index times the character width.
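
As a rough sketch of such a fixed-width scheme: the struct fw_string and
the functions fw_ref, fw_store and fw_set below are hypothetical names
for illustration, not Guile's actual string representation. Characters
are stored as plain code values of 1, 2 or 4 bytes each.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-width string: every character occupies `width`
   bytes (1, 2 or 4), so character i lives at byte offset i * width. */
struct fw_string {
    size_t   len;    /* length in characters */
    unsigned width;  /* bytes per character */
    void    *data;   /* malloc'ed storage, len * width bytes */
};

static uint32_t fw_ref(const struct fw_string *s, size_t i)
{
    switch (s->width) {
    case 1:  return ((const uint8_t  *) s->data)[i];
    case 2:  return ((const uint16_t *) s->data)[i];
    default: return ((const uint32_t *) s->data)[i];
    }
}

static void fw_store(void *data, unsigned width, size_t i, uint32_t c)
{
    switch (width) {
    case 1:  ((uint8_t  *) data)[i] = (uint8_t)  c; break;
    case 2:  ((uint16_t *) data)[i] = (uint16_t) c; break;
    default: ((uint32_t *) data)[i] = c;            break;
    }
}

/* Store c at index i, widening the whole string first if c does not
   fit into the current width.  Returns 0 on success, -1 on failure. */
static int fw_set(struct fw_string *s, size_t i, uint32_t c)
{
    unsigned need = c < 0x100 ? 1 : c < 0x10000 ? 2 : 4;

    if (i >= s->len)
        return -1;
    if (need > s->width) {
        void *wide = malloc(s->len * need);
        if (wide == NULL)
            return -1;
        for (size_t k = 0; k < s->len; k++)   /* copy every character over */
            fw_store(wide, need, k, fw_ref(s, k));
        free(s->data);
        s->data = wide;
        s->width = need;
    }
    fw_store(s->data, s->width, i, c);
    return 0;
}
```

Widening swaps the storage pointer, so a concurrent reader still needs
some synchronization. What it no longer needs is a rescan of the string
to find where character i starts.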
> - One of the hardest parts about multilingualization is actually
> getting people to do it.
> There is an enormous temptation for C programmers to simply declare,
> "My code only works on 8-bit strings" and get back to work on
> whatever they really wanted to code. All it takes is one module
> along a datapath to be negligent in this regard to make the whole
> system effectively unilingual. Programmers are likely to ignore a
> mixed-representation system. Guile originally had such a string
> type, and, in fact, people ignored it. Guile was full of functions
> that only supported the 8-bit variant. It was useless.
> It is my feeling that a single representation which allows direct
> access to the bytes and can almost always be handled exactly like
> ASCII data is more likely to be an acceptable burden.
I don't see why an API for strings of different fixed-length encodings
should not have the same properties. If you only provide functions and
macros that handle the different encodings, people will also be forced to
code for generic string lengths.