[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: string-map arg order

From: Dirk Herrmann
Subject: Re: string-map arg order
Date: Tue, 4 Sep 2001 23:26:11 +0200 (MEST)

On 3 Sep 2001, Alex Shinn wrote:

> >>>>> "Dirk" == Dirk Herrmann <address@hidden> writes:
> Ah, I was asking if I should keep hacking or if we are back to
> discussion... looks like the debate is revived :)
>     Dirk> The proposal is nice as long as you only have one thread.
>     Dirk> In such a context, Jim's idea about the scm_mb_cache type
>     Dirk> are nice.  However, when it comes to multithreading, you
>     Dirk> cannot guarantee that your cache is valid, except you use a
>     Dirk> mutex for the whole lifetime of a scm_mb_cache.
> The same applies to the idea of multiple fixed-width string
> types... you may have to upgrade a 1-byte wide string to a 2 or 4 byte
> wide string.  The only way you can avoid this is by making everything
> utf-32, but then all strings and symbols would use 4 times as much
> space.  Another huge disadvantage of utf-32 is that almost nothing
> else uses it, so you'd have to continuously convert strings passed to
> or returned from external library functions.  All C wrappers would
> have to change their current interface to use special accessors to
> Guile strings, which would be terribly complicated if you want to
> write a mutator.  utf-8 does not avoid all these problems (some
> current applications may be using latin-1 chars), and as you point out
> may have performance concerns, but since one of the main goals of
> Guile is strong interoperability with C, utf-8 seems like the only
> viable solution.

It is true, that changing a fixed width string can involve changing it
from 1-byte width to 2 or 4 byte width.  And when accessing characters in
such a system with multiple threads running, you would also have to be
aware that such a change could happen at any moment.  Thus, again you
would need to recompute a character's position with every
access.  Therefore, both with fixed-width string types and variable-width
string types the recomputation is necessary with _every_ character
access.  The difference is, that the recomputation is cheap for fixed
width types, and expensive for variable-width string types.

Further, you would not start by making everything utf-32.  Rather, you
would start with a 1-byte width and only increase width as necessary,
which is at most 2 times:  1->2->4.  With a variable width encoding, you
may have to increase the size (n * (m-1)) times, n being the string
length, m being the maximum character width.  Further, with a variable
width encoding, the function make-string _requires_ that you initialize
the allocated string with characters, because otherwise there may be an
inconsistency between the string length and the number of characters in
the string.  With a fixed width representation, make-string does not have
to do any initialisation.

>     Dirk> One of the arguments for a variable width encoding has been,
>     Dirk> that mutations are rare and that these copy operations
>     Dirk> (which are necessary if you replace a character with one
>     Dirk> that has a longer encoding) would therefore be seldom.
>     Dirk> However, please note that still for accessor functions like
>     Dirk> string-ref and substring you are working with integer
>     Dirk> indices.  In a multi thread environment, the mapping between
>     Dirk> character index and memory address of that character may
>     Dirk> change at any moment.  Even if such mutations are rare, you
>     Dirk> don't know _when_ they actually happen in a parallel thread.
>     Dirk> That is, _every_ call to string-ref has to recompute the
>     Dirk> actual character position.  The same with the character
>     Dirk> indices given to substring.
> There's still some room for optimizations.  If you're not running
> multi-threaded, or the string is not shared, then there's no problem.
> Even if you are, you can set a dirty bit when a string is mutated.
> Alternately, you make shared memory strings copy-on-write into
> non-shared memory (unless they've been set up with explicit mutex).
> Or make string mutexes implicit.

If the string is only accessed from a single thread, there's no problem,
true.  But:  given a string object you can't tell which threads have
access to it.  The implementation of string-ref therefore has to assume
that there _might_ be accesses from other threads.  Certainly, you could
provide a function 'string-ref-single-thread' which does no such checks,
but I doubt that this is the way to go.

And, please note:  My argument is _completely_ independent of the question
whether there are different strings sharing the same memory.  The problem
already exists if there is a single string that is accessed from two
different threads:  Thread A accesses the string, and thread B mutates

I don't quite understand your point about a 'dirty bit':  This does not
work without using mutecis anyway, so what is the difference?  (Note that
all my arguments against the variable width encoding are focused at the
multi-threaded environment.  In a single-threaded environment, things are
much simpler and in such a case a variable width encoding is quite nice.  
We are aiming, however, at providing support for multiple threads, even
preemptable threads like posix threads.  And, IMO it makes sense to take
this into consideration when a new string interface is designed.  
Otherwise, we may have to completely redesign it again later.

> IMO, we should not be optimizing for string mutations, and should in
> fact try to offload as much of the work involved in our solution into
> the mutators.  One of the observations in the previous discussion was
> that the current string primitives encourage treating strings as
> random access, and we would be better off with higher-level primitives
> that treated strings in a more opaque, functional manner.  Since then
> we've gotten srfi-13 with procedures such as string-map and
> string-fold which do just that, and could be modified to handle
> caching and threading issues implicitly.

Well, I don't claim we should optimize for string mutations.  But,
again:  Since string mutations can happen _anytime_ (in a parallel
thread) this influences the performance of all _accessors_.

>     Dirk> IMO, given a variable-width encoding, effective string
>     Dirk> handling in a multi-threaded environment is impossible on
>     Dirk> the scheme level.  A simple loop like
>     Dirk> (do (i 0 (+ i 1)))
>     Dirk>     ((= i (length s)))
>     Dirk>   (if (eqv? (string-ref s i) #\a)
>     Dirk>       (display "foo\n")))
>     Dirk> would work in O(n*n) time complexity where n is the length
>     Dirk> of s.
> True.  And since implicit caching at the string level is probably
> impractical, this would be expected to be slow regardless of
> threading.  But in a variable-width encoding scheme, it would simply
> be a given that random access for strings is slow, and people should
> avoid string-ref.  It would have been much cleaner to have written
> (string-for-each (lambda (c) (if (eqv? c #\a) (display "foo\n"))) s)
> to begin with.

I admit that the example is simplistic.  But, with multiple threads you
don't gain anything by using string-for-each either, except you were
inhibiting mutations to that string throughout the execution of
string-for-each by using some mutex.  This, however, does not work since
it would inhibit nesting these constructs.  The same argument holds for
the idea of using one mutex per string and having it locked during
string-map, string-fold or string-for-each, since you could not iterate
over the same string in a nested manner.

Friendly regards
Dirk Herrmann

reply via email to

[Prev in Thread] Current Thread [Next in Thread]