copy on write strings - some intermediate thoughts

guile-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
copy on write strings - some intermediate thoughts

From:	Dirk Herrmann
Subject:	copy on write strings - some intermediate thoughts
Date:	Thu, 16 Nov 2000 11:59:36 +0100 (MET)
Hello everybody.

I'd like to inform you about some new aspects with regards to the
implementation of copy-on-write strings for guile.  In advance I can tell
you, that adding such a beast to guile would require to change a lot of
code and also to change the string API.

OK, now to some details:

* Sharing parts of strings and symbols requires to split up the current
  string and symbol objects into two parts:  One part that holds the
  characters (let's call this the memory-region), and one part that
  determines which characters in the memory region form a string or
  symbol.  Several strings and symbols can share parts of the memory
  region.  The memory-region does not only represent the field of
  characters, but - in order to be able to implement an efficient
  copy-on-write strategy - also stores additional information like how
  many client objects refer to the memory region etc.  Further, the
  memory-region can hold flags like whether the field of characters is
  read only or such.

This means, introducing implicitly shared strings will add at least one
level of indirection:  string-object --> memory-region.  But, it may be
beneficial to add another level of indirection in the context of
threading, as I will try to explain:

* A string object has to store the following attributes:  length,
  the memory-region object and the char* that points to the first
  character of the string within the memory region.  It is necessary to
  store both the reference to the memory region as well as to the
  character within the memory region:  To modify the characters, it
  has first to be checked that the characters are not shared between
  different strings or symbols.  (Otherwise a copy has to be created
  first, which then is modified - this is the idea of copy-on-write.)
  This information, however, is stored in the memory-region.

The length of a string is never modified.  But:  The memory-region as well
as the char* will be modified if a copy-on-write is actually performed.
Moreover, the memory-region and the char* for a pair of values which must
always match:  The char* does always point into the field of characters of
the memory-region.  Thus, reading these two values as well as modifying
them has to be done as an atomic operation.  It would be fatal if a thread
would for example first read the memory-region reference, then a second
thread would preempt that thread, modify the string object causing a
copy-on-write, and after that the first thread would read the char*, which
in this case already pointed into the _new_ memory-region.

One solution to this problem is to use a mutex when reading the value pair
memory-region/char*.  However, this would have to be done for both read
and write accesses which means a performance penalty for almost every
string operation:  One function call to obtain a mutex, another one to
release it.

A different solution is to use another level of indirection by introducing
what I will call the 'string-content' object:  The string-content object
holds the memory-region/char* pair, and the string-object itself then only
stores the string length and the reference to the string-content object:
string-object --> string-content-object --> memory-region

With this strategy, accessing the memory-region/char* pair of a string is
done by first accessing the string-content-object.  The
string-content-object always holds a matching memory-region/char* pair and
is never changed after its creation.  Thus, parallel threads can never
corrupt the string-content-object.  Sequential accesses to memory-region
and char* will always match when it is assured that they use the same
string-content object.

Further, for a copy-on-write a new memory-region as well as a new
string-content object would be created.  After that has been done,
changing the string object to point to the freshly created string-content
object therefore is an atomic update of the memory-region/char* pair.

The advantage of this additional indirection is, that for read accesses no
mutex has to be obtained.  The disadvantage is, that a single string
object will then need three different objects for its representation:  
string-object, string-content-object and memory-region.

However, if there will only be a single thread, the use of an intermediate
string-content object is unnecessary.  Thus, if guile is configured
without threading, this indirection should be avoided.  The API, however,
should allow to implement string handling code in a way that unifies both
approaches.  An example is given below:

SCM
scm_opendir (SCM dir)
{
  DIR *ds;
  SCM content;

  SCM_VALIDATE_STRING (dir);
  scm_string_coerce_0termination_x (dir);

  content = SCM_STRING_CONTENT (dir);
  chars = SCM_STRING_CONTENT_CHARS (content);
  SCM_SYSCALL (ds = opendir (chars));
  scm_remember_object (content);  /* protect content till here */

  ... /* create directory-port from ds. */
}

The macros SCM_STRING_CONTENT and SCM_STRING_CONTENT_CHARS can be defined
differently for threading / non-threading:

Threading:
#define SCM_STRING_CONTENT(x) SCM_CELL_OBJECT_1 (x)
#define SCM_STRING_CONTENT_CHARS(x) SCM_CELL_WORD_1 (x)

No threading:
#define SCM_STRING_CONTENT(x) (x)
#define SCM_STRING_CONTENT_CHARS(x) SCM_CELL_WORD_1 (x)

Thus, in the non-threaded case there is actually no string-content
object.  The string-object itself holds all the attributes.  The macro
SCM_STRING_CONTENT just returns the string-object itself.  In contrast, in
the threaded case the macro SCM_STRING_CONTENT actually returns a
string-content object.


It should have become obvious that with the above suggestions string
handling code becomes more complicated.  Further, memory requirements
increase from currently one heap cell per string to two or even three heap
cells per string.  I can't tell whether the performance gain will be large
enough to justify the increased memory requirements and code complexity,
but this is something we can easily figure out.  We might even make the
whole copy-on-write thing configurable such that only one cell per string 
is used (as it is currently done), but this does not safe us from the
need to change the string handling code to the most general form.

Best regards
Dirk
[Prev in Thread]
Current Thread
[Next in Thread]
copy on write strings - some intermediate thoughts, Dirk Herrmann <=
Prev by Date: [patch] libguile/eval.c
Next by Date: Re: [patch] libguile/eval.c
Previous by thread: [patch] libguile/eval.c
Next by thread: [patch] guile-readline/configure.in
Index(es):
- Date
- Thread