Re: reentrant port table stuff
27 Aug 2001 00:16:41 +0200
Gnus/5.09 (Gnus v5.9.0) Emacs/21.0.102
[Finally picking up an olde thread]
Chris Cramer <address@hidden> writes:
> On Sat, Jul 28, 2001 at 02:43:15PM +0200, Marius Vollmer wrote:
> > Hmm, I don't think this is completely right. I think you can't
> > perform `extensive' operations like scm_must_malloc when interrupts
> > are disabled. Could you try to reformulate your patch so that only
> > `simple' operations are performed while interrupts are disabled?
> > Maybe mallocs.h can help to protect the entry struct until it is
> > safely stuffed into the table.
> Are there rules written down somewhere for when you should/shouldn't
> disable interrupts, and what you should/shouldn't do when they're
> disabled?
I don't think so, but we should definitely write them down. Interrupts
and asyncs etc. are mysterious to me as well, and I think some
fundamental things about them have changed, so rethinking the whole
issue and documenting it is in order.
The first major thing on the agenda is (in my view): do we want to
care about full POSIX thread support, or are we satisfied with mere
cooperative threads?
From an idealistic standpoint, I would say: within one address space,
cooperative threading is best since it leads to a vastly simpler
(low-level) programming model. The concept of disallowing context
switches is meaningful and a very cheap method to implement critical
sections. For us, this would mean that C code would not be
interrupted without it taking actions to allow it (i.e. invoking
SCM_TICK), and Scheme code could easily and cheaply prevent
interruptions by setting some global flag.
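A minimal C sketch of the cooperative model, to make the cost argument concrete. All names here (`tick`, `coop_disable_switches`, `coop_enable_switches`, `switch_to_next_thread`) are hypothetical, not Guile API, and the context switch itself is stubbed out to just count switches. C code can only lose the CPU at `tick()`, the analogue of SCM_TICK, and entering a critical section costs one increment.

```c
#include <assert.h>

/* Hypothetical cooperative-threading core; not Guile's actual API. */
static int switches_disabled = 0; /* > 0 while inside a critical section */
static int switch_pending = 0;    /* set when another thread wants the CPU */
static int switch_count = 0;      /* stub bookkeeping for the example */

/* Stub: a real scheduler would save this thread's context and resume
   another one.  Here we only record that a switch happened. */
static void switch_to_next_thread (void)
{
  switch_count++;
  switch_pending = 0;
}

/* Analogue of SCM_TICK: the only place where a context switch may occur. */
static void tick (void)
{
  if (switch_pending && switches_disabled == 0)
    switch_to_next_thread ();
}

/* Entering a critical section is a plain increment, no lock needed. */
static void coop_disable_switches (void)
{
  switches_disabled++;
}

static void coop_enable_switches (void)
{
  assert (switches_disabled > 0);
  switches_disabled--;
}
```

A switch requested while switches are disabled is not lost in this sketch; it stays pending and happens at the first tick after the section ends.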
[There is the problem of non-cooperating threads. That is, some code
could be written in a way that never gives the CPU to the next thread.
That is a bug, since code that runs in one address space is expected
to cooperate. We could have a way of killing off such threads
via a special low-level signal handler. For example, when SIGUSR1
(or some other configurable signal) is received, the process issues
a message and starts counting context switches. When the next
SIGUSR1 is received and there haven't been any such context
switches, it offers to kill the current thread.]
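The watchdog idea could look roughly like this. It is a sketch under several assumptions: a hypothetical scheduler calls `note_context_switch` on every switch, and instead of keeping a real count it only records whether any switch happened since the previous signal. The handler merely sets `volatile sig_atomic_t` flags, since printing a message or killing a thread from a signal handler is not async-signal-safe; some non-handler code would poll `offer_kill` and do the actual talking and killing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical watchdog state, shared with the signal handler. */
static volatile sig_atomic_t watchdog_armed = 0;     /* first SIGUSR1 seen */
static volatile sig_atomic_t switched_since_arm = 0; /* set by the scheduler */
static volatile sig_atomic_t offer_kill = 0;         /* polled outside the handler */

/* The (hypothetical) scheduler calls this on every context switch. */
static void note_context_switch (void)
{
  switched_since_arm = 1;
}

static void watchdog_handler (int sig)
{
  (void) sig;
  if (!watchdog_armed)
    {
      /* First signal: start watching for context switches. */
      watchdog_armed = 1;
      switched_since_arm = 0;
    }
  else if (!switched_since_arm)
    offer_kill = 1;          /* no switch since the last signal: stuck */
  else
    switched_since_arm = 0;  /* progress was made: keep watching */
}

static int install_watchdog (void)
{
  struct sigaction sa;
  memset (&sa, 0, sizeof sa);
  sa.sa_handler = watchdog_handler;
  sigemptyset (&sa.sa_mask);
  return sigaction (SIGUSR1, &sa, NULL);
}
```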
If true multiprocessing is desired (parallel execution on multiple
CPUs), it should take place in separate address spaces, and
communication should take place via explicit mechanisms. This can be
message passing, even involving shared memory to save on copying, but
the two processes should not assume that they can willy-nilly access
each other's data structures.
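A small POSIX sketch of explicit communication between separate address spaces: `fork` gives the child its own copy of the data, and the only thing that flows back is one message over a pipe. The name `parallel_sum` and its -1 error convention are made up for illustration; a real setup would use more workers and perhaps shared memory for bulk data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static long sum_range (const long *v, size_t lo, size_t hi)
{
  long s = 0;
  for (size_t i = lo; i < hi; i++)
    s += v[i];
  return s;
}

/* Sum v[0..n) with two processes.  The child sums the upper half in its
   own address space and sends the partial result back as a message.
   Returns 0 on success, -1 on any error (hypothetical convention). */
static int parallel_sum (const long *v, size_t n, long *result)
{
  int fd[2];
  if (pipe (fd) != 0)
    return -1;

  pid_t pid = fork ();
  if (pid < 0)
    {
      close (fd[0]);
      close (fd[1]);
      return -1;
    }

  if (pid == 0)
    {
      /* Child: compute the upper half and write exactly one message. */
      close (fd[0]);
      long part = sum_range (v, n / 2, n);
      ssize_t w = write (fd[1], &part, sizeof part);
      close (fd[1]);
      _exit (w == (ssize_t) sizeof part ? 0 : 1);
    }

  /* Parent: compute the lower half, then receive the child's message. */
  close (fd[1]);
  long mine = sum_range (v, 0, n / 2);
  long theirs = 0;
  ssize_t r = read (fd[0], &theirs, sizeof theirs);
  close (fd[0]);

  int status;
  if (waitpid (pid, &status, 0) != pid || !WIFEXITED (status)
      || WEXITSTATUS (status) != 0 || r != (ssize_t) sizeof theirs)
    return -1;

  *result = mine + theirs;
  return 0;
}
```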
Having true multiprocessing in a shared address space seems overkill
to me. The issues range from sprinkling the code with mutexes to
bloated cache coherency protocols. I wouldn't be surprised if
not-entirely-correct critical sections came to haunt us like
overflowing sprintfs. I see true multiprocessing mainly for number
crunching, not as a tool for conveniently expressing concurrent
control logic. In serious number crunching, I think communication
between tasks needs to be tackled very explicitly anyway when carving
the parallelism out of your algorithms, so making it explicit in the
code will actually only help. (In fact, one could say that the
communication pattern _defines_ the parallelism.) There need to be
ways to achieve rich communication, of course, but I think the Lisps
are well equipped for that.
Ok, this is my current, general position towards multi-threading. I
guess I would be happy if the whole computing community would take
this stance and we wouldn't have to deal with POSIX threads. But they
don't, and being a library, we have to deal with that. Doing it right
means that every cons is a critical section, for example, which would
likely triple or quadruple its instruction count. But this is hardly
prohibitive. We could install two versions of libguile, one that is
pthread safe, and one that is merely coop-thread safe. But can we
identify all critical sections?
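To illustrate why every cons becomes a critical section under preemptive threads, here is a hypothetical freelist allocator (not Guile's real heap). Popping a cell reads and then writes a shared pointer, so two unsynchronized threads could receive the same cell. The sketch uses a C11 `atomic_flag` spinlock as a stand-in for a pthread mutex, and a compile-time `THREAD_SAFE` switch (an assumed name) to mimic building a pthread-safe and a coop-only library from one source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical cell heap; not Guile's real representation. */
typedef struct cell { struct cell *car, *cdr; } cell;

#define HEAP_CELLS 1024
static cell heap[HEAP_CELLS];
static cell *freelist = NULL;

#ifndef THREAD_SAFE
#define THREAD_SAFE 1 /* 0 gives the cheaper coop-only build */
#endif

#if THREAD_SAFE
static atomic_flag heap_lock = ATOMIC_FLAG_INIT;
#define HEAP_LOCK() do { } while (atomic_flag_test_and_set (&heap_lock))
#define HEAP_UNLOCK() atomic_flag_clear (&heap_lock)
#else
#define HEAP_LOCK() ((void) 0)
#define HEAP_UNLOCK() ((void) 0)
#endif

static void init_heap (void)
{
  freelist = NULL;
  for (size_t i = 0; i < HEAP_CELLS; i++)
    {
      heap[i].cdr = freelist;
      freelist = &heap[i];
    }
}

/* The freelist pop is the critical section: without the lock, two
   threads could both read the same head and return the same cell. */
static cell *cons (cell *car, cell *cdr)
{
  HEAP_LOCK ();
  cell *c = freelist;
  if (c != NULL)
    freelist = c->cdr;
  HEAP_UNLOCK ();

  if (c == NULL)
    return NULL; /* a real system would run the collector here */
  c->car = car;
  c->cdr = cdr;
  return c;
}
```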
So what are the interrupts that we are disallowing? I think it used
to be that a Unix signal (like SIGINT) could interrupt the current
function at any time and cause the flow of control to do arbitrary
things (like invoking continuations, or capturing the current
continuation and returning some time later, even multiple times). You
either had to write code that could deal with this, or use
SCM_DEFER_INTS / SCM_ALLOW_INTS.
This is no longer true, since signals do not immediately change the
flow of control. Instead they `mark' an `async' which gets `run' at
the next SCM_TICK (even if ints are disabled???). SCM_TICK is
currently called in the evaluator and in `equal?', and would probably
be called at certain places in compiled code.
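The newer scheme, reduced to its core: the signal handler only marks an async, and the async's code runs at the next tick. This is a sketch with hypothetical names (`struct async`, `mark_sigint_async`, `tick`), not Guile's actual async implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical async: a mark flag plus the code to run. */
struct async {
  volatile sig_atomic_t marked;
  void (*thunk) (void);
};

static int interrupts_handled = 0;
static void handle_interrupt (void) { interrupts_handled++; }

static struct async sigint_async = { 0, handle_interrupt };

/* The handler does not change the flow of control; it only marks. */
static void mark_sigint_async (int sig)
{
  (void) sig;
  sigint_async.marked = 1;
}

/* Analogue of SCM_TICK: run a marked async at a safe point. */
static void tick (void)
{
  if (sigint_async.marked)
    {
      sigint_async.marked = 0;
      sigint_async.thunk ();
    }
}

static int install_handler (int signum)
{
  struct sigaction sa;
  memset (&sa, 0, sizeof sa);
  sa.sa_handler = mark_sigint_async;
  sigemptyset (&sa.sa_mask);
  return sigaction (signum, &sa, NULL);
}
```

In this sketch, several signals arriving before the next tick collapse into a single run of the thunk.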
With pthreads, there is no such concept of deferring interrupts to
implement a critical section, since pthreads allow true concurrency.
In addition to setting scm_ints_disabled, SCM_DEFER_INTS and
SCM_ALLOW_INTS also enter/leave a critical section, which is
presumably implemented as a mutex or semaphore for pthreads.
Ok, what do we make of this? First, I think we should remove old
cruft, like the code covered by GUILE_OLD_ASYNC_CLICK, and streamline
the async implementation. Then, SCM_ALLOW_INTS and SCM_DEFER_INTS are
no longer appropriately named. They mark critical sections, and
disabling interrupts does not save you from parallel access to global
data structures. We should rename them to something else or start
using SCM_CRITICAL_SECTION_START / SCM_CRITICAL_SECTION_END instead.
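A hypothetical expansion of renamed critical-section macros for a pthreads build, shown protecting a small global table. `CRITICAL_SECTION_START`/`CRITICAL_SECTION_END`, `port_table` and `port_table_add` are illustrative names, not Guile's definitions. The point is that the macros take a real mutex, which excludes other threads; merely setting a flag like scm_ints_disabled does not.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t critical_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Named for what they do: exclude other threads. */
#define CRITICAL_SECTION_START() \
  do { int rc_ = pthread_mutex_lock (&critical_mutex); \
       assert (rc_ == 0); (void) rc_; } while (0)
#define CRITICAL_SECTION_END() \
  do { int rc_ = pthread_mutex_unlock (&critical_mutex); \
       assert (rc_ == 0); (void) rc_; } while (0)

#define TABLE_SIZE 16
static void *port_table[TABLE_SIZE];
static int port_table_count = 0;

/* The entry is allocated by the caller beforehand; inside the section
   only simple stores happen, nothing that can fail or throw.
   Returns the slot used, or -1 if the table is full. */
static int port_table_add (void *entry)
{
  int slot = -1;
  CRITICAL_SECTION_START ();
  if (port_table_count < TABLE_SIZE)
    {
      slot = port_table_count;
      port_table[port_table_count++] = entry;
    }
  CRITICAL_SECTION_END ();
  return slot;
}
```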
Then, the thing you wanted to know all along: what is one allowed to
do inside such a critical section? I'd say, almost nothing. No call
to any scm_ function except those that are specifically marked to be
OK. Maybe we can divide the scm_ functions and macros into two
classes (supported by a naming convention): one class can only be used
outside of critical sections, the other can _only_ be used inside
SCM_DEFER_INTS / SCM_ALLOW_INTS (or what becomes of them).
SCM_NEWCELL would be a thing that is only valid within a critical
section, while scm_must_malloc likely can only be used outside.
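One hypothetical way to make the two classes checkable at run time: the section macros keep a per-thread nesting depth (C11 `_Thread_local`), and functions named with an `_outside` or `_inside` suffix assert on it in debug builds. The suffix convention and all names are assumptions for illustration; the actual lock is left out to keep the focus on the checks.

```c
#include <assert.h>
#include <stdlib.h>

/* Per-thread nesting depth of critical sections. */
static _Thread_local int critical_depth = 0;

/* Lock omitted in this sketch; only the depth is tracked. */
#define CRITICAL_SECTION_START() (critical_depth++)
#define CRITICAL_SECTION_END() (assert (critical_depth > 0), critical_depth--)

/* "_outside" class: may allocate, block, or signal errors. */
static void *alloc_outside (size_t n)
{
  assert (critical_depth == 0 && "alloc_outside inside a critical section");
  return malloc (n);
}

/* "_inside" class: only simple stores that cannot fail. */
static void link_inside (void **slot, void *obj)
{
  assert (critical_depth > 0 && "link_inside outside a critical section");
  *slot = obj;
}
```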
Or rather, no errors can be thrown while in a critical section, so
SCM_NEWCELL would be out, as well. Or, SCM_NEWCELL could start a
critical section of its own, so that it either signals an out of
memory error, or enters a critical section. It is probably best to
remove SCM_NEWCELL from the public API completely, only presenting
scm_cons, scm_make_smob, etc.
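The last option, a cell allocator that either reports an error or returns inside a critical section, might be sketched like this. `newcell_enter`, `newcell_leave` and `refill_freelist` are hypothetical names, and the error is a return code rather than a thrown Scheme error. Everything that can fail (here, `malloc` while refilling the freelist) runs with the lock released; only pointer manipulation happens while it is held.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct cell { struct cell *car, *cdr; } cell;

static pthread_mutex_t heap_mutex = PTHREAD_MUTEX_INITIALIZER;
static cell *freelist = NULL;

/* May fail, so it allocates without holding the lock and only takes the
   lock to link the new cells in.  Segments are never freed here. */
static int refill_freelist (size_t ncells)
{
  cell *seg = malloc (ncells * sizeof *seg);
  if (seg == NULL)
    return -1;
  pthread_mutex_lock (&heap_mutex);
  for (size_t i = 0; i < ncells; i++)
    {
      seg[i].cdr = freelist;
      freelist = &seg[i];
    }
  pthread_mutex_unlock (&heap_mutex);
  return 0;
}

/* Returns -1 (out of memory, lock NOT held), or 0 with the lock held
   and *out pointing at a fresh cell.  The caller initializes the cell
   and then calls newcell_leave. */
static int newcell_enter (cell **out)
{
  for (;;)
    {
      pthread_mutex_lock (&heap_mutex);
      if (freelist != NULL)
        {
          *out = freelist;
          freelist = freelist->cdr;
          return 0; /* still holding the lock */
        }
      pthread_mutex_unlock (&heap_mutex);
      if (refill_freelist (64) != 0)
        return -1; /* error reported outside any critical section */
    }
}

static void newcell_leave (void)
{
  pthread_mutex_unlock (&heap_mutex);
}
```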
I think that we need to address memory allocation in general. Code
like
    char *mem1 = scm_must_malloc (100, "mem1");
    char *mem2 = scm_must_malloc (100, "mem2");
    SCM_RETURN_NEWSMOB2 (..., mem1, mem2);
is wrong, since a failure to allocate mem2 would leak mem1. One should
instead write something like
    SCM mem1 = scm_malloc_obj (100);  /* memory owned by a GC'd object */
    SCM mem2 = scm_malloc_obj (100);
    init (SCM_MALLOCDATA (mem1));     /* init: placeholder initializer */
    init (SCM_MALLOCDATA (mem2));
    SCM smob;
    SCM_NEWSMOB2 (smob, ...,
                  SCM_MALLOCDATA (mem1), SCM_MALLOCDATA (mem2));
    /* ownership has moved to smob; stop the GC from freeing the data */
    SCM_SETMALLOCDATA (mem1, NULL);
    SCM_SETMALLOCDATA (mem2, NULL);
so that the memory is always known to the GC and can be freed when the
above code sequence is aborted due to non-local exit.
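The same ownership-transfer idea can be sketched in plain C without a GC. Here a hypothetical registry `owned` plays the role of memory the GC knows about, `longjmp` plays the non-local exit, and `release_owned` corresponds to the SCM_SETMALLOCDATA (..., NULL) step. The `fail_after` hook exists only to inject an allocation failure for testing; none of these names are Guile API.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

/* Hypothetical registry standing in for "memory known to the GC". */
#define MAX_OWNED 8
static void *owned[MAX_OWNED];
static int n_owned = 0;

static jmp_buf error_jmp;   /* target of the non-local exit */
static int fail_after = -1; /* test hook: 0 makes the next allocation fail */
static int live_blocks = 0; /* blocks currently allocated */

/* Like scm_malloc_obj: the block is registered before anyone sees it,
   and failure is a non-local exit rather than a return value. */
static void *owned_malloc (size_t n)
{
  if (fail_after == 0 || n_owned == MAX_OWNED)
    longjmp (error_jmp, 1);
  if (fail_after > 0)
    fail_after--;
  void *p = malloc (n);
  if (p == NULL)
    longjmp (error_jmp, 1);
  live_blocks++;
  owned[n_owned++] = p;
  return p;
}

/* Error path: free everything still registered. */
static void free_owned (void)
{
  while (n_owned > 0)
    {
      free (owned[--n_owned]);
      live_blocks--;
    }
}

/* Like SCM_SETMALLOCDATA (mem, NULL): the new object now owns the data. */
static void release_owned (void)
{
  n_owned = 0;
}

struct pair_obj { char *a, *b; }; /* stands in for the smob */

static int make_pair_obj (struct pair_obj *out)
{
  if (setjmp (error_jmp) != 0)
    {
      free_owned ();
      return -1;
    }
  char *a = owned_malloc (100);
  char *b = owned_malloc (100); /* failing here no longer leaks a */
  out->a = a;
  out->b = b;
  release_owned ();
  return 0;
}
```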
As to the port table race condition, I frankly don't know why Chris'
fix is effective. SCM_DEFER_INTS should not in fact be able to
inhibit thread switches since scm_ints_disabled is not checked
anywhere (except in take_signal, where it is irrelevant). I'll have a
look.
Summary: we need to clean up the async implementation and rethink the
critical section issue. I have put that on my little to-do list, but
I want to deal with GH first.
Not a really helpful answer, I'm afraid...