
Thread model (was: Ext2 superblock fault)

From: Olaf Buddenhagen
Subject: Thread model (was: Ext2 superblock fault)
Date: Tue, 11 Mar 2008 04:53:45 +0100
User-agent: Mutt/1.5.17+20080114 (2008-01-14)


On Sun, Mar 09, 2008 at 10:17:05PM -0400, Thomas Bushnell BSG wrote:
> On Mon, 2008-03-10 at 01:19 +0000, Samuel Thibault wrote:

> > This thread is syncing everything, i.e. asking a lot of writes,
> > which triggers the creation of a lot of threads.  Unfortunately the
> > superblock was paged out, so they all block on reading it.
> > Unfortunately, since in Debian there is a patch which limits the
> > number of created threads, the read of the superblock doesn't
> > actually create a new thread, that is delayed.  But since none of
> > the existing threads can progress (since they are all waiting for
> > the super block), things are just dead locked...
> As a general rule, the Hurd always assumes that an RPC can be handled;
> this is quite embedded in the way diskfs works.
> A patch which limits the number of threads is inherently buggy in the
> Hurd, and that patch MUST be disabled for anything to work properly.

I'm glad this discussion came up at last: This is a very serious issue,
and your input is necessary.

The real problem here is that the current thread model of the Hurd
servers is fundamentally broken. Creating a new kernel thread for each
incoming RPC just doesn't work. The number of incoming RPCs is
essentially unbounded, but kernel threads are a limited resource -- and
a quite expensive one, in fact.
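To illustrate the scaling problem, here is a minimal sketch of the
thread-per-RPC model -- not the actual Hurd/libports code, just an
illustration with made-up names (handle_rpc, dispatch_burst): every
request spawns its own kernel thread, so the thread count tracks the
request rate with no upper bound.

```c
#include <pthread.h>

/* Each incoming RPC gets its own kernel thread; nothing bounds the
   total.  A worker may then block on disk I/O for a long time. */
static void *handle_rpc(void *arg)
{
    /* ... service the request, possibly blocking on the store ... */
    return NULL;
}

/* Simulate a burst of n incoming RPCs; returns how many threads were
   actually created -- under real load, thread creation itself
   eventually fails when kernel resources run out. */
int dispatch_burst(int n)
{
    pthread_t tid;
    int created = 0;
    for (int i = 0; i < n; i++)
        if (pthread_create(&tid, NULL, handle_rpc, NULL) == 0) {
            pthread_detach(tid);
            created++;  /* one more kernel thread in flight */
        }
    return created;
}
```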

And this is not a theoretical problem, but a very real one: Under heavy
disk load, the filesystem server easily created hundreds, even thousands
of threads. During large compile jobs, for example, the Hurd was
crashing regularly with zalloc panics. And the problem became more and
more pressing as machines got faster and disk space usage grew.

Another easy way to reproduce the problem is to create a process with
many children, each opening /dev/null many times (the number of open
files per process is limited, thus we need many children), and then
killing all the children quickly (e.g. by terminating the parent). With
some 30000 or so total open ports, this is guaranteed to result in a
zalloc panic -- the null server is hammered with dead name
notifications, resulting again in thread explosion.
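The reproduction described above can be sketched roughly like this --
a hypothetical harness (the name hammer_null_server is mine), and of
course the zalloc panic itself only manifests on a Hurd system, so run
it there at your own risk:

```c
#include <fcntl.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn nchildren processes, each holding nopens descriptors to
   /dev/null, then kill them all at once.  On the Hurd, the mass death
   floods the null server with dead-name notifications, one per open
   port.  Returns the total number of ports that just went dead. */
int hammer_null_server(int nchildren, int nopens)
{
    pid_t pids[nchildren];

    for (int i = 0; i < nchildren; i++) {
        if ((pids[i] = fork()) == 0) {
            for (int j = 0; j < nopens; j++)
                open("/dev/null", O_RDONLY);  /* one port per open */
            pause();                          /* hold them until killed */
            _exit(0);
        }
    }
    sleep(1);                    /* let the children finish opening */
    for (int i = 0; i < nchildren; i++)
        kill(pids[i], SIGKILL);  /* all ports die at once */
    for (int i = 0; i < nchildren; i++)
        waitpid(pids[i], NULL, 0);
    return nchildren * nopens;
}
```

With some 30000 total opens (e.g. 300 children times 100 opens each),
this reliably produced the thread explosion.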

When these issues became known, Sergio Lopez implemented a hack that
simply limits the number of threads created by each single server to a
fixed amount. When more RPCs come in, they are not handled until one of
the existing threads becomes free.
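In essence, the hack amounts to something like the following sketch
(illustrative names, not the actual patch): a counter guarded by a
mutex, with requests beyond the limit simply parked on a condition
variable.

```c
#include <pthread.h>

#define MAX_THREADS 8  /* fixed limit per server (value illustrative) */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t worker_free = PTHREAD_COND_INITIALIZER;
static int active_threads = 0;

/* Called for each incoming RPC; blocks once the limit is reached.
   This is exactly where the deadlock lurks: if all MAX_THREADS workers
   are waiting for an event that only a *new* worker could deliver
   (e.g. the superblock read), nobody ever gets free. */
void acquire_worker(void)
{
    pthread_mutex_lock(&lock);
    while (active_threads >= MAX_THREADS)
        pthread_cond_wait(&worker_free, &lock);  /* request delayed here */
    active_threads++;
    pthread_mutex_unlock(&lock);
}

/* Called when a worker finishes handling its RPC. */
void release_worker(void)
{
    pthread_mutex_lock(&lock);
    active_threads--;
    pthread_cond_signal(&worker_free);
    pthread_mutex_unlock(&lock);
}

int active_workers(void)
{
    return active_threads;
}
```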

While this patch did wonders for stability under load, and no problems
showed at first, it immediately struck me as having the potential to
cause deadlocks. When I pointed that out, Sergio made a slight
modification: Rather than completely disabling the creation of new
threads once the limit is reached, each further thread becomes active
only after idling for two seconds -- so additional threads get created,
but very slowly.

I was never happy with that solution, and suggested a more adaptive
approach: Keep track of the existing threads, and if none of them makes
progress within a certain amount of time (say 100 ms), allow creating
some more threads. But that was never implemented. Also, it still might
cause considerable delays in some situations; and I'm not even sure it
would fix all problems. (I didn't fully understand the problem discussed
in this thread, so I don't know whether it would be fixed by that.)
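The adaptive idea could look something like this sketch (hypothetical
names; a real version would need proper locking and would run the check
from a monitor thread every 100 ms):

```c
static unsigned long progress = 0;  /* bumped whenever a worker completes an RPC */
static int thread_limit = 8;        /* current cap on worker threads */

/* Workers call this after finishing each request. */
void note_progress(void)
{
    progress++;
}

/* Run periodically (say every 100 ms): if no RPC completed since the
   last check, assume all workers are blocked and raise the limit so
   one more thread may be created.  Returns 1 if the limit was raised. */
int monitor_step(unsigned long *last_seen)
{
    if (progress == *last_seen) {
        thread_limit++;     /* nobody made progress: allow another thread */
        return 1;
    }
    *last_seen = progress;  /* progress observed: leave the limit alone */
    return 0;
}
```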

And anyway, it still would be just an ugly workaround. The real solution
here, of course, is to fix the thread model -- using some kind of
continuation mechanism: Have a limited number of threads (ideally one
per CPU) handle incoming requests. Whenever some operation would require
blocking on some event (in the case of diskfs, waiting for the
underlying store to finish reading or writing), the state is instead
saved to a list of outstanding operations, and the thread goes on
handling other requests. Only when the event completes do we read the
state back and continue handling the original request.

Of course, that would be a major change; it requires modifying
considerable parts of the Hurd servers. But it seems the only way to
handle this properly. What do you think?

I wonder whether I should add this to the list of project ideas for

