[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

futex for Mach?

From: Roland McGrath
Subject: futex for Mach?
Date: Mon, 17 Feb 2003 00:50:41 -0800

When the new threads library that became NPTL began to materialize, I had
been envisioning adapting it with a genericized port using user-space
queues and IPC for wakeups.  That is how cthreads works, how Neal's
libpthread works, and how LinuxThreads works.  The NPTL library is quite
different, and relies entirely on Linux's new "futex" system call.

Lately I have been considering the contrary path of implementing the futex
calls for Mach.  There are two kinds of reasons to do this.

The obvious set of reasons boils down to "because Linux does".  That is, we
can port NPTL without a great deal of change and even use the optimized
assembly versions of the synchronization primitives with only slight change.

But the other reasons are that it seems like a useful facility.  The
clincher for me is that it makes process-shared pthread_mutex's in shared
mmap regions Just Work with no extra thought.  It's sufficiently clumsy in
Mach to be well nigh infeasible to do an inter-process mutex implementation
with real blocking (i.e. not just spin locks), since it has to be by IPC
and it require big hair to have a global identifier in the shared data
structure.  There are other cases of blocking, like sigsuspend/sigwait,
where I think that futex can be a more natural and convenient (and probably
efficient) way to implement it than using IPC as it has to now.

The futex inteface is a simple and light-weight kernel-provided blocking
and wakeup facility that associates a wait queue with any word in memory
without prior setup.  Any two threads sharing any one page can synchronize
with each other just by referring to an address within that page in futex
calls.  (Hence NPTL's mutex and cond implementions done purely with futex
can be shared with mmap and just work.)

The use of virtual memory as the global identifier for kernel objects is
similar to Fluke kernel mutexes.  However, futex requires no special setup
of the memory.  The kernel context for the synchronization object exists
only as long as some thread is blocked on a futex call using a particular

There are two operations: a "compare and block" and a wakeup.  The wakeup
call just gives an address and a minimum number of threads to wake up; it
wakes threads blocked in association with that same word of memory.  The
blocking call gives an address, a value for the word there, and optionally
a timeout.  If the word at the address matches the argument value, it
blocks until another thread does the wakeup call, it times out, gets a
signal, or a spurious wakeup is also permissible.  The comparison against
the memory word's value and blocking is atomic with respect to any wakeup;
i.e. changing the word and then calling wakeup will have no bad races with
a thread that checks the word and then does the compare and block futex call.

The main path of the implementation is easy.  You just follow the page
tables and use the physical address as the synchronization identifier.  The
tricky part is when a page gets paged out while threads are blocked using
the old physical page as an identifier.  When someone later pokes the page
back in it will probably be a different physical page and it needs to know
to wake up the threads that blocked using the old page as identifier.

The Linux implementation uses a system-wide, small hash table indexed by
task + virtual-address recording the interested threads.  The vm system
makes callbacks when a page table change affects an element in that table.
Threads blocking on futexes get this callback and rehash their waits using
the new physical page address.  (I'm not entirely convinced that the
current Linux code actually works right in all cases of memory sharing.
But that's neither here nor there.)

Mach's vm system is rather different.  But the Linux implementation I think
shows that a small global hash table is sufficient as far as scaling.  It
was a little counterintuitive to me, but really it only needs to scale with
the number of different locations on which threads are blocked.  I don't
know the vm system well and am just reading code to speculate on how to
implement this so far.  I am considering a hash table indexed by vm_page_t
and offset within the page.  Some place like vm_page_activate would check
the hash table and do the callbacks to wake up blocked threads.  This
approach doesn't handle mapping changes, but I am happy with that being
undefined behavior (changing the mapping out from under threads blocked in
a futex call).  This needs more thought.

What do people think?


reply via email to

[Prev in Thread] Current Thread [Next in Thread]