EROS/Coyotos address spaces

From: Jonathan S. Shapiro
Subject: EROS/Coyotos address spaces
Date: Wed, 19 Oct 2005 15:42:12 -0400

Espen asked me to explain how the wrappers work in EROS address spaces.
To do this I need to explain how address spaces work.


Remember that EROS is transparently persistent. The paging mechanism is
independent of the fault handling logic, and is completely invisible to
most applications (some real-time apps need to know).


EROS does not have MAP/UNMAP, and it does not have anything that can be
directly compared to a mapping database. In EROS, an address space is
described by a tree of Nodes. In its simplest form, this tree looks
very similar to a conventional hierarchical page table. Ignoring a long
list of optimizations and details that are not very relevant for now,
there are two important differences:

  1. A node holds capabilities. A page table holds page table entries.
  2. A node is 32 slots. Page tables are usually wider.

The "leaf" objects in both trees are pages.

Every EROS process structure contains a capability called its "address
space capability". This names the top node of the process address space.
When a page fault occurs, the kernel walks the node tree, computes the
permissions and validity of the path, inserts the corresponding page
table entries, and resumes the process.
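
As a rough model of this walk (Python; every name here is invented for
illustration, not EROS source), each level of the node tree consumes 5
address bits, since a node has 32 = 2^5 slots, with pages at the leaves:

```python
# Illustrative model only: names and structure are invented, not EROS source.
SLOT_BITS = 5            # 32 slots per node => 5 address bits per level
SLOTS = 1 << SLOT_BITS
PAGE_BITS = 12           # assumed 4 KiB pages

class Page:
    def __init__(self):
        self.data = bytearray(1 << PAGE_BITS)

class Node:
    def __init__(self):
        self.slots = [None] * SLOTS   # each slot holds a capability (or None)

def translate(top, addr, levels):
    """Walk the node tree from the top node, consuming 5 bits per level.
    Returns the Page at the leaf, or None if no valid path exists."""
    cur = top
    for level in reversed(range(levels)):
        if not isinstance(cur, Node):
            return None
        index = (addr >> (level * SLOT_BITS + PAGE_BITS)) & (SLOTS - 1)
        cur = cur.slots[index]
    return cur if isinstance(cur, Page) else None
```

A two-level tree then maps an address the way a two-level page table
would; the kernel's real walk additionally computes permissions along
the path and emits hardware page table entries.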

When a valid page table entry is constructed, a "dependency entry" is
recorded between the page table entry and *every node slot that was
traversed to build it*. If a node slot is altered (or the node is
revoked), the associated page table entries are invalidated. EROS is
always free to throw away page tables (and does, in practice) -- the
"official" state of an address space lives in the nodes. The page tables
and the dependency entries together implement an MMU-friendly cache of
the node paths. The traversal is performed in such a way that hardware
page tables are shared in every situation where this can be done

If no valid path through the node tree exists, the "nearest enclosing
keeper" is invoked. A "keeper" is EROS-speak for a fault handler. In
EROS, fault handlers are separate processes. It is possible to have a
fault handler that shares an address space with the subject process, but
we have never done it. Fault handlers are not privileged in any way.
They do not participate in anything comparable to the L4 pager.
At any point in the node tree, one may insert a wrapper node "in front"
of an existing subtree. This wrapper node may specify a start (entry)
capability in the CF slot, and set a control bit. The control bit
indicates that a "keeper" is defined by this wrapper.

   [If you need a reminder about wrappers:
    http://lists.gnu.org/archive/html/l4-hurd/2005-10/msg00340.html ]
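
As a sketch of the interposition (Python; names invented for
illustration, and "cf_slot"/"keeper_bit" merely stand in for the real
wrapper fields):

```python
# Illustrative sketch (invented names): interposing a wrapper node
# "in front" of an existing subtree in a parent node's slot.
class WrapperNode:
    def __init__(self, subtree):
        self.subtree = subtree     # capability to the wrapped subtree
        self.cf_slot = None        # may hold a start (entry) capability
        self.keeper_bit = False    # control bit: "a keeper is defined here"

    def set_keeper(self, start_cap):
        self.cf_slot = start_cap
        self.keeper_bit = True

def interpose(parent_slots, index, start_cap):
    """Wrap the subtree in parent_slots[index] and give it a keeper."""
    w = WrapperNode(parent_slots[index])
    w.set_keeper(start_cap)
    parent_slots[index] = w
    return w
```

Traversal passes through the wrapper transparently; the wrapper only
changes who gets invoked when a fault occurs below it.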

The "nearest enclosing keeper" is, by definition, the keeper specified
by the wrapper that (a) appears on the translation path to the faulting
page, and (b) is the *closest* such wrapper to the page. If no such
keeper exists in the address space tree, the fault is reported to the
per-process keeper. If no process keeper capability is present, the
faulting process is marked with a fault code and ceases to initiate

  [ EROS distinguishes memory faults from all other faults.
    Memory faults are reported according to the rules above.
    All other faults are reported directly to the process 
    keeper. ]
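
The selection rule above can be sketched as follows (Python; names
invented for illustration). The "no_call" flag anticipates the NC ("no
call") permission bit discussed under "Some Other Details" below:

```python
# Illustrative sketch (invented names): selecting the nearest enclosing
# keeper for a memory fault.
class Keeper:
    def __init__(self, name, no_call=False):
        self.name = name
        self.no_call = no_call   # NC ("no call") bit: keeper is not callable

def nearest_enclosing_keeper(path_keepers, process_keeper):
    """path_keepers: keepers found on the translation path, ordered from
    the top node down toward the faulting page (None where a wrapper
    defines no keeper).  Pick the callable keeper closest to the page;
    fall back to the per-process keeper."""
    for keeper in reversed(path_keepers):
        if keeper is not None and not keeper.no_call:
            return keeper
    return process_keeper   # may be None: process halts with a fault code
```
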

If the nearest enclosing keeper is unable or unwilling to handle the
fault, it can propagate the fault to the process keeper. We have
considered designs in which the entire hierarchy of keepers could be
visited, but we have never been able to demonstrate any practical value
for this.

** Some Other Details **

1. An EROS node capability encodes (node id, permissions, height). A
node subtree can legally be either "too big" or "too small" and there is
a sensible interpretation in each case. This allows us to remove
unnecessary nodes from the address space trees.

2. If you map a node subtree from an untrusted provider into your space,
the provider may have included a hostile keeper. If you fault inside
this space,
that keeper can steal your thread of control. There is a permission bit
NC ("no call") that prevents this. The precise specification is that the
fault is delivered to the nearest enclosing *callable* keeper.

** Deficiencies **

The big deficiency in this design is that the node structures define a
tree of fixed fanout (arity). In consequence, the only natural sharing
points occur at multiples of 32^k pages for k>=0.  A less important
deficiency is that the encoding density is less than we would like.


In Coyotos we will not use Nodes for the memory map, and we will not
use wrappers to specify keepers. We will instead use PATTs. A PATT
is a data structure of the form

   h = height of this tree (address length in bits)
   r = height of every immediate child tree
   ( pattern, cap )[16] -- capabilities to children

As with nodes, the address space is a tree of PATTs with pages at the
leaves. However, the PATT design is closer to Liedtke's GPT design. The
interesting part is the vector of child capabilities, which is **fully
associative**. The rules for patterns are:

  A pattern is matchable if 2^r <= pattern <= 2^h
  All patterns in a PATT must be unique

To traverse a PATT, we have an address A that is being translated.

   A >= 2^h                            ==> invalid address
   else if (A & ~(2^r-1)) matches some pattern then
        if access rights sufficient    ==> traverse through that slot
        else                           ==> permissions violation
   else                                ==> invalid address

The unmatchable pattern "1" is used to indicate a keeper capability. The
unmatchable pattern "-1" is used to indicate an unused slot.
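
Putting the rules together, a single-level PATT lookup might look like
this (Python; names invented for illustration, rights checks omitted,
and a plain signed sort stands in for the unsigned sort the design
specifies):

```python
# Illustrative sketch (invented names) of a fully associative PATT lookup.
# Sentinels per the text: pattern 1 marks the keeper slot, -1 an unused slot.
KEEPER_PATTERN = 1
UNUSED_PATTERN = -1

class Patt:
    def __init__(self, h, r, entries):
        self.h = h                      # height of this tree (address bits)
        self.r = r                      # height of every immediate child
        self.entries = sorted(entries)  # (pattern, cap) pairs, sorted

    def lookup(self, addr):
        """Translate one level: return the child capability, or None."""
        if addr >= (1 << self.h):
            return None                          # invalid address
        masked = addr & ~((1 << self.r) - 1)     # strip child-level bits
        for pattern, cap in self.entries:
            if pattern in (KEEPER_PATTERN, UNUSED_PATTERN):
                continue                         # sentinels never match
            if pattern == masked:
                return cap                       # rights check omitted here
        return None                              # no match: invalid address
```

A real implementation would binary-search the sorted vector (or match
all 16 slots associatively in hardware-friendly fashion) rather than
scan linearly.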

The vector of (pattern, cap) pairs is sorted by unsigned comparison. A
variation of the dependency structure for Nodes works for PATTs as well.

Eric Northup is writing up a thorough performance analysis for PATTs
that will be available by the end of October, but the short version is
that they are only a few cycles slower than nodes and they are MUCH more
space-efficient.
Here is the part that is INCREDIBLY COOL (I am not very enthusiastic
about this :-)): You can "split" a PATT into parent and children at any
bit position. In principle, you can split all the way down to a binary
tree, and (ignoring the vector length limit) all the way up to a flat
map. We have formal descriptions of these split and join operators, and
we can show formally how to do correctness preserving and maximally
sharing preserving transformations from a PATT tree to any known
hardware mapping structure.
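
The split operator can be sketched like this (Python; names invented
for illustration, keeper/unused sentinel slots ignored for brevity):
choose a bit position b with r < b < h, group the children by their
address bits above b, and interpose a new level of PATTs.

```python
# Illustrative sketch (invented names): splitting one PATT into a parent
# plus children at bit position b, where r < b < h.
from collections import defaultdict

def split(h, r, entries, b):
    """entries: list of (pattern, cap).  Returns (parent_entries, children):
    parent_entries maps high-bit prefixes to new child PATTs of height b."""
    assert r < b < h
    groups = defaultdict(list)
    for pattern, cap in entries:
        prefix = pattern & ~((1 << b) - 1)   # bits above b select the child
        suffix = pattern & ((1 << b) - 1)    # bits below b stay in the child
        groups[prefix].append((suffix, cap))
    children = {prefix: {"h": b, "r": r, "entries": sorted(ents)}
                for prefix, ents in groups.items()}
    parent_entries = sorted((prefix, children[prefix]) for prefix in children)
    return parent_entries, children
```

Join is the inverse: hoist each child's entries back into the parent by
OR-ing the prefix back on. Splitting at the right b is what lets one
peel off a capability spanning an arbitrary multiple of 2^k pages.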

But here is something that is EVEN MORE COOL! Because we can split
PATTs, it is possible to take an existing PATT tree and perform a single
split operation that creates a capability that spans an arbitrary
multiple of 2^k pages (k>=0). Sounds similar to a FlexPage? Yes! And
better still, we can define transmit and receive operations for
subspaces that perform a FLEXTREE CAPABILITY COPY in a single operation,
resulting in a shared subspace.

But here is the VERY BEST PART OF ALL (okay, okay, I'm removing my
CapsLock key now): if you have a really small server (small number of
total pages) it can be described by a single PATT structure, which makes
mapping traversals very fast for our performance-critical servers. More
generally, real address trees are dense at the leaves and sparse at the
top, and the PATT structure provides a pretty good representation for
this. Better, certainly, than hash structures and fairly easy to scale
to 64 bit spaces.

Okay. I'm going to go take something that will CALM ME DOWN. :-)

Espen: does this give you enough to get what you wanted from it?

