
Approaches to storage allocation

From: Jonathan S. Shapiro
Subject: Approaches to storage allocation
Date: Sun, 09 Oct 2005 16:37:59 -0400

Okay. Let's deal with reference counting. I have the impression that
Hurd inherited the use of reference counting from UNIX without a great
deal of consideration. We do not (and cannot) do this in Coyotos because
of resource denial, and we do not need to.

In my experience there are only two reasons to use reference counting:

  1. Storage reclamation. I want to reclaim storage when the last client
     is done. I would like to do this implicitly.

  2. Resource cleanup. I want to perform some action when an object
     becomes unreferenced.

In Coyotos, we view both of these as a mistake. The first really *is* a
mistake. The second is something we occasionally want, but it violates
encapsulation in a way that we have not yet been willing to accept.

Let me talk first about storage reclamation:

The usual case where I want implicit, reference counted storage
reclamation goes like this:

  I have a server. It allocates objects. It uses its own storage to do
  so, so it is very important to free that storage when the object
  becomes unreferenced. These designs fall into two categories:

    1. Designs that assume the storage available to the server is
       unbounded. These designs are simply incorrect.
    2. Designs that assume that any allocating operation is permitted
       to fail.

The problem with the second design is that not all clients are equal.
Consider a file server. We have certain critical subsystems that MUST
NOT fail.
To ensure that these services are robust, we design them in such a way
that we can fully state their resource requirement. For example: they
write log messages to circular logs of fixed size. [The UNIX approach is
simply wrong: it is based on a race between the log cleaner and the
logging application. The problem is that you can lose this race, and
then a critical application fails.]
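
A fixed-size circular log makes the resource requirement fully
statable: once the buffer exists, appending never allocates and
therefore never fails. Here is a minimal sketch (a hypothetical
illustration in Python, not Coyotos code):

```python
class CircularLog:
    """Fixed-capacity log: all storage is allocated up front, so the
    resource requirement is fully statable and a critical service
    writing to it can never fail on allocation -- the oldest entries
    are simply overwritten."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = [None] * capacity  # allocated once, up front
        self.next = 0
        self.wrapped = False

    def append(self, msg):
        # Never allocates, never fails: overwrites the oldest entry.
        self.entries[self.next] = msg
        self.next = (self.next + 1) % self.capacity
        if self.next == 0:
            self.wrapped = True

    def contents(self):
        # Oldest-to-newest view of what is currently retained.
        if not self.wrapped:
            return self.entries[:self.next]
        return self.entries[self.next:] + self.entries[:self.next]
```

Note there is no race against a log cleaner here: the writer itself
discards the oldest entry, so no separate process has to keep up.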

Now that we have a known resource requirement, we quickly move to
implementing quotas in the file server so that allocations by our
critical applications will be protected from interference. The next move
is usually to introduce some form of principal ID or authority token so
that we can keep track of which application is making which call. We
implement the quota with some form of counter in every server.
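
The per-server quota counter might look like the following sketch
(hypothetical; the principal IDs and method names are invented):

```python
class QuotaServer:
    """Decentralized per-principal quota counters, as described in the
    text: every allocation is charged against the caller's counter, so
    a critical principal's reserved quota cannot be exhausted by other
    callers."""

    def __init__(self):
        self.quota = {}  # principal id -> granted units
        self.used = {}   # principal id -> units consumed so far

    def set_quota(self, principal, units):
        self.quota[principal] = units
        self.used.setdefault(principal, 0)

    def allocate(self, principal, units):
        # The authority token / principal ID identifies the caller.
        if principal not in self.quota:
            raise PermissionError("unknown principal")
        if self.used[principal] + units > self.quota[principal]:
            raise MemoryError("quota exhausted for %s" % principal)
        self.used[principal] += units
```

The weakness the next paragraph describes is visible here: these
counters live inside one server and say nothing about whether the
server's *own* storage guarantee actually holds.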

In some servers, of course, we implement it incorrectly. Because the
implementation is decentralized, we usually fail to consider that these
servers are *in turn* consumers of resource, and any resource guarantee
that they make to a client is contingent on resource guarantees made to
the server itself. The design I have outlined introduces a hierarchy of
such guarantees, and there is ample evidence that such guarantees are
*never* administered correctly in real systems.

Conceptually, the solution that we use in Coyotos is simple: we place
the burden of supplying storage on the client rather than on the server.
In any operation that allocates storage, the client supplies a "Space
Bank" object from which the server allocates the storage on behalf of
this client.

To simplify the protocols, it is normal for the client to specify a
space bank at object creation time. The server associates this bank with
the object for later use when that object is extended, grown, or shrunk.

When storage is to be reclaimed, the holder of the space bank simply
destroys the bank. Poof. No more object.
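
The protocol can be sketched as follows (a hypothetical Python
simulation; `SpaceBank` and `FileServer` are invented names, not the
real Coyotos interfaces). The server allocates only from the client's
bank, and destroying the bank reclaims everything it paid for in one
step:

```python
class SpaceBank:
    """Client-supplied storage account. The server allocates *from the
    client's bank*, never from its own storage; destroying the bank
    reclaims every allocation at once."""

    def __init__(self, limit):
        self.limit = limit
        self.allocations = []  # live objects paid for by this bank
        self.destroyed = False

    def alloc(self, size):
        if self.destroyed:
            raise RuntimeError("bank destroyed")
        held = sum(a["size"] for a in self.allocations)
        if held + size > self.limit:
            # Only *this client's* operation fails.
            raise MemoryError("client's bank is out of storage")
        obj = {"size": size, "live": True}
        self.allocations.append(obj)
        return obj

    def destroy(self):
        # "Poof. No more object." Everything this bank paid for goes.
        for a in self.allocations:
            a["live"] = False
        self.allocations = []
        self.destroyed = True

class FileServer:
    """Shared server: holds no storage of its own for client objects."""

    def create_file(self, bank, size):
        return bank.alloc(size)  # charged to the client, not the server
```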

There are good points and bad points to the Coyotos approach:

Good points:

- There is never any complex negotiation about storage. Either you
  have it or you do not. If you do not, you screwed up and your
  operation fails. Nobody else's operation fails because you screwed up.

- The party paying for the storage can always deallocate by destroying
  the bank. There is no way to hold storage hostage.

- A shared server (one that serves multiple clients acting from
  separate storage quotas) never uses its own storage to allocate
  on behalf of a client.

- Storage allocation does not become interdependent: release of your
  object does not (and must not) depend on release of my object.

- There is no cross-talk between clients of an object unless the object
  itself is designed to support that cross-talk. In particular, there
  is no notification when the last client disappears (such a
  notification is a covert channel, though one that can easily be
  justified).

- Applications using space banks seem to form a natural hierarchy of
  storage allocators. In practice, this makes storage reclaim pretty
  natural. The tricky cases are things where an object must survive
  the program that allocates it. I'll explain those in a minute.

- Deallocation is prompt. In high-security implementations of UNIX,
  the "rm" operation is required to take effect immediately, which
  violates the reference count model. If we are going to violate it
  anyway, why have it?

- Memory can leak, but it is explicitly reclaimable.

Bad points:

- When a server allocates storage this way, it must be prepared for
  the storage to spontaneously disappear. In some cases, this can
  restrict the choice of algorithms and data structures (e.g. linked
  lists).

- Unless the clients of an object agree on some protocol -- perhaps
  one that is implemented by the object itself -- there is no way to
  know when the storage can be reclaimed. In our view, any voluntary
  reference count system should be implemented by the object itself.

- When you have a server that serves multiple clients, it must pay
  attention to which source of storage it is using.

- When objects are used by multiple parties in a shared way, this
  requires planning.
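
The linked-list restriction above can be made concrete with a sketch
(hypothetical; `PerClientIndex` is an invented name). A server whose
clients pay for their own nodes cannot safely thread them onto one
shared linked list, because destroying a client's bank would snap the
chain for everyone. Keeping a separate chain per paying client
confines the damage:

```python
class PerClientIndex:
    """Server-side index that respects storage domains: each client's
    nodes live only in that client's own chain, so revoking one
    client's storage cannot corrupt anyone else's structure."""

    def __init__(self):
        self.chains = {}  # client id -> that client's own node list

    def insert(self, client, node):
        self.chains.setdefault(client, []).append(node)

    def revoke(self, client):
        # The client's bank was destroyed: all of its nodes vanish
        # together, and no other client's chain is disturbed.
        self.chains.pop(client, None)

    def live_nodes(self):
        return [n for chain in self.chains.values() for n in chain]
```

The hard case discussed later in this message is the one this sketch
dodges: a server that genuinely needs *one* index spanning all
clients' objects.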

In practice, here is what we observe in actual usage:


The overwhelming majority of objects are private. In fact, the
overwhelming majority of object *servers* are private. It is very common
for an EROS application to instantiate a new file system that it uses
for its temporary files. This file system is not shared with anyone, and
it usually runs directly from the same space bank as the client -- they
are a single unit of failure (this is common, but not required). We
describe this sort of arrangement using the term "exclusively held":
this file system is exclusively held by its client. In fact, the space
bank used for allocation is provided at file system creation time. There
is a per-file sub-bank, but this has nothing to do with quotas.
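
The natural hierarchy of banks and sub-banks can be sketched like this
(hypothetical; the real space bank interface differs). Destroying a
bank destroys its sub-banks, so reclaim follows the allocation
hierarchy:

```python
class Bank:
    """Hierarchical space bank sketch: a sub-bank is created under a
    parent, and destroying a bank recursively destroys its sub-banks,
    which is what makes storage reclaim follow the natural hierarchy
    of allocators."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.live = True
        if parent is not None:
            parent.children.append(self)

    def sub_bank(self):
        # E.g. the per-file sub-bank a file server creates.
        return Bank(parent=self)

    def destroy(self):
        for c in self.children:
            c.destroy()
        self.children = []
        self.live = False
```

An exclusively held file system running from its client's bank behaves
exactly this way: destroy the client's bank and the file system, and
every per-file sub-bank under it, goes with it.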

We observe that in exclusively held relationships, the lifetime of the
service and/or the objects *never* exceeds the lifetime of the client
(in practice).


If we look at objects that survive longer than the program that creates
them, the majority of these objects are private to a single user (or a
single user environment). In this case the space bank still works: the
user's "home directory" is actually a file server that is private to
that user and runs from the user's space bank. When it is intended that
the object survive its creating program, it is simply allocated from a
surviving server.


The tricky part of explicit storage management is sharing. If you and I
share a workspace (e.g. a CVS repository), who pays for the storage?

In practice, the answer almost always seems to be: "We create another
account for this, and all of the participants allocate storage from this
account. We accept the risk of misbehavior, and design ways to clean
up after it."

The same can be done in EROS: a system administrator can allocate a
space bank that is *not* associated with any particular user, use this
to allocate a file system, and make that file system accessible to all
of the participating users.

Better still: they can make it accessible to the users *only* when using
particular applications. For example, it can be arranged that only the
CVS tools can write into this space. This does not prevent users from
checking stupid things into CVS, but it *does* prevent other
applications from scribbling in this space (because we do not give those
applications access to CVS).

A user can allocate this storage too. It is not that we are relying on
the system administrator to do this. What we need is someone who (a)
*has* enough storage, and (b) whom we can trust not to revoke it.


This is the case where I already have an existing private file. You and
I agree that it needs to have shared storage, so I want to switch the
ownership of its storage. This is why we create a sub-bank per file. It
allows us to execute the "storage exchange" protocol that I hinted at
earlier, moving the ownership of the object storage from one domain to
another.

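Here is a minimal sketch of the storage-exchange idea, under the
assumption that transferring ownership amounts to re-parenting the
file's sub-bank (the actual protocol is only hinted at above;
everything here is invented):

```python
class Bank:
    """Minimal hierarchical bank, just enough to show the exchange."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def exchange(file_bank, new_owner):
    """Move a file's per-file sub-bank from its current owner to
    new_owner, so the file's storage is henceforth charged to the
    shared account instead of the original creator."""
    file_bank.parent.children.remove(file_bank)
    new_owner.children.append(file_bank)
    file_bank.parent = new_owner
```

After the exchange, destroying my bank no longer touches the file: its
storage now lives under the shared account's hierarchy.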
In our experience, the big complication with managed storage is those
exceptionally rare cases where (a) a server really needs to manage
multiple storage domains, and (b) the storage requires some sort of
indexing structure over its objects. The problem is precisely the linked
list problem that Matthieu has previously identified. We do not have a
good solution to this, and we do not believe that a good solution exists
in principle if denial of resource is a real concern.

Because shared stuff tends to operate out of a "commons" space bank, we
have not yet seen this to be a serious problem. There are a *very* few
cases where we deal with it -- notably the ethernet driver and the
window manager.

All of the examples we have seen of servers of this kind appear to be
fundamentally part of the systemwide TCB, and clients inherently trust
them whether they use them or not.

So far, we have dealt with this issue by saying that we will tolerate
"free riders" for the indexing structures. Our view is that the indexing
structures are a small fraction of the total system storage, and for the
very small number of servers that need to do this we are prepared to
accept a global tax on storage.

However, we don't believe that this solution scales. A better solution
(which we have not yet implemented, but we could easily do) is a space
bank that cannot be deallocated without server consent. Yes, this
permits the server to hold the client hostage, but if the server is part
of the systemwide TCB, the client is *intrinsically* hostage to the good
behavior of that server.
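
The proposed (and, per the above, not yet implemented) server-consent
bank might look like this sketch (all names hypothetical):

```python
class TrustedServer:
    """TCB server that keeps shared indexing structures. It refuses
    consent while a bank still backs live index storage."""

    def __init__(self):
        self.index_refs = set()  # banks backing live index structures

    def consents_to_destroy(self, bank):
        return bank not in self.index_refs

class ConsentBank:
    """A bank that cannot be deallocated without server consent. This
    lets the server hold the client hostage -- but a client of a TCB
    server is intrinsically hostage to it anyway."""

    def __init__(self, server):
        self.server = server
        self.live = True

    def destroy(self):
        if not self.server.consents_to_destroy(self):
            raise PermissionError("server still holds index storage "
                                  "backed by this bank")
        self.live = False
```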

Note, however, that *none* of this requires reference counts anywhere.
We simply exploit the pre-existing hierarchy of relationships between
subsystems to structure our storage relationships.

