[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: What shall the filter do to bottommost translators

From: olafBuddenhagen
Subject: Re: What shall the filter do to bottommost translators
Date: Fri, 9 Jan 2009 08:01:23 +0100
User-agent: Mutt/1.5.18 (2008-05-17)


On Wed, Dec 31, 2008 at 02:42:21PM +0200, Sergiu Ivanov wrote:
> On Mon, Dec 29, 2008 at 8:25 AM, <olafBuddenhagen@gmx.net> wrote:
> > On Mon, Dec 22, 2008 at 07:19:50PM +0200, Sergiu Ivanov wrote:

> > The most radical approach would be to actually start a new nsmux
> > instance for each filesystem in the mirrored tree. This might in
> > fact be easiest to implement, though I'm not sure about other
> > consequences... What do you think? Do you see how it could work? Do
> > you consider it a good idea?
> >
> I'm not really sure about the details of such implementation, but when
> I consider the recursive magic stuff, I'm rather inclined to come to
> the conclusion that this will be way too much...

Too much what?...

> Probably, this variant could work, but I'll tell you frankly: I cannot
> imagine how to do that, although I feel that it's possible.

Well, it's really almost trivial...

Let's take a simple example. Say the directory tree contains a file
"/a/b/c". Say nodes "a" and "b" are served by the root filesystem, but
"b" is translated, so "b'" (the effective, translated version of "b")
and "c" are served by something else.

Let's assume we have an nsmux set up at "n", mirroring the root
filesystem. Now a client looks up "/n/a/b/c". (Perhaps indirectly by
looking up "/a/b/c" in a session chroot-ed to "/n".) This results in a
lookup for "a/b/c" on the proxy filesystem provided by nsmux.

nsmux forwards the lookup to the real directory tree it mirrors, which
is the root filesystem in our case. The root filesystem will look up
"a/b", and see that "b" is translated. It will obtain the root of the
translator, yielding the translated node "b'", and return that as the
retry port, along with "c" as the retry name. (And RETRY_REAUTH I

Now in the traditional "monolithic" implementation, nsmux will create a
proxy node for "a/b'", and pass on a port for this proxy node to the
client as retry port, along with the other retry parameters. (Which are
passed on unchanged.)

The client will then do the retry, finishing the lookup.

Now what about the control ports? The client could do a lookup for
"/n/a/b" with O_NOTRANS for example, and then invoke
file_get_translator_cntl() to get the control port of the translator
sitting on "b". nsmux in this case forwards the request to the original
"/a/b" node, but doesn't pass the result to the client directly.
Instead, it creates an new port: a proxy control port to be precise. The
real control port is stored in the port structure, and the proxy port is
passed to the client. Any invocations the client does on this proxy
control port are forwarded to the real one as appropriate.

The same happens when the client looks up "/n/a/b/c", and then invokes
file_getcontrol() on the result: nsmux forwards the request to the real
"/a/b/c" node, and proxies the result.

With the "distributed" nsmux, things would work a bit differently. Again
there is a lookup for "/n/a/b/c", i.e. "a/b/c" on the proxy filesystem
provided by nsmux; again it is forwarded to the root filesystem, and
results in "a/b'" being returned along with a retry notification. The
distributed nsmux now creates a proxy node of "a/b" (note: the
untranslated b). It starts another nsmux instance, mirroring "/a/b'",
and attaches this to the "a/b" proxy node.

Again, the client will finish the lookup. (By doing the retry on the new
nsmux instance.)

When the client invokes file_get_translator_cntl() on "/n/a/b", the main
nsmux sees that there is a translator attached to this proxy node
(namely the second nsmux), and returns its control port to the client --
just like any normal filesystem would do. When the client does
file_getcontrol() on "/n/a/b/c", the second nsmux will simply return its
own control port, again like any normal filesystem.

Now unfortunately I realized, while thinking about this explanation,
that returning the real control port of nsmux, while beautiful in its
simplicity, isn't really useful... If someone invoked fsys_goaway() on
this port for example, it would make the nsmux instance go away, instead
of the actual translator on the mirrored tree.

And we can't simply forward requests on the nsmux control port to the
mirrored filesystem: The main nsmux instance must handle both requests
on it's real control port itself (e.g. when someone does "settrans -ga
/n"), and forward requests from clients that did fsys_getcontrol on one
of the proxied nodes to the mirrored filesystem. So we can't do without
passing some kind of proxy control port to the clients, rather than the
nsmux control port.

Considering the we need the proxy control ports anyways, the whole idea
of the distributed nsmux seems rather questionable now... Sorry for the
noise :-)

> > But let's assume for now we stick with one nsmux instance for the
> > whole tree. With trivfs, it's possible for one translator to serve
> > multiple filesystems -- I guess netfs can do that too...
> Could you please explain what do you mean by saying this in a more
> detailed way?

Well, normally each translator is attached to one underlying node. This
association is created by fsys_startup(), which is usually invoked
through trivfs_startup() for filesystems using libtrivfs.

However, a translator can also create another control port with
trivfs_create_control(), and attach it to some filesystem location
manually with file_set_translator(). (This is not very common, but the
term server for example does it, so the same translator can be attached
both to the pty master and corresponding pty slave node.) We get a
translator serving multiple filesystems.

I assume that this is possible with libnetfs as well...

Anyways, this variant would probably behave exactly the same as the
"distributed" one, only that a single process would serve all nsmux
instances, instead of a new process for each instance. It would have the
same problems... So we probably don't need to consider it further.

> > The alternative would be to override the default implementations of
> > some of the fsys and other RPCs dealing with control ports, so we
> > would only serve one filesystem from the library's point of view,
> > but still be able to return different control ports.
> > 
> > As we override the standard implementations, it would be up to us
> > how we handle things in this case. Easiest probably would be to
> > store a control port to the respective real filesystem in the port
> > structure of every proxy control port we return to clients.
> This is variant I was thinking about: custom implementations of some
> RPCs is the fastest way. At least I can imagine quite well what is
> required to do and I can tell that this variant will be, probably, the
> least resource consuming of all.

Well, if this is the variant you feel most comfortable with, it's
probably best to implement this one :-)

We can still change it if we come up with something better later on...

> > > As for the ``dotdot'' node, nsmux usually knows who is the parent
> > > of the current node; if we are talking about a client using nsmux,
> > > it is their responsibility to know who is the parent of the
> > > current node.
> > >
> > > OTOH, I am not sure at all about the meaning of this argument,
> > > especially since it is normally provided in an unauthenticated
> > > version.
> >
> > AIUI it is returned when doing a lookup for ".." on the node
> > returned by fsys_getroot(). In other words, it normally should be
> > the directory in which the translated node resides.
> Yep, this is my understanding, too. I guess I have to take a glimpse
> into the source to figure out *why* this argument is required...

Well, there must be a way for the translator to serve lookups for the
".." node... So we need to tell it what the ".." node is at some point.

One might think that a possible alternative approach would be to provide
it once at translator startup, instead of individually on each root
lookup. The behaviour would be different however: Imaginge a translator
sitting on a node that is hardlinked (or firmlinked) from several
directories. You can look it up from different directories, and the ".."
node should be different for each -- passing it once on startup wouldn't
work here.

> > As the authentication is always specific to the client, there is no
> > point in the translator holding anything but an unauthenticated port
> > for "..".
> Sorry for the offtopic, but could be please explain what do you mean
> by authentication here? (I would just like to clear out some issues in
> my understanding of Hurd concepts)

I wish someone would clear up *my* understanding of authentication a
bit... ;-)

Anyways, let's try. File permissions are generally checked by the
filesystem servers. The permission bits and user/group fields in the
inode determine which user gets what kind of access permissions on the
file. To enforce this, the filesystem server must be able to associate
the client's UID and GID with the UIDs and GIDs stored in the inode.

So, how does the filesystem server know that a certain UID capability
presented by the user corresponds to say UID 1003?

This is done through the authentication mechanism. I'm not sure about
the details. AIUI, on login, the password server asks the auth server to
create an authentication token with all UIDs and GIDs the user posesses.
(Well, one UID actually :-) ) This authentication token is (normally)
inherited by all processes the user starts.

Now one of the user processes contacts the filesystem server, and wants
to access a file. The filesystem server must be told by the auth server
what UIDs/GIDs this process has. This is what happens during the
reauthentication: A bit simplyfied, the client process presents its
authentication token to the auth server, and tells it to inform the
filesystem server which UIDs/GIDs it conveys. From now on, the
filesystem server knows which UIDs/GIDs correspond to port (protid) the
client holds.

(The actual reauthentication process is a bit tricky, because it is a
three-way handshake...)

This process needs to be done for each filesystem server the client
contacts. This is why a reauthentication needs to be done after crossing
a translator boundary -- or at least that is my understanding of it. The
retry port returned to the client when a translator is encountered
during lookup, is obtained by the server containing the node on which
the translator sits, and obviously can't have the client's
authentication; the client has to authenticate to the new server itself.

> > > Well, setting translators in a simple loop is a bit faster, since,
> > > for example, you don't have to consider the possibility of an
> > > escaped ``,,'' every time
> >
> > Totally neglectible...
> >
> Of course :-) I've got a strange habit of trying to reduce the number
> of string operations...

I also suffer from this unhealthy tendency to think too much about
pointless micro-optimizations... The remedy is to put things in
perspective: In an operation that involves various RPCs, a process
startup etc., some string operations taking some dozens or hundreds of
clock cycles won't even show up in the profile...

Code size optimisations are a different thing of course: Not only are
ten bytes saved usually much more relevant than ten clock cycles saved
(except in inner loops of course); but also it tends to improve code
quality -- if you manage to write something with less overall
operations, less special cases, less redundancy etc., it becomes much
more readable, considerably less error-prone, much easier to modify, and
alltogether more elegant...

> The main reason that makes me feel uneasy is the fact that retries
> actually involve lookups. I cannot really figure out for now what
> should be looked up if I want to add a new translator in the dynamic
> translator stack...

The not yet processed rest of the file name -- just like on any other

These can be additional suffixes, or also additional file name
components if dealing with directories. When looking up
"foo,,x,,y/bar,,z" for example, the first lookup will process "foo,,x"
and return ",,y/bar,,z" as the retry name; the second will process ",,y"
and return "bar,,z"; the third will process "bar,,z" and return ""; and
the last one will finish by looking up "" (i.e. only get a new port to
the same node, after reauthenticating).

I'm not entirely sure, but I think the retry is actually unavoidable for
correct operation, as reauthentication should be done with each new

BTW, I just realized there is one area we haven't considered at all so
far: Who should be able to start dynamic translators, and as which user
should they run?...

> > That's the definition of dynamic translators: They are visible only
> > for their own clients, but not from the underlying node. (No matter
> > whether this underlying node is served by another dynamic
> > translator.)
> That seems clear. What makes we wonder, however, is how a filter will
> traverse a dynamic translator stack, if it will not be able to move
> from a dynamic translator to the one which is next in the dynamic
> translator stack?

Well, note that the filter is started on top of the stack. It is a
client of the shadow note created when starting it, which is a client of
the topmost dynamic translator in the stack, which is a client of its
shadow node, which is a client of the second-to-top dynamic
translator... So fundamentally, there is no reason why it wouldn't be
able to traverse the stack -- being an (indirect) client of all the
shadow nodes in the stack, it can get access to all the necessary

The question is how it obtains that information... And you are right of
course: I said in the past that the translator sitting on a shadow node
doesn't see itself, because it has to see the other translators instead
(so the filter can work). This is obviously a contradiction.

This is a bit tricky, and I had to think about it for a while. I think
the answer is that a dynamic translator should see both the underlying
translator stack, *and* itself on top of it. For this, the shadow node
needs to proxy all translator stack traversal requests (forwarding them
to the underlying stack), until a request arrives for the node which the
shadow node mirrors, in which case it returns the dynamic translator
sitting on the shadow node.

Let's look at an example. Say we have a node "file" with translators
"a", "b" and "c" stacked on it. The effective resulting node translated
trough all three is "file'''". When we access it through nsmux, we get a
proxy of this node.

Now we use a filter to skip "c", so we get a proxy node of "file''" --
the node translated through "a" and "b" only. Having that, we set a
dynamic translators "x" on top of it. nsmux creates a shadow node
mirroring "file''", and sets the "x" translator on this shadow node.
Finally, it obtains the root node of "x" -- this is "file'',,x" -- and
returns an (ordinary non-shadow) proxy node of that to the client.

(To be clear: the "'" are not actually part of any file name; I'm just
using them to distinguish the node translated by static translators from
the untranslated node.)

And now, we use another filter on the result. This is the interesting

First another shadow node is set up, mirroring "file'',,x"; and the
filter attached to it. (So temporarily we get something like
"file'',,x,,f".) Now the filter begins its work: It starts by getting
the untranslated version of the underlying node. The request doing this
(fys_startup() IIRC?) is sent to the underlying node, i.e. to the second
shadow node, mirroring (the proxy of) "file'',,x". This "x" shadow node
forwards it to the node it mirrors: which is the aforementioned
(non-shadow) proxy node of "file'',,x".

nsmux may be able to derive the completely untranslated "file" node from
that, but this would be rushing it: It would put the first, "file''"
shadow node out of the loop. (Remember that conceptually, we treat the
shadow nodes as if they were provided by distinct translators...) So
what it does instead is obtaining the first shadow node (mirroring the
proxy of "file''"): This one is the underlying node of the "x"
translator, and is considered the untranslated version of the node
provided by "x". It asks this shadow node for the untranslated version
of the node it shadows...

So the "file''" shadow node in turn forwards the request to the proxy of
"file''". This node is handled by nsmux directly, without any further
shadow nodes; so nsmux can directly get to the untranslated "file" node,
and return a proxy of that.

This new proxy node is returned to the first shadow node. The
(conceptual) shadow translator now creates another shadow node: this one
shadowing (the proxy of) the untranslated "file". (Oh no, even more of
these damn shadow nodes!...) A port for this new shadow node is then
returned to the requestor.

The requestor in this case was the "file'',,x" proxy node, which also
sets up a new proxy node, and passes the result (the proxy of the shadow
of the proxy of untranslated "file"...) on to the second (conceptual)
shadow translator, mirroring "file'',,x". This one also creates a new
shadow node, so we get a shadow of a proxy of a shadow of the "file"
proxy node... And this is what the filter sees.

In the next step, the filter invokes file_get_translator_cntl() on this
shadow-proxy-shadow-proxy-"file". The request is passed through the
second new shadow node (let's call it the "x" shadow, as it is derived
from the original "file'',,x" shadow node), and the new "x" proxy node
(derived from the "file'',,x" proxy), and the first new shadow node (the
"file" shadow), and through the primary "file" proxy node finally
reaches the actual "file". There it gets the control port of "a". A
proxy of this control port is created, and passed to the "file" shadow
translator, which creates a shadow control... (ARGH!)

This is in turn passed to the intermediate "x" proxy, which creates
another proxy control port -- only to return that to the "x" shadow
translator, which shadows the whole thing again. A
shadow-proxy-shadow-proxy-control port for the "a" translator. Lovely.

Now the filter launches fsys_getroot() on this, again passing through
shadow and proxy and shadow and proxy, finally looking up the root node
of "a" (i.e. "file'"), which is passed up again -- we get

Again file_get_translator_cntl() resulting in
shadow-proxy-shadow-proxy-control of "b", and again fsys_getroot()
yielding shadow-proxy-shadow-proxy-"file''". Then yet another
file_get_translator_cntl(), giving us... Can you guess it?

Wait, not so hasty :-) This time actually something new happens,
something exciting, something wonderful: The request is passed down by
the "x" shadow and the "x" proxy as usual, but by the time it reaches
the "file''" shadow (the one to which "x" is attached), the routine is
broken: The shadow node realizes that the request is not for some
faceless node further down, but indeed for the very "file''" node it is
shadowing! So instead of passing the request down (which would get the
control port of "c"), the shadow node handles the request itself,
returning the control port of the translator attached to it: i.e. the
control port of "x".

This is again proxied by the "x" proxy, and shadowed by the "x" shadow

The following fsys_getroot() is again passed down by the "x" shadow and
the "x" proxy, and handled by the "file''" shadow: It returns the root
node of "x". The "x" proxy creates another proxy node; the "x" shadow
translator creates another shadow node.

The filter invokes file_get_translator_cntl() yet another time, this
time handled by the top-most (now only active) shadow node directly (the
"x" shadow, to which the filter is attached), returning the control port
of the filter itself. We are there at last.

Doesn't all this shadow-proxy-shadow-proxy-fun make your head spin, in
at least two directions at the same time? If it does, I might say that I
made a point... ;-)

While I believe that this approach would work, I'm sure you will agree
that it's horribly complicated and confusing. If the shadow nodes were
really implemented by different translators, it would be terribly
inefficient too, with all this proxying...

And it is actually not entirely correct: Let's say we use a filter to
skip "b" (and the rest of the stack on top of it). The filter would
traverse the stack through all this shadow machinery, and would end up
with a port to the root node of "a" that is still proxied by the various
shadow nodes. The filter should return a "clean" node, without any
shadowing on top of it; but the shadow nodes don't know when the filter
is done and passes the result to the client, so they can't put themselfs
out of the loop...

All these problems (both the correctness issue and the complexity)
result from the fact that the stack is traversed bottom-to-top: First
all underlying nodes need to be traversed, and only then the shadow node
takes action -- that's why it needs to proxy the other operations, so
it's still in control when its time finally arrives.

The whole thing would be infinitely simpler if we traversed the stack
top to bottom: The shadow node would just handle the first request, and
as soon as the client asks for the next lower translator in the stack,
it would just hand over to that one entirely, not having any reason to
interfere anymore.

And this is what I meant above about making a point: It seems that my
initial intuition about traversing top to bottom being simpler/more
elegant, now proves to have been merited...

Note that unlike what I originally suggested, this probably doesn't even
strictly need a new RPC implemented by all the translators, to ask for
the underlying node directly: As nsmux inserts proxy nodes at each level
of the translator stack, nsmux should always be able to provide the
underlying node, without the help of the actual translator...

(In fact I used this possibility in the scenario described above, when
initially wading down through the layers of dynamic translators, until
we finally get to the static stack...)


PS. Your mails contain both plain text and HTML versions of the content.
This unnecessarily bloats the messages, and is generally not seen
favourably on mailing lists. Please change the configuration of your
mail client to send plain text only.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]