Re: bug in task server startup code


From: Marcus Brinkmann
Subject: Re: bug in task server startup code
Date: Thu, 07 Oct 2004 20:59:15 +0200
User-agent: Wanderlust/2.10.1 (Watching The Wheels) SEMI/1.14.6 (Maruoka) FLIM/1.14.6 (Marutamachi) APEL/10.6 Emacs/21.3 (i386-pc-linux-gnu) MULE/5.0 (SAKAKI)

At Tue, 17 Aug 2004 13:11:30 +0200,
Bas Wijnen <address@hidden> wrote:
> I did some more testing, added output code to wortel/startup.c (which is the
> startup code of all tasks started by wortel except physmem, so only the task
> server at the moment) and tried to see if the mappings it requests from physmem
> (its startup and memory container) arrive with their correct data (I added
> some print statements to physmem as well.)
> 
> Well, they don't.  startup.c page faulted on the check, so I checked the result
> of the IPC which received the mapping.  It failed with error code 9, meaning
> "message overflow in the receive phase.  A message overflow can occur [string
> related stuff] and if a map/grant of an fpage fails because the system has not
> enough page-table space available." (L4 Reference Manual X.2 page 62)

Well, first of all, you should check if the mappings are at least
correct (or reasonable).  That's an important sanity check because
there might just be a bug in determining the fpages and their load
addresses.
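
For example, something along these lines would catch misaligned or
too-small fpages before you blame the kernel (an untested sketch; I
use the names from the X.2 reference manual's convenience interface,
our libl4 spells them differently, and the helper is made up):

  #include <l4/space.h>

  /* Does FPAGE really cover ADDR..ADDR+LEN, and is it naturally
     aligned?  If not, the fpage computation is the bug, not L4.  */
  static int
  fpage_covers (L4_Fpage_t fpage, L4_Word_t addr, L4_Word_t len)
  {
    L4_Word_t base = L4_Address (fpage);
    L4_Word_t size = L4_Size (fpage);

    /* An fpage must be aligned to its own size.  */
    if (base & (size - 1))
      return 0;

    /* The requested range must fall entirely within the fpage.  */
    return addr >= base && addr + len <= base + size;
  }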

However, an error 9 is interesting indeed.
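
To be sure what the 9 encodes, decode the ErrorCode TCR.  Per the X.2
reference, bit 0 is the phase (0 = send, 1 = receive) and bits 1 to 3
are the error class, so the raw value 9 (binary 1001) is class 4,
"message overflow", in the receive phase, matching the text you
quoted.  A sketch (hypothetical helper name):

  #include <l4/ipc.h>

  static L4_Word_t
  ipc_error_class (L4_MsgTag_t tag)
  {
    L4_Word_t ec;

    if (! L4_IpcFailed (tag))
      return 0;

    /* Bit 0: phase (0 = send, 1 = receive); bits 1-3: error class.
       A raw value of 9 (binary 1001) is class 4, message overflow,
       in the receive phase.  */
    ec = L4_ErrorCode ();
    return (ec >> 1) & 0x7;
  }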
 
> I am surprised by this.  If the page tables are too small already, then how
> can they be large enough for a normal system.  It would probably help if
> wortel grants pages to physmem instead of mapping them, but the limit seems to
> be reached much too soon, and while such a solution might make it bootable, it
> doesn't make it a usable system.  It may be a good idea nonetheless, though.
> I don't think wortel should hold all memory in the system.  It doesn't use it
> anyway.

Currently, we reserve 16 MB for the kernel's use.  See KMEM_SIZE in
laden/ia32-cmain.c.  I don't know how L4 uses this memory.  You might
try to define this to 32 MB or so, but that seems excessive for the
little we do.  If we have our fpages right, and L4 returns error 9, we
should look closer at L4, the page tables, and that stuff.
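
Concretely, the experiment is just this (sketch; check the exact
definition in the tree):

  /* laden/ia32-cmain.c: memory reserved for the kernel's own use,
     e.g. page tables.  Previously 16 MB; try doubling it.  */
  #define KMEM_SIZE (32 * 1024 * 1024)

If a larger reservation makes error 9 go away, that tells us the page
tables really are the bottleneck.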

> Two last notes: If the mapping didn't succeed at all, the previous test
> (printing the bytes at the entry point) should have page faulted as well.  It
> didn't; it just gave me different bytes than I expected.  The mappings to
> started tasks, such as the task server, are not at their "real" address, as I
> thought before, so that most probably was a mapping problem.  I didn't find it
> yet though, and don't know if I can reproduce it.  I'll try that.

The memory is mapped in fpages; maybe mapping one fpage (the one with
the addresses you accessed) worked, and another one failed?  Still,
you would not expect to see wrong data, and in your other mail you say
the offset was actually 0x1000 off or so, which would indicate to me a
bug in the ELF loader/startup mapping stuff.
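
To illustrate what I mean (an untested sketch; the helper is made up,
wortel's real code looks different): if the loader computes the fpage
and send base from an unaligned segment address without rounding down,
the data lands exactly one page off.

  #include <l4/space.h>
  #include <l4/message.h>

  /* Map one 4 KB page of a segment loaded at LOAD_ADDR.  Forgetting
     the round-down when LOAD_ADDR is not page aligned shifts the
     mapping by 0x1000.  */
  static L4_MapItem_t
  make_map_item (L4_Word_t load_addr)
  {
    L4_Word_t page_base = load_addr & ~(L4_Word_t) 0xfff;

    /* One 4 KB fpage (2^12), naturally aligned.  */
    L4_Fpage_t fpage = L4_FpageLog2 (page_base, 12);

    /* The send base determines where the page lands in the receive
       window; it must be rounded the same way as the fpage itself.  */
    return L4_MapItem (fpage, page_base);
  }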

The best way to track such things down is to track them down, line by
line, instruction by instruction.  It's slow work, but you can learn
something about the kernel debugger along the way :)

> Failing ipcs may corrupt the database of a capability server.  In this case,
> physmem thinks task has received the pages, because it is not notified of the
> failed ipc.  

This needs some consideration.  I think you have found one of the few
cases (maybe even the only one), where sending an IPC can actually
fail without either the sender or receiver being at fault (in a broad
sense).  OTOH, if there is no room for page tables anymore, you are in
deep shit.  Might as well panic and reboot at that point.

Mapping memory is to be considered a restricted operation.  We can
enforce that by using redirectors.  Neal thinks that's unnecessary
overhead; I think it's a necessary feature if you are paranoid and
want to protect against DoS attacks.  In any case, the only program
ever doing mappings should be physmem (and other trusted tasks like
the device server, maybe).  Unfortunately, physmem has no way of
guessing how much kernel memory is consumed by the mappings it
creates.  It's also a figure that can change dramatically and
spontaneously: Imagine you have
mapped a superpage of 4 MB, and you want to unmap just a single 4 KB
page at its end.  Boom, your page table explodes (this is why in L4,
you can officially only unmap whole fpages you have mapped previously
- although the implementation is allowed to show useful behaviour if
you unmap fpages partially).
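
In code, the scenario looks like this (sketch; addresses hypothetical):

  #include <l4/space.h>

  /* After mapping a 4 MB superpage at 0x400000, unmap only its last
     4 KB page.  The kernel must replace the single superpage entry
     with a full table of 4 KB entries to honour this, which is where
     the kernel memory suddenly goes.  */
  static void
  partial_unmap (void)
  {
    L4_Fpage_t last = L4_FpageLog2 (0x400000 + 0x3ff000, 12);
    L4_Flush (last);
  }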

For all other IPC failures I can think of, the story is actually quite
simple: It's either a programming bug in the server (fix the bug then)
or the fault of the client.  So, if the client doesn't correctly
receive a message, it doesn't know about the state of the server side
objects, but from the server side's perspective, everything is
consistent.  Of course, the only reasonable option for the client at
that point is to fail with an error message, as there is no way for it
to recover.  The server will then release its resources as normal.
Alternatively, if you are paranoid, you could also revoke the caps
from the client on the server side when an error occurs, which will
also revert the state to something well defined.  However, what's the
use of that?  It
doesn't seem to do anything.  In a paranoid server, I'd maybe attempt
to do that, but not in a normal server.
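
In a server's reply path, that whole policy is just this (sketch;
revoke_caps is a hypothetical helper):

  #include <l4/ipc.h>

  static void
  reply_to_client (L4_ThreadId_t client, L4_MsgTag_t reply_tag)
  {
    L4_MsgTag_t result;

    L4_Set_MsgTag (reply_tag);
    result = L4_Reply (client);

    if (L4_IpcFailed (result))
      {
        /* The client's fault; our state is consistent, so release
           the request's resources as usual.  A paranoid server might
           additionally call revoke_caps (client) here.  */
      }
  }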

If it is not clear to you why it is always the client's fault, I can
explain further.  Let me know which case you are interested in (simple
IPC, string items, map items).

Thanks,
Marcus




