Re: A Critique of Shepherd Design

From: raid5atemyhomework
Subject: Re: A Critique of Shepherd Design
Date: Sat, 20 Mar 2021 11:10:53 +0000

Good morning Mark,

> Hi,
> raid5atemyhomework writes:
> > GNU Shepherd is the `init` system used by GNU Guix. It features:
> >
> > -   A rich full Scheme language to describe actions.
> > -   A simple core that is easy to maintain.
> >
> > However, in this critique, I contend that these features are bugs.
> > The Shepherd language for describing actions on Shepherd daemons is a
> > Turing-complete Guile language. Turing completeness runs afoul of the
> > Principle of Least Power. In principle, all that actions have to do
> > is invoke `exec`, `fork`, `kill`, and `waitpid` syscalls.
> These 4 calls are already enough to run "sleep 100000000000" and wait
> for it to finish, or to rebuild your Guix system with an extra patch
> added to glibc.

I agree.  But this mechanism is intended to prevent stupid mistakes like the
one I made, not to protect against an attacker who can invoke `guix system
reconfigure` on arbitrary Scheme code (and who could easily wrap anything
nefarious in an `unsafe-turing-complete` or `without-static-analysis` escape
mechanism).  Seatbelts, not steel walls.

> > Yet the language is a full Turing-complete language, including the
> > major weakness of Turing-completeness: the inability to solve the
> > halting problem.
> > The fact that the halting problem is unsolved in the language means it
> > is possible to trivially write an infinite loop in the language. In
> > the context of an `init` system, the possibility of an infinite loop
> > is dangerous, as it means the system may never complete bootup.
> Limiting ourselves to strictly total functions wouldn't help much here,
> because for all practical purposes, computing 10^100 digits of Pi is
> just as bad as an infinite loop.

Indeed.  Again, seatbelts, not steel walls.  It's fairly difficult to
accidentally write a program that computes 10^100 digits of pi.  It's not so
difficult to have a brain fart and use `(- count 1)` instead of `(+ count 1)`
because you were idly wondering whether an increment or a decrement loop would
be more Schemey, or whether both are just as Schemey as the other.

What I propose would protect against the latter (a much more likely mistake):
in context, the recursive loop would be flagged because the recursive call is
a call to a function that is not on the whitelist.  Hopefully the flag would
prompt the sysad writing `configuration.scm` to look for the "proper" way to
wait for an event, and lead them to the (hopefully extant) documentation on
whatever domain-specific language we have for waiting on events, instead of
rolling their own.
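
To make the idea concrete, here is a rough sketch of such a check in plain
Guile.  Everything in it is made up for illustration: the contents of
`%allowed-operators`, the name `flagged-calls`, and the `wait-loop` function in
the example.  It walks raw quoted s-expressions, assumes proper lists, and does
not understand binding forms such as `let`, so a real check would more likely
run on macro-expanded code.

```scheme
(use-modules (srfi srfi-1))   ; for append-map

;; Hypothetical whitelist of operators an action body may use.
(define %allowed-operators
  '(if begin + - = file-exists? sleep))

;; Return the operator symbols called in EXPR that are not whitelisted.
;; Quoted data is skipped; anything in call position counts as a call.
(define (flagged-calls expr)
  (cond ((not (pair? expr)) '())
        ((eq? (car expr) 'quote) '())
        ((symbol? (car expr))
         (let ((rest (append-map flagged-calls (cdr expr))))
           (if (memq (car expr) %allowed-operators)
               rest
               (cons (car expr) rest))))
        (else (append-map flagged-calls expr))))

;; Body of a hand-rolled recursive wait loop, with the decrement typo.
(define body
  '(if (file-exists? "/run/ready")
       #t
       (begin (sleep 1) (wait-loop (- count 1)))))

(display (flagged-calls body))   ; prints (wait-loop)
(newline)
```

The typo itself is not detected; what gets flagged is the call to the
non-whitelisted `wait-loop`, which is the point where the sysad would be nudged
toward the supported way of waiting.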

> That said, I certainly agree that Shepherd could use improvement, and
> I'm glad that you've started this discussion.
> At a glance, your idea of having Shepherd do more within subprocesses
> looks promising to me, although this is not my area of expertise.

An issue here is that we sometimes pass data between Shepherd actions using
environment variables, and a change made to the environment in a subprocess is
not visible to the parent process or to sibling subprocesses.  See the
`set-http-proxy` action of `guix-daemon`: the environment variable is used as a
global namespace that is accessible from both the `set-http-proxy` and `start`
actions.
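
A small Guile demonstration of why this breaks once actions run in
subprocesses.  The variable name `DEMO_HTTP_PROXY` and the proxy URL are made
up for the demo; this is not the actual `guix-daemon` service code.

```scheme
;; DEMO_HTTP_PROXY is a made-up variable name used only for this demo.
(setenv "DEMO_HTTP_PROXY" "unset")

(let ((pid (primitive-fork)))
  (if (zero? pid)
      ;; Child: change the variable, then exit immediately.
      (begin
        (setenv "DEMO_HTTP_PROXY" "http://localhost:8118")
        (primitive-_exit 0))
      ;; Parent: wait for the child, then look at its own environment.
      (begin
        (waitpid pid)
        (display (getenv "DEMO_HTTP_PROXY"))   ; prints unset
        (newline))))
```

The child's `setenv` only affects the child's copy of the environment, so an
action that runs in its own subprocess cannot hand a value to a later action
this way.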

On the other hand, arguably the environment variable table is a global resource 
shared amongst multiple shepherd daemons.  This technique in general may not 
scale well for large numbers of daemons; environment variable name conflicts 
may cause subtle problems later.  I think it would be better if, in addition to
the "value" (typically the PID), each Shepherd service also had a `settings`
field that can be read and modified by each action.  It could hold anything
that satisfies
`(lambda (x) (equal? x (read (print x))))`,
so that it can be easily serialized across each subprocess launched by each
action; by convention it could be an association list.  The `set-http-proxy`
action would then update the `settings` field of the shepherd service and
queue up a `restart` action.
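
A sketch of what I mean, in plain Guile.  None of this is existing Shepherd
API: `serialize`, `deserialize`, `serializable?`, and `set-setting` are names
invented here, and I use `write` for the "print" step since that is what
`read` can parse back.

```scheme
(use-modules (srfi srfi-1))   ; for alist-delete

;; Turn OBJ into its written representation, and back.
(define (serialize obj)
  (call-with-output-string (lambda (port) (write obj port))))

(define (deserialize str)
  (call-with-input-string str read))

;; True if OBJ survives a write/read round trip unchanged.  Objects with
;; no readable representation (e.g. procedures) make `read` fail, which
;; false-if-exception turns into #f.
(define (serializable? obj)
  (false-if-exception
   (equal? obj (deserialize (serialize obj)))))

;; Return a new settings alist with KEY bound to VALUE.
(define (set-setting settings key value)
  (acons key value (alist-delete key settings)))

;; What a hypothetical set-http-proxy action might do to the settings.
(define settings '((http-proxy . #f)))
(define updated
  (set-setting settings 'http-proxy "http://localhost:8118"))

(display (assq-ref updated 'http-proxy))   ; prints http://localhost:8118
(newline)
```

Because `updated` is plain data, it can be written to a pipe for whichever
subprocess runs the next action, and it is still there on a later `restart`.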

This would also persist the `http_proxy` setting, BTW --- currently if you 
`herd set-http-proxy guix-daemon <whatever>` and then `herd restart 
guix-daemon` later, the HTTP proxy is lost (since the environment variable is 
cleared after `set-http-proxy` restarts the `guix-daemon`).  In short, this 
`set-http-proxy` example looks like a fairly brittle hack anyway, and maybe 
worth avoiding as a pattern.

Then there are actions that invoke other actions.  From a cursory glance at the
Guix code it looks like only Ganeti and Guix-Daemon have actions that invoke 
actions, and they only invoke actions on their own Shepherd services.  It seems 
to me safe for an action invoked in another action of the same service to *not* 
spawn a new process, but to execute as the same process.  Not sure how safe it 
would be to allow one shepherd service to invoke an action on another shepherd 
service --- but then the `start` action of any service may cause other services 
it requires to be started as well, so we still do need to figure out what 
subprocesses to launch or not launch.

Or maybe each Shepherd service has its own subprocess that is its own mainloop, 
and the "main" Shepherd process mainloop "just" serves as a switching center to 
forward commands to each service's mainloop-subprocess, and also incidentally
monitors per-service mainloop-subprocesses that are not responding fast enough
(and possibly decides to kill those mainloops and all their children, then
disable that service).  This would make each service's environment variables a
persistent but local store that is specific to each service and makes its use 
in `guix-daemon` safe, and the `set-http-proxy` would simply not clear the env 
vars so that the setting persists.  This allows Shepherd to remain responsive 
at all times even if some action of some Shepherd service enters an infloop or 
10^100 pi digits condition; it could even have `herd status` report the number 
of pending unhandled commands for each service to inform the sysad about 
possible problems with specific services.
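
The monitoring half of this could look roughly like the following Guile
sketch.  `run-with-watchdog` is a name I made up, the 10 ms polling interval
is arbitrary, and a real implementation would talk to a long-lived per-service
mainloop over a pipe or socket rather than forking once per action.

```scheme
;; Run THUNK in a child process.  If the child is still running after
;; TIMEOUT-MS milliseconds, kill it.  Returns 'finished or 'killed.
(define (run-with-watchdog thunk timeout-ms)
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        ;; Child: run the action, then exit without returning to the caller.
        (begin
          (false-if-exception (thunk))
          (primitive-_exit 0))
        ;; Parent: poll with WNOHANG so it is never blocked by the child.
        (let loop ((waited-ms 0))
          (cond ((not (zero? (car (waitpid pid WNOHANG))))
                 'finished)
                ((>= waited-ms timeout-ms)
                 (kill pid SIGKILL)
                 (waitpid pid)          ; reap the killed child
                 'killed)
                (else
                 (usleep 10000)         ; 10 ms
                 (loop (+ waited-ms 10))))))))

;; An action stuck in an infloop gets killed; a quick one finishes.
(display (run-with-watchdog (lambda () (let spin () (spin))) 200))
(newline)
(display (run-with-watchdog (lambda () #t) 200))
(newline)
```

The first call prints `killed` and the second prints `finished`; in both cases
the parent stays free to handle other commands between polls.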

