Re: A Critique of Shepherd Design
From: raid5atemyhomework
Subject: Re: A Critique of Shepherd Design
Date: Sat, 20 Mar 2021 11:10:53 +0000
Good morning Mark,
> Hi,
>
> raid5atemyhomework raid5atemyhomework@protonmail.com writes:
>
> > GNU Shepherd is the `init` system used by GNU Guix. It features:
> >
> > - A rich full Scheme language to describe actions.
> > - A simple core that is easy to maintain.
> >
> > However, in this critique, I contend that these features are bugs.
> > The Shepherd language for describing actions on Shepherd daemons is a
> > Turing-complete Guile language. Turing completeness runs afoul of the
> > Principle of Least Power. In principle, all that actions have to do
> > is invoke `exec`, `fork`, `kill`, and `waitpid` syscalls.
>
> These 4 calls are already enough to run "sleep 100000000000" and wait
> for it to finish, or to rebuild your Guix system with an extra patch
> added to glibc.
I agree. But this mechanism is intended to prevent stupid mistakes like the
one I committed, not to protect against an attacker who is capable of invoking
`guix system reconfigure` on arbitrary Scheme code (and who can easily wrap
anything nefarious in an `unsafe-turing-complete` or `without-static-analysis`
escape mechanism). Seatbelts, not steel walls.
>
> > Yet the language is a full Turing-complete language, including the
> > major weakness of Turing-completeness: the inability to solve the
> > halting problem.
> > The fact that the halting problem is unsolved in the language means it
> > is possible to trivially write an infinite loop in the language. In
> > the context of an `init` system, the possibility of an infinite loop
> > is dangerous, as it means the system may never complete bootup.
>
> Limiting ourselves to strictly total functions wouldn't help much here,
> because for all practical purposes, computing 10^100 digits of Pi is
> just as bad as an infinite loop.
Indeed. Again, seatbelts, not steel walls. It's fairly difficult to
accidentally write a program that computes 10^100 digits of pi; it's not so
difficult to have a brain fart and use `(- count 1)` instead of `(+ count 1)`
because you were idly wondering whether an increment or a decrement loop would
be more Schemey, or whether both are equally Schemey.
What I propose would protect against the latter (a much more likely mistake):
in context, the recursive call would be flagged because it is a call to a
function that is not a member of the whitelist. Hopefully, getting recursive
loops flagged would make the sysad writing `configuration.scm` look for the
"proper" way to wait for an event to be true, and lead them to the (hopefully
extant) documentation on whatever domain-specific language we have for waiting
on events, instead of rolling their own.
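
As a rough illustration of the kind of whitelist check meant above, here is a
minimal Guile sketch. It is not an existing Shepherd or Guix facility: the
names `%allowed-operators` and `disallowed-calls`, and the contents of the
whitelist, are assumptions made up for this example, and the walker only
understands `quote` and `let` as special forms.

```scheme
(use-modules (ice-9 match) (srfi srfi-1))

;; Hypothetical whitelist of operators an action body may call.
(define %allowed-operators
  '(begin if when unless + - = < > kill waitpid))

(define (disallowed-calls expr)
  "Return the operator symbols called in EXPR that are not whitelisted."
  (match expr
    (('quote _) '())                            ; quoted data is never called
    (('let (? symbol? _) ((_ inits) ...) body ...)   ; named let
     (append-map disallowed-calls (append inits body)))
    (('let ((_ inits) ...) body ...)            ; plain let
     (append-map disallowed-calls (append inits body)))
    (((? symbol? op) args ...)                  ; ordinary call
     (let ((rest (append-map disallowed-calls args)))
       (if (memq op %allowed-operators) rest (cons op rest))))
    ((op args ...)                              ; computed operator
     (append-map disallowed-calls (cons op args)))
    (_ '())))                                   ; atoms

;; The decrement-instead-of-increment brain fart, as data.
(define buggy-wait
  '(let loop ((count 0))
     (if (< count 10)
         (loop (- count 1))
         'done)))

(define flagged (disallowed-calls buggy-wait))  ; the call to `loop` is flagged
```

The named-`let` recursion is reported simply because `loop` is not on the
whitelist; the check does not need to understand why the loop fails to
terminate.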
> That said, I certainly agree that Shepherd could use improvement, and
> I'm glad that you've started this discussion.
>
> At a glance, your idea of having Shepherd do more within subprocesses
> looks promising to me, although this is not my area of expertise.
An issue here is that we sometimes pass data across Shepherd actions using
environment variables, and changes to those do not propagate across process
boundaries. See the `set-http-proxy` action of `guix-daemon`: the environment
variable is used as a global namespace that is accessible from both the
`set-http-proxy` and `start` actions.
On the other hand, the environment variable table is arguably a global
resource shared amongst multiple Shepherd daemons. This technique in general
may not scale well for large numbers of daemons; environment variable name
conflicts may cause subtle problems later. I think it would be better if, in
addition to the "value" (typically the PID), each Shepherd service also had a
`settings` field that can be read and modified by each action. It could
contain anything that satisfies `(lambda (x) (equal? x (read (print x))))`, so
that it can be easily serialized across each subprocess launched by each
action; by convention it could be an association list. The `set-http-proxy`
action would then update this `settings` field for the Shepherd service and
queue up a `restart` action.
This would also persist the `http_proxy` setting, BTW. Currently, if you
`herd set-http-proxy guix-daemon <whatever>` and later `herd restart
guix-daemon`, the HTTP proxy is lost (since the environment variable is
cleared after `set-http-proxy` restarts the `guix-daemon`). In short, this
`set-http-proxy` example looks like a fairly brittle hack anyway, and is maybe
worth avoiding as a pattern.
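
The per-service `settings` idea above could look roughly like the following
Guile sketch. Everything here is hypothetical (`serializable?`,
`settings-set`, the `http-proxy` key and the example URL are made up for
illustration), and it uses the standard `write` procedure where the predicate
above says `print`.

```scheme
(use-modules (srfi srfi-1))

(define (serializable? x)
  "Return #t if X survives a write/read round trip unchanged."
  ;; Objects such as procedures print as #<...>, which `read' rejects
  ;; with an exception; treat that as "not serializable".
  (false-if-exception
   (equal? x
           (call-with-input-string
            (call-with-output-string (lambda (port) (write x port)))
            read))))

(define (settings-set settings key value)
  "Return the alist SETTINGS with KEY bound to VALUE."
  (unless (serializable? value)
    (error "setting cannot be serialized:" key value))
  (alist-cons key value (alist-delete key settings)))

;; What a `set-http-proxy' action might store before queueing a restart.
(define guix-daemon-settings
  (settings-set '() 'http-proxy "http://proxy.example:8080"))

;; What a later `start' action would read back.
(define proxy (assoc-ref guix-daemon-settings 'http-proxy))
```

Because the alist is plain readable data, it can be written to a pipe and
read back in whatever subprocess runs the next action, and it is still there
on a later `restart`.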
Then there are actions that invoke other actions. From a cursory glance at the
Guix code, it looks like only Ganeti and Guix-Daemon have actions that invoke
actions, and they only invoke actions on their own Shepherd services. It seems
safe to me for an action invoked from another action of the same service to
*not* spawn a new process, but to execute in the same process. I'm not sure
how safe it would be to allow one Shepherd service to invoke an action on
another Shepherd service. But then, the `start` action of any service may
cause other services it requires to be started as well, so we still do need to
figure out which subprocesses to launch or not launch.
Or maybe each Shepherd service has its own subprocess running its own
mainloop, and the "main" Shepherd process mainloop "just" serves as a
switching center that forwards commands to each service's mainloop-subprocess,
and incidentally also monitors per-service mainloop-subprocesses that are not
responding fast enough (possibly deciding to kill such a mainloop and all its
children, then disable that service). This would make each service's
environment variables a persistent but local store specific to that service,
which makes their use in `guix-daemon` safe; `set-http-proxy` would simply not
clear the env vars, so that the setting persists. It would also allow Shepherd
to remain responsive at all times even if some action of some Shepherd service
enters an infloop or a 10^100-pi-digits condition; `herd status` could even
report the number of pending unhandled commands for each service to inform the
sysad about possible problems with specific services.
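
A very rough Guile sketch of that switching-center arrangement, to make the
shape concrete. This is not how Shepherd works today; `spawn-service-loop`,
`forward-command`, the command format and the timeout value are all
assumptions for illustration, and a real design would also need to handle
partial replies, process groups, and restarting.

```scheme
;; Fork a per-service mainloop.  HANDLE is a hypothetical procedure that
;; maps a command (any readable datum) to a reply datum.
;; Returns (PID TO-SERVICE-PORT FROM-SERVICE-PORT).
(define (spawn-service-loop handle)
  (let* ((commands (pipe))              ; main process -> service
         (replies  (pipe))              ; service -> main process
         (pid      (primitive-fork)))
    (cond
     ((zero? pid)                       ; child: the service mainloop
      (close-port (cdr commands))
      (close-port (car replies))
      (let loop ((command (read (car commands))))
        (unless (eof-object? command)
          (write (handle command) (cdr replies))
          (newline (cdr replies))
          (force-output (cdr replies))
          (loop (read (car commands)))))
      (primitive-exit 0))
     (else                              ; parent: the switching center
      (close-port (car commands))
      (close-port (cdr replies))
      (list pid (cdr commands) (car replies))))))

;; Forward COMMAND and wait at most TIMEOUT seconds for a reply, so that
;; a stuck service cannot block the main process.
(define (forward-command service command timeout)
  (let ((to-service   (cadr service))
        (from-service (caddr service)))
    (write command to-service)
    (newline to-service)
    (force-output to-service)
    (if (null? (car (select (list from-service) '() '() timeout)))
        'unresponsive                   ; caller may kill and disable
        (read from-service))))

(define echo-service
  (spawn-service-loop (lambda (command) (list 'ok command))))

(define reply
  (forward-command echo-service '(set-http-proxy "http://proxy.example:8080") 5))

;; Closing the command pipe lets the service loop see EOF and exit.
(close-port (cadr echo-service))
(waitpid (car echo-service))
```

On an `'unresponsive` result the main process could `kill` the service's PID
and mark the service disabled, and the count of commands written but not yet
answered is the number `herd status` could report.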
Thanks
raid5atemyhomework