[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

A Critique of Shepherd Design

From: raid5atemyhomework
Subject: A Critique of Shepherd Design
Date: Fri, 19 Mar 2021 17:33:57 +0000

GNU Shepherd is the `init` system used by GNU Guix.  It features:

* A rich full Scheme language to describe actions.
* A simple core that is easy to maintain.

However, in this critique, I contend that these features are bugs.

The Shepherd language for describing actions on Shepherd daemons is a 
Turing-complete Guile language.  Turing completeness runs afoul of the 
Principle of Least Power.  In principle, all that actions have to do is invoke 
`exec`, `fork`, `kill`, and `waitpid` syscalls.  Yet the language is a full 
Turing-complete language, including the major weakness of Turing-completeness: 
the inability to solve the halting problem.

The fact that the halting problem is unsolved in the language means it is 
possible to trivially write an infinite loop in the language.  In the context 
of an `init` system, the possibility of an infinite loop is dangerous, as it 
means the system may never complete bootup.

Now, let us combine this with the second feature (really a bug): GNU shepherd 
is a simple, single-threaded Scheme program.  That means that if the single 
thread enters an infinite loop (because of a Shepherd service description that 
entered an infinite loop), then Shepherd itself hangs.  This means that the 
system is severely broken.  You cannot `sudo reboot` because that communicates 
with the Shepherd daemon.  You cannot even `kill 1` because signals are handled 
in the mainloop, which never get entered if the `start` action of some Shepherd 
daemon managed to enter an infinite loop.

I have experienced this firsthand since I wrote a Shepherd service to launch a 
daemon, and I needed to wait for the daemon initialization to complete.  My 
implementation of this had a bug that caused an infinite loop to be entered, 
but would only tickle this bug when a very particular (persistent on-disk) 
state existed.  I wrote this code about a month or so ago, and never got to 
test it until last week, when the bug was tickled.  Unfortunately, by then I 
had removed older system versions that predated the bug.  When I looked at a 
backup copy of the `configuration.scm`, I discovered the bug soon afterwards.  
But the amount of code verbiage needed here had apparently overwhelmed me at 
the time I wrote the code to do the waiting, and the bug got into the code and 
broke my system.  I had to completely reinstall Guix (fortunately the backup 
copy of `configuration.scm` was helpful in recovering most of the system, also 
ZFS rocks).

Yes, I made a mistake.  I'm only human.  It should be easy to recover from 
mistakes.  The full Turing-completeness of the language invites mistakes, and 
the single-threadedness makes the infinite-loop mistakes that 
Turing-completeness invites, into catastrophic system breakages.

So what can we do?

For one, a Turing-complete language is a strict *superset* of 
non-Turing-complete languages.  So one thing we can do is to define a more 
dedicated language for Shepherd actions, strongly encourage the use of that 
sub-language, and, at some point, require that truly Turing-complete actions 
need to be wrapped in a `(unsafe-turing-complete ...)` form.

For example, in addition to the current existing `make-forkexec-constructor` 
and `make-kill-destructor`, let us also add these syntax forms:

`(wait-for-condition <timeout> <form>)` - Return a procedure that accepts a 
numeric `pid`, that does: Check if evaluating `<form>` in the lexical context 
results in `#f`.  If so, wait one second and re-evaluate, but exit anyway and 
return `#f` if `<timeout>` seconds has passed, or if the `pid` has died.  If 
`<form>` evaluates to non-`#f` then return it immediately.

`(timed-action <timeout> <form> ...)` - Return a procedure that accepts a 
numeric `pid`, that does: Spawn a new thread (or maybe better a `fork`ed 
process group?).  The new thread evaluates `<form> ...` in sequence.  If the 
thread completes or throws in `<timeout>` seconds, return the result or throw 
the exception in the main thread.  IF the `<timeout>` is reached or the given 
`pid` has died, kill the thread and any processes it may have spawned, then 
return `#f`.

`(wait-to-die <timeout>)` - Return a procedure that accepts a `pid` that does: 
Check if the `pid` has died, if so, return #t.  Else sleep 1 second and try 
again, and if `<timeout>` is reached, return `#f`.

The above forms should also report as well what they are doing (e.g. `Have been 
waiting 10 of 30 seconds for <form>`) on the console and/or syslog.

In addition, `make-forkexec-constructor` should accept a `#:post-fork`, which 
is a procedure that accepts a `pid`, and `make-kill-destructor` should accept a 
`#:pre-kill`, which is a procedure that accepts a `pid`.  Possibly we need to 
add combinators for the above actions as well.  For example a `sub-actions` 
procedural form that accepts any number of functions of the above `pid -> bool` 
type and creates a single combined `pid -> bool`.

So for example, in my case, I would have written a `make-forkexec-constructor` 
that accepted a `#:post-fork` that had a `wait-for-condition` on the code that 
checked if the daemon initialization completed correctly.

I think it is a common enough pattern that:

* We have to spawn a daemon.
* We have to wait for the daemon to be properly initialized (`#:post-fork`)
* When shutting down the daemon, it's best to at least try to politely ask it 
to finish using whatever means the daemon has (`#:pre-kill`).
* If the daemon doesn't exit soon after we politely asked it to, be less polite 
and start sending signals.

So I think the above should cover a good number of necessities.

Then, we can define a strict subset of Guile, containing a set of forms we know 
are total (i.e. we have a proof they will always halt for all inputs).  Then 
any Shepherd actions, being Lisp code and therefore homoiconic, can be 
analyzed.  Every list must have a `car` that is a symbol naming a syntax or 
procedure that is known to be safe --- `list`, `append`, `cons*`, `cons`, 
`string-append`, `string-join`, `quote` (which also prevents analysis of 
sub-lists), `and`, `or`, `begin`, thunk combinators, as well as the 
domain-specific `make-forkexec-constructor`, `make-kill-destructor`, 
`wait-for-condition`, `timed-action`, and probably some of the `(guix build 
utils)` like `invoke`, `invoke/quiet`, `mkdir-p` etc.

Sub-forms (or the entire form for an action) can be wrapped in 
`(unsafe-turing-complete ...)` to skip this analysis for the sub-form, but 
otherwise, by default, the specific subset must be used, and users have to 
explicitly put `(unsafe-turing-complete ...)` so they are aware that they can 
seriously break their system if they make a mistake here.  Ideally, as much of 
the code for any Shepherd action should be *outside* an 
``unsafe-turing-complete`, and only parts of the code that really need the full 
Guile language to implement should be rapped in `unsafe-turing-complete`.

(`lambda` and `let-syntax` and `let`, because they can rebind the meanings of 
symbols, would need to be in `unsafe-turing-complete` --- otherwise the 
analysis routine would require a full syntax expander as well)

Turing-completeness is a weakness, not a strength, and restricting languages to 
be the Least Powerful necessary is better.  The `unsafe-turing-complete` form 
allows an escape, but by default Shepherd code should be in the restricted 
non-Turing-complete subset, to reduce the scope for catastrophic mistakes.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]