[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
A Critique of Shepherd Design
From: |
raid5atemyhomework |
Subject: |
A Critique of Shepherd Design |
Date: |
Fri, 19 Mar 2021 17:33:57 +0000 |
GNU Shepherd is the `init` system used by GNU Guix. It features:
* A rich full Scheme language to describe actions.
* A simple core that is easy to maintain.
However, in this critique, I contend that these features are bugs.
The Shepherd language for describing actions on Shepherd daemons is a
Turing-complete Guile language. Turing completeness runs afoul of the
Principle of Least Power. In principle, all that actions have to do is invoke
`exec`, `fork`, `kill`, and `waitpid` syscalls. Yet the language is a full
Turing-complete language, including the major weakness of Turing-completeness:
the inability to solve the halting problem.
The fact that the halting problem is unsolved in the language means it is
possible to trivially write an infinite loop in the language. In the context
of an `init` system, the possibility of an infinite loop is dangerous, as it
means the system may never complete bootup.
Now, let us combine this with the second feature (really a bug): GNU shepherd
is a simple, single-threaded Scheme program. That means that if the single
thread enters an infinite loop (because of a Shepherd service description that
entered an infinite loop), then Shepherd itself hangs. This means that the
system is severely broken. You cannot `sudo reboot` because that communicates
with the Shepherd daemon. You cannot even `kill 1` because signals are handled
in the mainloop, which never get entered if the `start` action of some Shepherd
daemon managed to enter an infinite loop.
I have experienced this firsthand since I wrote a Shepherd service to launch a
daemon, and I needed to wait for the daemon initialization to complete. My
implementation of this had a bug that caused an infinite loop to be entered,
but would only tickle this bug when a very particular (persistent on-disk)
state existed. I wrote this code about a month or so ago, and never got to
test it until last week, when the bug was tickled. Unfortunately, by then I
had removed older system versions that predated the bug. When I looked at a
backup copy of the `configuration.scm`, I discovered the bug soon afterwards.
But the amount of code verbiage needed here had apparently overwhelmed me at
the time I wrote the code to do the waiting, and the bug got into the code and
broke my system. I had to completely reinstall Guix (fortunately the backup
copy of `configuration.scm` was helpful in recovering most of the system, also
ZFS rocks).
Yes, I made a mistake. I'm only human. It should be easy to recover from
mistakes. The full Turing-completeness of the language invites mistakes, and
the single-threadedness makes the infinite-loop mistakes that
Turing-completeness invites, into catastrophic system breakages.
So what can we do?
For one, a Turing-complete language is a strict *superset* of
non-Turing-complete languages. So one thing we can do is to define a more
dedicated language for Shepherd actions, strongly encourage the use of that
sub-language, and, at some point, require that truly Turing-complete actions
need to be wrapped in a `(unsafe-turing-complete ...)` form.
For example, in addition to the current existing `make-forkexec-constructor`
and `make-kill-destructor`, let us also add these syntax forms:
`(wait-for-condition <timeout> <form>)` - Return a procedure that accepts a
numeric `pid`, that does: Check if evaluating `<form>` in the lexical context
results in `#f`. If so, wait one second and re-evaluate, but exit anyway and
return `#f` if `<timeout>` seconds has passed, or if the `pid` has died. If
`<form>` evaluates to non-`#f` then return it immediately.
`(timed-action <timeout> <form> ...)` - Return a procedure that accepts a
numeric `pid`, that does: Spawn a new thread (or maybe better a `fork`ed
process group?). The new thread evaluates `<form> ...` in sequence. If the
thread completes or throws in `<timeout>` seconds, return the result or throw
the exception in the main thread. IF the `<timeout>` is reached or the given
`pid` has died, kill the thread and any processes it may have spawned, then
return `#f`.
`(wait-to-die <timeout>)` - Return a procedure that accepts a `pid` that does:
Check if the `pid` has died, if so, return #t. Else sleep 1 second and try
again, and if `<timeout>` is reached, return `#f`.
The above forms should also report as well what they are doing (e.g. `Have been
waiting 10 of 30 seconds for <form>`) on the console and/or syslog.
In addition, `make-forkexec-constructor` should accept a `#:post-fork`, which
is a procedure that accepts a `pid`, and `make-kill-destructor` should accept a
`#:pre-kill`, which is a procedure that accepts a `pid`. Possibly we need to
add combinators for the above actions as well. For example a `sub-actions`
procedural form that accepts any number of functions of the above `pid -> bool`
type and creates a single combined `pid -> bool`.
So for example, in my case, I would have written a `make-forkexec-constructor`
that accepted a `#:post-fork` that had a `wait-for-condition` on the code that
checked if the daemon initialization completed correctly.
I think it is a common enough pattern that:
* We have to spawn a daemon.
* We have to wait for the daemon to be properly initialized (`#:post-fork`)
* When shutting down the daemon, it's best to at least try to politely ask it
to finish using whatever means the daemon has (`#:pre-kill`).
* If the daemon doesn't exit soon after we politely asked it to, be less polite
and start sending signals.
So I think the above should cover a good number of necessities.
Then, we can define a strict subset of Guile, containing a set of forms we know
are total (i.e. we have a proof they will always halt for all inputs). Then
any Shepherd actions, being Lisp code and therefore homoiconic, can be
analyzed. Every list must have a `car` that is a symbol naming a syntax or
procedure that is known to be safe --- `list`, `append`, `cons*`, `cons`,
`string-append`, `string-join`, `quote` (which also prevents analysis of
sub-lists), `and`, `or`, `begin`, thunk combinators, as well as the
domain-specific `make-forkexec-constructor`, `make-kill-destructor`,
`wait-for-condition`, `timed-action`, and probably some of the `(guix build
utils)` like `invoke`, `invoke/quiet`, `mkdir-p` etc.
Sub-forms (or the entire form for an action) can be wrapped in
`(unsafe-turing-complete ...)` to skip this analysis for the sub-form, but
otherwise, by default, the specific subset must be used, and users have to
explicitly put `(unsafe-turing-complete ...)` so they are aware that they can
seriously break their system if they make a mistake here. Ideally, as much of
the code for any Shepherd action should be *outside* an
``unsafe-turing-complete`, and only parts of the code that really need the full
Guile language to implement should be rapped in `unsafe-turing-complete`.
(`lambda` and `let-syntax` and `let`, because they can rebind the meanings of
symbols, would need to be in `unsafe-turing-complete` --- otherwise the
analysis routine would require a full syntax expander as well)
Turing-completeness is a weakness, not a strength, and restricting languages to
be the Least Powerful necessary is better. The `unsafe-turing-complete` form
allows an escape, but by default Shepherd code should be in the restricted
non-Turing-complete subset, to reduce the scope for catastrophic mistakes.
Thanks
raid5atemyhomework
- A Critique of Shepherd Design,
raid5atemyhomework <=