[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: A Critique of Shepherd Design
Re: A Critique of Shepherd Design
Sat, 20 Mar 2021 00:02:36 -0400
Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
raid5atemyhomework <firstname.lastname@example.org> writes:
> GNU Shepherd is the `init` system used by GNU Guix. It features:
> * A rich full Scheme language to describe actions.
> * A simple core that is easy to maintain.
> However, in this critique, I contend that these features are bugs.
> The Shepherd language for describing actions on Shepherd daemons is a
> Turing-complete Guile language. Turing completeness runs afoul of the
> Principle of Least Power. In principle, all that actions have to do
> is invoke `exec`, `fork`, `kill`, and `waitpid` syscalls. Yet the
> language is a full Turing-complete language, including the major
> weakness of Turing-completeness: the inability to solve the halting
> The fact that the halting problem is unsolved in the language means it
> is possible to trivially write an infinite loop in the language. In
> the context of an `init` system, the possibility of an infinite loop
> is dangerous, as it means the system may never complete bootup.
> Now, let us combine this with the second feature (really a bug): GNU
> shepherd is a simple, single-threaded Scheme program. That means that
> if the single thread enters an infinite loop (because of a Shepherd
> service description that entered an infinite loop), then Shepherd
> itself hangs. This means that the system is severely broken. You
> cannot `sudo reboot` because that communicates with the Shepherd
> daemon. You cannot even `kill 1` because signals are handled in the
> mainloop, which never get entered if the `start` action of some
> Shepherd daemon managed to enter an infinite loop.
> I have experienced this firsthand since I wrote a Shepherd service to
> launch a daemon, and I needed to wait for the daemon initialization to
> complete. My implementation of this had a bug that caused an infinite
> loop to be entered, but would only tickle this bug when a very
> particular (persistent on-disk) state existed. I wrote this code
> about a month or so ago, and never got to test it until last week,
> when the bug was tickled. Unfortunately, by then I had removed older
> system versions that predated the bug. When I looked at a backup copy
> of the `configuration.scm`, I discovered the bug soon afterwards. But
> the amount of code verbiage needed here had apparently overwhelmed me
> at the time I wrote the code to do the waiting, and the bug got into
> the code and broke my system. I had to completely reinstall Guix
> (fortunately the backup copy of `configuration.scm` was helpful in
> recovering most of the system, also ZFS rocks).
> Yes, I made a mistake. I'm only human. It should be easy to recover
> from mistakes. The full Turing-completeness of the language invites
> mistakes, and the single-threadedness makes the infinite-loop mistakes
> that Turing-completeness invites, into catastrophic system breakages.
> So what can we do?
> For one, a Turing-complete language is a strict *superset* of
> non-Turing-complete languages. So one thing we can do is to define a
> more dedicated language for Shepherd actions, strongly encourage the
> use of that sub-language, and, at some point, require that truly
> Turing-complete actions need to be wrapped in a
> `(unsafe-turing-complete ...)` form.
> For example, in addition to the current existing
> `make-forkexec-constructor` and `make-kill-destructor`, let us also
> add these syntax forms:
> `(wait-for-condition <timeout> <form>)` - Return a procedure that
> accepts a numeric `pid`, that does: Check if evaluating `<form>` in
> the lexical context results in `#f`. If so, wait one second and
> re-evaluate, but exit anyway and return `#f` if `<timeout>` seconds
> has passed, or if the `pid` has died. If `<form>` evaluates to
> non-`#f` then return it immediately.
> `(timed-action <timeout> <form> ...)` - Return a procedure that
> accepts a numeric `pid`, that does: Spawn a new thread (or maybe
> better a `fork`ed process group?). The new thread evaluates `<form>
> ...` in sequence. If the thread completes or throws in `<timeout>`
> seconds, return the result or throw the exception in the main thread.
> IF the `<timeout>` is reached or the given `pid` has died, kill the
> thread and any processes it may have spawned, then return `#f`.
> `(wait-to-die <timeout>)` - Return a procedure that accepts a `pid`
> that does: Check if the `pid` has died, if so, return #t. Else sleep
> 1 second and try again, and if `<timeout>` is reached, return `#f`.
> The above forms should also report as well what they are doing
> (e.g. `Have been waiting 10 of 30 seconds for <form>`) on the console
> and/or syslog.
> In addition, `make-forkexec-constructor` should accept a
> `#:post-fork`, which is a procedure that accepts a `pid`, and
> `make-kill-destructor` should accept a `#:pre-kill`, which is a
> procedure that accepts a `pid`. Possibly we need to add combinators
> for the above actions as well. For example a `sub-actions` procedural
> form that accepts any number of functions of the above `pid -> bool`
> type and creates a single combined `pid -> bool`.
> So for example, in my case, I would have written a
> `make-forkexec-constructor` that accepted a `#:post-fork` that had a
> `wait-for-condition` on the code that checked if the daemon
> initialization completed correctly.
> I think it is a common enough pattern that:
> * We have to spawn a daemon.
> * We have to wait for the daemon to be properly initialized (`#:post-fork`)
> * When shutting down the daemon, it's best to at least try to politely
> ask it to finish using whatever means the daemon has (`#:pre-kill`).
> * If the daemon doesn't exit soon after we politely asked it to, be
> less polite and start sending signals.
> So I think the above should cover a good number of necessities.
> Then, we can define a strict subset of Guile, containing a set of
> forms we know are total (i.e. we have a proof they will always halt
> for all inputs). Then any Shepherd actions, being Lisp code and
> therefore homoiconic, can be analyzed. Every list must have a `car`
> that is a symbol naming a syntax or procedure that is known to be safe
> --- `list`, `append`, `cons*`, `cons`, `string-append`, `string-join`,
> `quote` (which also prevents analysis of sub-lists), `and`, `or`,
> `begin`, thunk combinators, as well as the domain-specific
> `make-forkexec-constructor`, `make-kill-destructor`,
> `wait-for-condition`, `timed-action`, and probably some of the `(guix
> build utils)` like `invoke`, `invoke/quiet`, `mkdir-p` etc.
> Sub-forms (or the entire form for an action) can be wrapped in
> `(unsafe-turing-complete ...)` to skip this analysis for the sub-form,
> but otherwise, by default, the specific subset must be used, and users
> have to explicitly put `(unsafe-turing-complete ...)` so they are
> aware that they can seriously break their system if they make a
> mistake here. Ideally, as much of the code for any Shepherd action
> should be *outside* an ``unsafe-turing-complete`, and only parts of
> the code that really need the full Guile language to implement should
> be rapped in `unsafe-turing-complete`.
> (`lambda` and `let-syntax` and `let`, because they can rebind the
> meanings of symbols, would need to be in `unsafe-turing-complete` ---
> otherwise the analysis routine would require a full syntax expander as
> Turing-completeness is a weakness, not a strength, and restricting
> languages to be the Least Powerful necessary is better. The
> `unsafe-turing-complete` form allows an escape, but by default
> Shepherd code should be in the restricted non-Turing-complete subset,
> to reduce the scope for catastrophic mistakes.
Thanks for the well written critique. Turing-completeness is a two
edged sword; with great power comes great responsibility :-). The same
could be said of our use of Guile for writing package definitions, but
in a context where mistakes are easier to catch.
Your proposal is also interesting... I'm not sure if I like it yet but
I'll keep thinking about it. The reason I'm somewhat reluctant is for
the same reason I think using Guile to write package definitions is
great: maximum expressiveness and flexibility. If we were to try and
start restricting that core feature, it wouldn't really be Shepherd
anymore. Other init systems such as systemd keep their descriptive
language minimal but accommodate running pre or post actions (typically
scripts) very easily (but with restrictions about how long they can run
in the case of systemd IIRC).
Thanks for having shared your analysis!