Re: A Critique of Shepherd Design

From: raid5atemyhomework
Subject: Re: A Critique of Shepherd Design
Date: Sun, 21 Mar 2021 00:22:09 +0000

Hello Ludo',

> Hi,
> raid5atemyhomework skribis:
> > Now, let us combine this with the second feature (really a bug): GNU
> > shepherd is a simple, single-threaded Scheme program. That means that
> > if the single thread enters an infinite loop (because of a Shepherd
> > service description that entered an infinite loop), then Shepherd
> > itself hangs.
> You’re right that it’s an issue; in practice, it’s okay because we pay
> attention to the code we run there, but obviously, mistakes could lead
> to the situation you describe.
> It’s a known problem and there are plans to address it, discussed on
> this list a few times before. The Shepherd “recently” switched to
> ‘signalfd’ for signal handling in the main loop, with an eye on making
> the whole loop event-driven:
> This will address this issue and unlock things like “socket activation”.
> That said, let’s not lie to ourselves: the Shepherd’s design is
> simplistic. I think that’s okay though because there’s a way to address
> the main issues while keeping it simple.

I'm not sure you can afford to keep it simple.  Consider:

In that issue, the `networking` provision comes up potentially *before* the 
network is, in fact, up.  This means that other daemons that require 
`networking` could potentially be started before the network connection is up.

One example of such a daemon is `transmission-daemon`.  This daemon will bind
itself to port 9091 so you can control it.  Unfortunately, if it gets started
while the network is down, it will be unable to bind to 9091 (so you can't
control it) but will still keep running.  On my system that means that on
reboot I have to manually `sudo herd restart transmission-daemon`.

In another example, I have a custom daemon that I have set up to use the Tor 
proxy over  It requires both `networking` and `tor`.  When it 
starts after `networking` comes up but before the actual network does, it dies 
because it can't access the proxy at (apparently NetworkManager 
handles loopback as well).  Shepherd then respawns it, and it dies again
(network still not up), enough times that it gets disabled.  This means that on
reboot I have to manually `sudo herd enable raid5atemyhomework-custom-daemon`
and `sudo herd restart raid5atemyhomework-custom-daemon`.
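
To make the failure mode concrete, here is a rough Python sketch of a respawn limit.  It is only a model, not Shepherd's actual code: the limit of 5 deaths within 7 seconds and the names `RespawnLimiter` and `simulate` are assumptions made up for illustration.

```python
from collections import deque


class RespawnLimiter:
    """Disable a service that dies too often within a sliding time window."""

    def __init__(self, max_deaths=5, window_seconds=7.0):
        # Hypothetical limits, chosen only for illustration.
        self.max_deaths = max_deaths
        self.window_seconds = window_seconds
        self.deaths = deque()
        self.enabled = True

    def record_death(self, now):
        """Record a death at time `now`; return True if a respawn is allowed."""
        if not self.enabled:
            return False
        self.deaths.append(now)
        # Forget deaths that have fallen out of the sliding window.
        while self.deaths and now - self.deaths[0] > self.window_seconds:
            self.deaths.popleft()
        if len(self.deaths) >= self.max_deaths:
            self.enabled = False
        return self.enabled


def simulate(network_up_at, death_interval=1.0):
    """Return True if the daemon outlives the outage, False if it is disabled."""
    limiter = RespawnLimiter()
    now = 0.0
    while now < network_up_at:
        # The daemon starts, cannot reach its proxy, and dies.
        now += death_interval
        if not limiter.record_death(now):
            return False
    return True
```

With these assumed numbers, a daemon that dies once per second is disabled well before a network that takes 10 seconds to come up, while the same daemon survives a 3-second outage.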

On SystemD-based systems, there's a `NetworkManager-wait-online.service`
which just calls `nm-online -s -q --timeout=30`.  This delays network-requiring
daemons until after the network is in fact actually up.
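
The behaviour of such a "wait until online, but give up after 30 seconds" step can be sketched in Python as below.  This is a generic polling loop under my own assumptions, not how `nm-online` is implemented; `wait_online`, `is_online`, and `poll_interval` are hypothetical names, and the clock and sleep functions are parameters only so the loop can be exercised without real waiting.

```python
import time


def wait_online(is_online, timeout=30.0, poll_interval=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Block until is_online() returns True or `timeout` seconds have passed.

    Returns True if the network came up in time, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if is_online():
            return True
        if clock() >= deadline:
            return False
        # The caller is blocked here; nothing else runs in this thread.
        sleep(poll_interval)
```

The important property is the one in the comment: whoever calls this is stuck for up to `timeout` seconds if the network never comes up.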

However in Mark points out this is 
undesirable in the Guix case since it could potentially stall the 
(single-threaded) bootup process for up to 30 seconds if the network is 
physically disconnected, a bad UX for desktop and laptop users (who might still 
want to run `transmission-daemon`, BTW) because it potentially blocks the
initialization of X and makes the computer unusable for such users for up to 30
seconds after boot.  I note that I experienced such issues in some very old
Ubuntu installations, as well.

SystemD can afford to *always* have `nm-online -s -q --timeout=30` because it's 
concurrent.  The wait-online service will block, but other services like X
that don't ***need*** the network will continue booting.  So the user can still 
get to a usable system even if the boot isn't complete because the network 
isn't up yet due to factors beyond the control of the operating system.
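
The difference can be sketched with a toy model in Python.  None of this is SystemD or Shepherd code: the service names, the startup delays, and `start_services` are assumptions for illustration, with a half-second sleep standing in for the 30-second wait.

```python
import threading
import time


def start_services(services, concurrent):
    """Start every service; return {name: seconds after start when ready}.

    `services` maps a name to (startup_seconds, [required service names]).
    In sequential mode the dict is assumed to be listed in dependency order.
    """
    t0 = time.monotonic()
    ready_at = {}
    done = {name: threading.Event() for name in services}

    def run(name):
        delay, requires = services[name]
        for req in requires:
            done[req].wait()      # block until each requirement is ready
        time.sleep(delay)         # stands in for the real startup work
        ready_at[name] = time.monotonic() - t0
        done[name].set()

    if concurrent:
        threads = [threading.Thread(target=run, args=(n,)) for n in services]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
    else:
        for name in services:     # one thread, one service at a time
            run(name)
    return ready_at


# Hypothetical boot: a slow network wait, X, and a network-requiring daemon.
SERVICES = {
    "network-online": (0.5, []),
    "xorg": (0.05, []),
    "transmission-daemon": (0.05, ["network-online"]),
}
```

In the sequential run `xorg` only becomes ready after the network wait has finished, even though it does not require the network.  In the concurrent run it is ready almost immediately, and only `transmission-daemon` is held back.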

Switching to a concurrent design for Shepherd --- *any* concurrent design ---
is probably best done sooner rather than later, because such a switch risks
strongly affecting customized `configuration.scm`s like mine that have almost
half a dozen custom Shepherd daemons.

