Re: A Critique of Shepherd Design
From: raid5atemyhomework
Subject: Re: A Critique of Shepherd Design
Date: Sun, 21 Mar 2021 00:22:09 +0000
Hello Ludo',
> Hi,
>
> raid5atemyhomework raid5atemyhomework@protonmail.com skribis:
>
> > Now, let us combine this with the second feature (really a bug): GNU
> > shepherd is a simple, single-threaded Scheme program. That means that
> > if the single thread enters an infinite loop (because of a Shepherd
> > service description that entered an infinite loop), then Shepherd
> > itself hangs.
>
> You’re right that it’s an issue; in practice, it’s okay because we pay
> attention to the code we run there, but obviously, mistakes could lead
> to the situation you describe.
>
> It’s a known problem and there are plans to address it, discussed on
> this list a few times before. The Shepherd “recently” switched to
> ‘signalfd’ for signal handling in the main loop, with an eye on making
> the whole loop event-driven:
>
> https://issues.guix.gnu.org/41507
>
> This will address this issue and unlock things like “socket activation”.
>
> That said, let’s not lie to ourselves: the Shepherd’s design is
> simplistic. I think that’s okay though because there’s a way to address
> the main issues while keeping it simple.
I'm not sure you can afford to keep it simple. Consider:
https://issues.guix.gnu.org/47253
In that issue, the `networking` provision comes up potentially *before* the
network is, in fact, up. This means that other daemons that require
`networking` could potentially be started before the network connection is up.
One example of such a daemon is `transmission-daemon`. This daemon will bind
itself to port 9091 so you can control it. Unfortunately, if it gets started
while the network is down, it will be unable to bind to 9091 (so you can't
control it) but will still keep running. On my system that means that on
reboot I have to manually `sudo herd restart transmission-daemon`.
In another example, I have a custom daemon that I have set up to use the Tor
proxy over 127.0.0.1:9050. It requires both `networking` and `tor`. When it
starts after `networking` comes up but before the actual network does, it dies
because it can't access the proxy at 127.0.0.1:9050 (apparently NetworkManager
handles loopback as well). Then shepherd respawns it, then it dies again
(network still not up) enough times that it gets disabled. This means that on
reboot I have to manually `sudo herd enable raid5atemyhomework-custom-daemon`
and `sudo herd restart raid5atemyhomework-custom-daemon`.
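To make the failure mode concrete, here is a minimal Python sketch of a supervisor that respawns a crashing daemon and disables it once it has been respawned too often within a time window. This is only an illustration, not Shepherd's actual code: the function name `simulate_respawns`, the limit of 5 respawns, the 7-second window, and the 0.5-second crash delay are all assumptions chosen for the example.

```python
from collections import deque


def simulate_respawns(network_up_at, crash_delay=0.5, limit=5, window=7.0):
    """Simulate a daemon that dies until the network is up.

    All parameters are hypothetical: the daemon dies `crash_delay` seconds
    after each start while the network is down, and the supervisor disables
    it once `limit` respawns have already happened within `window` seconds.
    Returns ("running", t) or ("disabled", t), with t in simulated seconds.
    """
    now = 0.0
    recent = deque()  # simulated timestamps of recent respawns
    while True:
        if now >= network_up_at:
            return ("running", now)  # network is up, the daemon survives
        now += crash_delay  # daemon started, could not connect, died
        while recent and now - recent[0] > window:
            recent.popleft()  # forget respawns older than the window
        if len(recent) >= limit:
            return ("disabled", now)  # too many respawns: give up
        recent.append(now)  # respawn immediately


if __name__ == "__main__":
    # The network takes 10 simulated seconds, the daemon dies every 0.5 s.
    print(simulate_respawns(network_up_at=10.0))
```

Under these assumed numbers the daemon is disabled long before the network comes up, which matches what I see on reboot. A daemon that dies more slowly (say every 2 seconds) never trips the limit and eventually stays up.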
On SystemD-based systems, there's a `NetworkManager-wait-online.service`
which just calls `nm-online -s -q --timeout=30`. This delays network-requiring
daemons until after the network is in fact actually up.
However, in https://issues.guix.gnu.org/47253#1 Mark points out this is
undesirable in the Guix case, since it could stall the (single-threaded)
bootup process for up to 30 seconds if the network is physically
disconnected. That is a bad UX for desktop and laptop users (who might still
want to run `transmission-daemon`, BTW), because it potentially blocks the
initialization of X and makes the computer unusable for such users for up to
30 seconds after boot. I note that I experienced such issues in some very old
Ubuntu installations, as well.
SystemD can afford to *always* have `nm-online -s -q --timeout=30` because it's
concurrent. That service will block, but other services like X that don't
***need*** the network will continue booting. So the user can still get to a
usable system even if the boot isn't complete because the network isn't up yet
due to factors beyond the control of the operating system.
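The difference can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how SystemD or Shepherd is implemented: the function `start_all`, the service names, and the start-up durations (0.3 s standing in for the `nm-online` wait) are all hypothetical.

```python
import threading
import time

# Hypothetical services: name -> (requirements, seconds needed to start).
SERVICES = {
    "network-online": ((), 0.3),  # stands in for the blocking nm-online wait
    "xorg": ((), 0.05),           # does not need the network
    "transmission-daemon": (("network-online",), 0.05),
}


def start_all(services, concurrent):
    """Start every service; return name -> seconds after boot it was ready."""
    t0 = time.monotonic()
    ready_at = {}
    ready = {name: threading.Event() for name in services}

    def start(name):
        requires, seconds = services[name]
        for dep in requires:
            ready[dep].wait()  # block until each requirement is ready
        time.sleep(seconds)    # simulated start-up work
        ready_at[name] = time.monotonic() - t0
        ready[name].set()

    if concurrent:
        # One thread per service: only dependents wait for the network.
        threads = [threading.Thread(target=start, args=(n,)) for n in services]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    else:
        # Single-threaded: assumes the dict lists requirements first.
        for name in services:
            start(name)
    return ready_at


if __name__ == "__main__":
    for mode in (False, True):
        times = start_all(SERVICES, concurrent=mode)
        print("concurrent" if mode else "sequential",
              {name: round(t, 2) for name, t in times.items()})
```

In the sequential run, `xorg` only becomes ready after the network wait has finished. In the concurrent run it is ready well before, while `transmission-daemon` still correctly waits for the network.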
Switching to a concurrent design for Shepherd (*any* concurrent design) is
probably best done sooner rather than later, because such a switch risks
strongly affecting customized `configuration.scm`s like mine, which has almost
half a dozen custom Shepherd daemons.
Thanks
raid5atemyhomework