guix-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services


From: Carlo Zancanaro
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Sat, 03 Mar 2018 18:58:50 +1100
User-agent: mu4e 1.0; emacs 25.3.1

Hey Ludo,

I've re-written my patch, and it's attached in two commits. The first one adds the necessary calls to prctl, and the second adds the fallback to polling.

On Fri, Mar 02 2018, Ludovic Courtès wrote:
The ‘prctl’ procedure itself should simply throw to 'system-error on GNU/Hurd. But then, in (shepherd), we could add the polling thing when (not (string-contains %host-type "linux")).

WDYT?

I don't like the idea of doing this based on the host type. In my patch I've done it based on whether the prctl call succeeded. If the prctl call throws a system-error then we poll, otherwise we rely on SIGCHLD. I don't have a system set up with another kernel, though, so I don't know how I can easily test whether the fallback logic is working properly. When I replaced the prctl call with (throw 'system-error) it seemed to work.

The fallback code still fails in the guix build environment (as my previous patch did), but when it's using prctl it works properly. This means that a build on Linux pre-3.4, or on Hurd, will fail, which probably isn't acceptable given that shepherd is a hard dependency for starting a GuixSD system. As far as I can tell the test fails because the processes stick around as zombies, so I assume that pid 1 in the build container isn't properly reaping zombie processes. Do you have any ideas how I can force it to do so?

We want to set the “reaper” of child processes to Shepherd itself, so we must do that in child processes. The shepherd process cannot be its own reaper I suppose.

Reading the manpage, and then running some code, I think you're wrong about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling process as a child subreaper. That means that any processes that are orphaned below the current process get reparented under the current process (or a closer child subreaper, if there's one further down). If there are no processes marked as child subreapers, then the orphaned process is reparented under pid 1. We should only need to call prctl in two places: when shepherd initially starts, or when we fork for daemonize.

Carlo

From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <address@hidden>
Date: Sat, 3 Mar 2018 17:26:05 +1100
Subject: [PATCH 1/2] Handle forked process SIGCHLD signals

* Makefile.am (TESTS): Add tests/forking-service.sh.
* configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
* modules/shepherd.scm: Set the child subreaper attribute of main shepherd
  process (as long as we're not pid 1).
* modules/shepherd/service.scm (root-service)[daemonize]: Set the child
  subreaper attribute of newly forked shepherd process.
* modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
  and export it.
  (prctl): Add new procedure and export it.
---
 Makefile.am                    |   1 +
 configure.ac                   |   4 ++
 modules/shepherd.scm           |   2 +
 modules/shepherd/service.scm   |   4 +-
 modules/shepherd/system.scm.in |  17 ++++++-
 tests/forking-service.sh       | 111 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 137 insertions(+), 2 deletions(-)
 create mode 100644 tests/forking-service.sh

diff --git a/Makefile.am b/Makefile.am
index eafa308..8dad006 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -190,6 +190,7 @@ TESTS =                                             \
   tests/no-home.sh                             \
   tests/pid-file.sh                            \
   tests/status-sexp.sh                         \
+  tests/forking-service.sh                     \
   tests/signals.sh
 
 TEST_EXTENSIONS = .sh
diff --git a/configure.ac b/configure.ac
index bb5058d..fbe16f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -72,7 +72,11 @@ esac
 AC_SUBST([RB_AUTOBOOT])
 AC_SUBST([RB_HALT_SYSTEM])
 AC_SUBST([RB_POWER_OFF])
+AC_MSG_RESULT([done])
 
+AC_MSG_CHECKING([<sys/prctl.h> constants])
+AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#include 
<sys/prctl.h>])
+AC_SUBST([PR_SET_CHILD_SUBREAPER])
 AC_MSG_RESULT([done])
 
 dnl Manual pages.
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index df5420f..ab59e08 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -50,6 +50,8 @@
 ;; Main program.
 (define (main . args)
   (initialize-cli)
+  (when (not (= 1 (getpid)))
+    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
 
   (let ((config-file #f)
        (socket-file default-socket-file)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 2224932..b6394f2 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -1274,7 +1274,9 @@ we want to receive these signals."
            (local-output "Running as PID 1, so not daemonizing."))
           (else
            (if (zero? (primitive-fork))
-               #t
+               (begin
+                 (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))
+                 #t)
                (primitive-exit 0))))))
      (persistency
       "Safe the current state of running and non-running services.
diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in
index a54dca7..55806cb 100644
--- a/modules/shepherd/system.scm.in
+++ b/modules/shepherd/system.scm.in
@@ -23,7 +23,9 @@
   #:export (reboot
             halt
             power-off
-            max-file-descriptors))
+            max-file-descriptors
+            prctl
+            PR_SET_CHILD_SUBREAPER))
 
 ;; The <sys/reboot.h> constants.
 (define RB_AUTOBOOT @RB_AUTOBOOT@)
@@ -130,6 +132,19 @@ the returned procedure is called."
                    (list err))
             result)))))
 
+(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@)
+
+(define prctl
+  (let ((proc (syscall->procedure long "prctl" (list int int))))
+    (lambda (process operation)
+      "Perform an operation on the given process"
+      (let-values (((result err) (proc process operation)))
+        (if (= -1 result)
+            (throw 'system-error "prctl" "~A: ~S"
+                   (list (strerror err) name)
+                   (list err))
+            result)))))
+
 (define (max-file-descriptors)
   "Return the maximum number of open file descriptors allowed."
   (sysconf _SC_OPEN_MAX))
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
new file mode 100644
index 0000000..90c684a
--- /dev/null
+++ b/tests/forking-service.sh
@@ -0,0 +1,111 @@
+# GNU Shepherd --- Test detecting a forked process' termination
+# Copyright © 2016 Ludovic Courtès <address@hidden>
+# Copyright © 2018 Carlo Zancanaro <address@hidden>
+#
+# This file is part of the GNU Shepherd.
+#
+# The GNU Shepherd is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or (at
+# your option) any later version.
+#
+# The GNU Shepherd is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with the GNU Shepherd.  If not, see <http://www.gnu.org/licenses/>.
+
+shepherd --version
+herd --version
+
+socket="t-socket-$$"
+conf="t-conf-$$"
+log="t-log-$$"
+pid="t-pid-$$"
+service_pid="t-service-pid-$$"
+service2_pid="t-service2-pid-$$"
+service2_started="t-service2-starts-$$"
+
+herd="herd -s $socket"
+
+function cleanup() {
+    cat $log || true
+    rm -f $socket $conf $log $service2_started
+    test -f $pid && kill "$(cat $pid)" || true
+    rm -f $pid
+    test -f $service_pid && kill "$(cat $service_pid)" || true
+    rm -f $service_pid
+    test -f $service2_pid && kill "$(cat $service2_pid)" || true
+    rm -f $service2_pid
+}
+
+trap cleanup EXIT
+
+cat > "$conf"<<EOF
+(define %command
+  '("$SHELL" "-c" "sleep 600 & echo \$! > $PWD/$service_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test)
+   #:start (make-forkexec-constructor %command
+                                      #:pid-file "$PWD/$service_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #f))
+
+(define %command2
+  '("$SHELL" "-c" "echo started >> $PWD/$service2_started; sleep 600 & echo 
\$! > $PWD/$service2_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test2)
+   #:start (make-forkexec-constructor %command2
+                                      #:pid-file "$PWD/$service2_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #t))
+EOF
+cat $conf
+
+rm -f "$pid"
+shepherd -I -s "$socket" -c "$conf" -l "$log" --pid="$pid" &
+
+# Wait till it's ready.
+while ! test -f "$pid" ; do sleep 0.3 ; done
+
+shepherd_pid="$(cat $pid)"
+
+# start both of the services
+$herd start test
+$herd start test2
+
+# make sure "test" is started
+until $herd status test | grep started; do sleep 0.3; done
+test -f "$service_pid"
+service_pid_value="$(cat $service_pid)"
+# now kill it
+kill "$service_pid_value"
+while kill -0 "$service_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped within one second
+sleep 1
+$herd status test | grep stopped
+
+
+
+# make sure "test2" has started
+until $herd status test2 | grep started; do sleep 0.3; done
+test -f "$service2_pid"
+service2_pid_value="$(cat $service2_pid)"
+test "$(cat $PWD/$service2_started)" = "started"
+# now kill it
+rm -f "$service2_pid"
+kill $service2_pid_value
+while kill -0 "$service2_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped, and restart it, within 
one second
+sleep 1;
+$herd status test2 | grep started
+test "$(cat $PWD/$service2_started)" = "started
+started"
-- 
2.16.1

From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <address@hidden>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
---
 modules/shepherd.scm         | 31 ++++++++++++++----
 modules/shepherd/service.scm | 78 +++++++++++++++++++++++++++-----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 4 files changed, 76 insertions(+), 41 deletions(-)

diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index ab59e08..b824546 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,9 +51,17 @@

 ;; Main program.
 (define (main . args)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f)
+        (catch 'system-error
+          (lambda ()
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda (key . args)
+            check-for-dead-services))))
+
   (initialize-cli)
-  (when (not (= 1 (getpid)))
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
 
   (let ((config-file #f)
        (socket-file default-socket-file)
@@ -220,11 +230,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..fc53d76 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <address@hidden>
 ;; Copyright (C) 2014 Alex Sassmannshausen <address@hidden>
 ;; Copyright (C) 2016 Alex Kost <address@hidden>
+;; Copyright (C) 2018 Carlo Zancanaro <address@hidden>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1009,39 +1012,42 @@ otherwise by updating its state."
 
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
-         (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+         (respawn-service serv)
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  (when serv
+    (slot-set! serv 'running #f)
+    (if (and (respawn? serv)
+             (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                      (car respawn-limit)
+                                      (cdr respawn-limit))))
+        (if (not (slot-ref serv 'waiting-for-termination?))
+            (begin
+              ;; Everything is okay, start it.
+              (local-output "Respawning ~a."
+                            (canonical-name serv))
+              (slot-set! serv 'last-respawns
+                         (cons (current-time)
+                               (slot-ref serv 'last-respawns)))
+              (start serv))
+            ;; We have just been waiting for the
+            ;; termination.  The `running' slot has already
+            ;; been set to `#f' by `stop'.
+            (begin
+              (local-output "Service ~a terminated."
+                            (canonical-name serv))
+              (slot-set! serv 'waiting-for-termination? #f)))
+        (begin
+          (local-output "Service ~a has been disabled."
+                        (canonical-name serv))
+          (when (respawn? serv)
+            (local-output "  (Respawning too fast.)"))
+          (slot-set! serv 'enabled? #f)))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1177,18 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running 
(canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
             (service (version 0)
               (provides (foo)) (requires ())
               (respawn? #t) (docstring \"Foo!\")
-              (enabled? #t) (running 42) (conflicts ())
+              (enabled? #t) (running abc) (conflicts ())
               (last-respawns ()))
             (service (version 0)
               (provides (bar)) (requires (foo))
-- 
2.16.1

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]