guix-patches
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services


From: Carlo Zancanaro
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Mon, 05 Mar 2018 09:35:58 +1100
User-agent: mu4e 1.0; emacs 25.3.1


On Sun, Mar 04 2018, Ludovic Courtès wrote:
Good catch. We could add this in gnu-build-system.scm in core-updates, though it’s no big deal anyway since these are throw-away environments.

Thoughts?

The current forking-service.sh test fails in that environment, so we won't be able to build shepherd on Hurd, or systems with Linux pre 3.4. This is already the case without my third commit, though, because the prctl fallback logic isn't in place yet.

I think we should add it in core-updates. It does affect the behaviour of processes within the build environment, and can lead to test failures if people rely on pid 1 to reap zombie processes (which, from what I understand, they should be able to). This could even be leading to test failures in other packages which we have just disabled.

+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)

Here everyone ends up paying some overhead (the 0.5 second timeout),
which isn’t great.

How about something like:

  (define poll-services
    (and (not (= 1 (getpid)))
         …))

  (match (select (list sock) '() '() (if poll-services 0.5 0))
    …)

The wait for 0.5 seconds is only an upper-bound for the timeout. Changing it to a 0 would actually be worse, because it would spend longer polling for running services. The `select` procedure waits for `sock` to be ready to read from. When it's ready it returns immediately, but if `sock` takes more than 0.5 seconds to be ready then it will return anyway (and take the second branch in the match, which does nothing).

This should incur no (or extremely minuscule) overhead in how long it takes to respond to a socket, but provides an opportunity every half a second (at most) for shepherd to poll the running services.

On reflection, we should also change the commit message for this commit. I have attached a patch with a more accurate commit message.

Carlo

From 5b01f79522c815dd8277298e87eef0506c2e8612 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <address@hidden>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fails.

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
* doc/shepherd.texi (Slots of services): Add note about service running slot
  being a process id.
---
 doc/shepherd.texi            |  4 ++-
 modules/shepherd.scm         | 47 ++++++++++++++++++-------
 modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 5 files changed, 95 insertions(+), 46 deletions(-)

diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 815091f..47005d5 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -608,7 +608,9 @@ way.  The default value is @code{#f}, which indicates that 
the service
 is not running. When an attempt is made to start the service, it will
 be set to the return value of the procedure in the @code{start} slot.
 It will also be passed as an argument to the procedure in the
address@hidden slot.  This slot can not be initialized with a keyword.
address@hidden slot.  If it is set a value that is an integer, it is
+assumed to be a process id, and shepherd will monitor the process for
+unexpected exits.  This slot can not be initialized with a keyword.
 
 @item
 @vindex respawn? (slot of <service>)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index faa1e47..e912d21 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,14 +51,28 @@

 ;; Main program.
 (define (main . args)
-  (initialize-cli)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f) ;; If we're pid 1 then we don't need to set
+                       ;; PR_SET_CHILD_SUBREAPER
+        (catch 'system-error
+          (lambda ()
+            ;; Register for orphaned processes to be reparented onto us when
+            ;; their original parent dies. This lets us handle SIGCHLD from
+            ;; daemon processes that would otherwise have been reparented
+            ;; under pid 1. This is unnecessary when we are pid 1.
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda args
+            ;; We fall back to polling for services on systems that don't
+            ;; support prctl/PR_SET_CHILD_SUBREAPER
+            (let ((errno (system-error-errno args)))
+              (if (or (= ENOSYS errno) ;; prctl not available
+                      (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
+                  check-for-dead-services ;; poll
+                  (apply throw args)))))))
 
-  (when (not (= 1 (getpid)))
-    ;; Register for orphaned processes to be reparented onto us when their
-    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
-    ;; that would otherwise have been reparented under pid 1. This is
-    ;; unnecessary when we are pid 1.
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+  (initialize-cli)
 
   (let ((config-file #f)
        (socket-file default-socket-file)
@@ -225,11 +241,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..056483a 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <address@hidden>
 ;; Copyright (C) 2014 Alex Sassmannshausen <address@hidden>
 ;; Copyright (C) 2016 Alex Kost <address@hidden>
+;; Copyright (C) 2018 Carlo Zancanaro <address@hidden>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1010,38 +1013,44 @@ otherwise by updating its state."
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
          (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+           (respawn-service serv))
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  "Respawn a service that has stopped running unexpectedly. If we have
+attempted to respawn the service a number of times already and it keeps dying,
+then disable it."
+  (slot-set! serv 'running #f)
+  (if (and (respawn? serv)
+           (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                    (car respawn-limit)
+                                    (cdr respawn-limit))))
+      (if (not (slot-ref serv 'waiting-for-termination?))
+          (begin
+            ;; Everything is okay, start it.
+            (local-output "Respawning ~a."
+                          (canonical-name serv))
+            (slot-set! serv 'last-respawns
+                       (cons (current-time)
+                             (slot-ref serv 'last-respawns)))
+            (start serv))
+          ;; We have just been waiting for the
+          ;; termination.  The `running' slot has already
+          ;; been set to `#f' by `stop'.
+          (begin
+            (local-output "Service ~a terminated."
+                          (canonical-name serv))
+            (slot-set! serv 'waiting-for-termination? #f)))
+      (begin
+        (local-output "Service ~a has been disabled."
+                      (canonical-name serv))
+        (when (respawn? serv)
+          (local-output "  (Respawning too fast.)"))
+        (slot-set! serv 'enabled? #f))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1180,21 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  "Poll each process that we expect to be running, and respawn any which have
+unexpectedly stopped running. This procedure is used as a fallback on systems
+where prctl/PR_SET_CHILD_SUBREAPER is unsupported."
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running 
(canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
             (service (version 0)
               (provides (foo)) (requires ())
               (respawn? #t) (docstring \"Foo!\")
-              (enabled? #t) (running 42) (conflicts ())
+              (enabled? #t) (running abc) (conflicts ())
               (last-respawns ()))
             (service (version 0)
               (provides (bar)) (requires (foo))
-- 
2.16.1

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]