[Dejagnu-commit] [SCM] DejaGNU branch, queue, created. dejagnu_1_4

From:	Jacob Bachmeyer
Subject:	[Dejagnu-commit] [SCM] DejaGNU branch, queue, created. dejagnu_1_4_3-730-g06f755f
Date:	Thu, 25 Jun 2020 19:43:35 -0400 (EDT)
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "DejaGNU".

The branch, queue has been created
        at  06f755fd5ef68d0309e2dd96f80e6634981c1f9b (commit)

- Log -----------------------------------------------------------------
commit 06f755fd5ef68d0309e2dd96f80e6634981c1f9b
Merge: 494df2b ed7ed28
Author: Jacob Bachmeyer <jcb62281+dev@gmail.com>
Date:   Thu Jun 25 17:49:38 2020 -0500

    Merge branch 'timeout-fix-for-1.6.3'
    
    Conflicts:
        ChangeLog

commit 494df2b73f67dadf0ad58ac540d737c9083875c8
Merge: c70b720 8c750f7
Author: Jacob Bachmeyer <jcb62281+dev@gmail.com>
Date:   Thu Jun 25 17:48:28 2020 -0500

    Merge branch 'new-api-for-1.6.3'
    
    Conflicts:
        ChangeLog
        NEWS
        doc/dejagnu.texi

commit c70b720ac4e1bd32d447f7727d6e8c0f84c1ba4f
Merge: c197ab9 024297c
Author: Jacob Bachmeyer <jcb62281+dev@gmail.com>
Date:   Thu Jun 25 17:43:38 2020 -0500

    Merge branch 'gdb-upstream-for-1.6.3'
    
    Conflicts:
        ChangeLog

commit ed7ed2891c051f8114db1a87b36aefd120fbfc0f
Author: Jacob Bachmeyer <jcb62281+dev@gmail.com>
Date:   Mon Jun 22 23:37:21 2020 -0500

    Update ChangeLog

commit 1b09d3a7b9912aab0a3e3ba2a63ebaa8e61f3238
Author: Maciej W. Rozycki <macro@wdc.com>
Date:   Thu Jun 11 02:31:02 2020 +0100

    remote: Fix a stuck remote call pipeline causing testing to hang
    
    Fix a stuck remote call pipeline comprised of multiple processes causing
    testing to hang and requiring a manual intervention to either terminate
    or proceed, like below (here with the GCC `c' testsuite invoked with
    `execute.exp=postmod-1.c' for 8 compilation and 8 execution tests on a
    remote QEMU target run in the system emulation mode):
    
    PASS: gcc.c-torture/execute/postmod-1.c   -O0  (test for excess errors)
    Executing on remote-localhost: .../gcc/testsuite/gcc/postmod-1.exe (timeout = 15)
    spawn [open ...]
    WARNING: program timed out
    got a INT signal, interrupted by user
    
                === gcc Summary ===
    
    # of expected passes                1
    
    by not killing the pending force-kills in `close_wait_program' and also
    by setting the channel associated with the pipeline to the nonblocking
    mode when it is about to be closed afterwards.
    
    The situation here is as follows.  A connection to the remote target
    board is requested by `rsh_exec' with input redirection requested from
    `/dev/null'.  The request is handled by `local_exec' and the redirection
    causes a Tcl command pipeline channel to be opened.  The list of PIDs of
    the processes comprising the pipeline is determined and then the channel
    is assigned an Expect spawn ID.  The spawn ID is then monitored for
    output produced by the remote target (here accessed with SSH) and,
    ultimately, for completion, marked by the end-of-file condition.
    
    As SSH gets stuck and does not complete, the timeout eventually fires
    and a kill sequence is initiated by calling `close_wait_program' with
    the previously obtained list of PIDs to kill given as one of the
    procedure's arguments.  Seeing the list of PIDs rather than -1,
    `close_wait_program' issues SIGINT to all the requested processes right
    away and schedules a delayed sequence called "force-kills" for them,
    which sends SIGTERM and then, after a further delay, SIGKILL.
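
The escalation described above can be sketched in Python. This is only an analogy, not the Tcl code in lib/remote.exp: the function name `escalate`, the `delay` parameter, and the inline polling loop are assumptions made for illustration, whereas DejaGnu schedules the force-kills as a separate delayed sequence.

```python
import signal
import time


def escalate(proc, delay=0.2):
    """Send SIGINT, then SIGTERM, then SIGKILL to a subprocess.Popen object.

    Hypothetical helper: waits up to `delay` seconds after each signal and
    stops as soon as the process has exited.  Returns the signals sent.
    """
    sent = []
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        if proc.poll() is not None:
            break  # already exited, nothing more to send
        proc.send_signal(sig)
        sent.append(sig)
        deadline = time.monotonic() + delay
        while time.monotonic() < deadline and proc.poll() is None:
            time.sleep(0.01)
    return sent
```

A process that ignores SIGINT and SIGTERM receives all three signals; one that honours SIGINT receives only the first.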
    
    Now `close_wait_program' calls `close' on the spawn ID associated with
    the pipeline, but this call doesn't affect the pipeline as its input has
    been redirected from `/dev/null'.  As the next step `wait' is called on
    the same spawn ID and returns successfully right away with a result like
    {0 exp8 0 0} in `wres', where no PID is indicated, consistently with the
    null PID result of the original `spawn' command that assigned the spawn
    ID (`exp8' here) to the pipeline.  The return from the `wait' command
    causes the code that kills the pending force-kills to be executed.
    
    At this point the process situation is like below:
    
      PID TTY      STAT   TIME COMMAND
     6908 pts/3    Sl     0:00 expect -- .../share/dejagnu/runtest.exp --tool gcc --target_board remote-localhost execute.exp=postmod-1.c
     6976 pts/3    S      0:00  \_ ssh -p 2222 -l macro localhost sh -c '.../gcc/testsuite/gcc/postmod-1.exe ; echo XYZ${?}ZYX'
     6977 pts/3    Z      0:00  \_ [cat] <defunct>
     6991 pts/3    Z      0:00  \_ [sh] <defunct>
    
    so `cat' and `sh' have already terminated, the former presumably due to
    SIGINT sent previously and the latter having been the force-kills just
    killed, and only await being wait(2)ed for.  However, `ssh' is still
    live and in interruptible sleep, presumably awaiting communication with
    the remote end.
    
    Since there is nothing else for `close_wait_program' to do, it returns
    success to `local_exec', which then calls `close' on the pipeline to
    clean up after it.  But that in turn causes wait(2) to be called on the
    individual PIDs comprising the pipeline, and when the PID associated
    with `ssh' is reached the call hangs indefinitely, preventing the whole
    testsuite from proceeding.
    
    A similar situation triggers with GDB testing where a Tcl command
    pipeline channel is opened in `remote_spawn' instead, and then closed,
    after `close_wait_program' has been called, in `standard_close'.
    
    So the solution to the problem is twofold.  First, pending force-kills
    are not killed after `wait' if there is more than one PID in the list
    passed to `close_wait_program'.  This follows the observation that if
    there was only one PID on the list, then the process must have been
    created directly by `spawn' rather than by assigning a spawn ID to a
    pipeline, and the return from `wait' would mean the process associated
    with the PID must have already been cleaned up after.  It is therefore
    only when there are more PIDs that any of the processes may have been
    left behind live.
    
    Second, if a pipeline has been used, then the channel associated with
    the pipeline is set to the nonblocking mode in case any of the processes
    that may have been left live is stuck in the uninterruptible sleep (aka
    D) state.  Such a process would necessarily ignore even SIGKILL so long
    as it remains in that state and would cause wait(2) called by `close' to
    hang, possibly indefinitely, and we want the testsuite to proceed rather
    than hang even in bad circumstances.
    
    Finally, it appears to be safe to leave pending force-kills to complete
    their job after `wait' has been called in `close_wait_program', because
    based on the observation made here the command does not actually call
    wait(2) if issued on a spawn ID associated with a pipeline created by
    `open' rather than a process created by `spawn'.  Instead the PIDs from
    a pipeline are supposed to be cleaned up after by calling wait(2) from
    the `close' command call made on the pipeline channel.  If on the other
    hand the channel is set to the nonblocking mode before `close', then
    even that command does not call wait(2) on the associated PIDs.
    
    Therefore the PIDs on the list passed are not subject to PID reuse and
    the force-kills won't accidentally kill an unrelated process, as a PID
    cannot be allocated by the kernel for a new process until any previous
    process's status has been consumed from its PID by wait(2).  And then
    PIDs of any children that have actually terminated one way or another
    are wait(2)ed for by Tcl automatically in the event loop, so no mess is
    left behind.
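
The PID-reuse argument above can be checked with a small Python experiment on a POSIX system. It is a sketch and not part of the change: the helper names `pid_is_allocated` and `zombie_then_reap` are hypothetical, and it assumes a platform that provides `os.waitid` with `WNOWAIT`, which waits for a child to exit without consuming its status.

```python
import os
import subprocess
import sys


def pid_is_allocated(pid):
    """Return True if `pid` still names a process, live or zombie."""
    try:
        os.kill(pid, 0)  # signal 0 only checks that the PID exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists but belongs to another user
    return True


def zombie_then_reap():
    """Return (allocated_before_wait, allocated_after_wait) for a child."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # Block until the child has exited but leave its status unconsumed,
    # so that it stays a zombie.
    os.waitid(os.P_PID, child.pid, os.WEXITED | os.WNOWAIT)
    before = pid_is_allocated(child.pid)
    child.wait()  # consumes the status; the kernel may now reuse the PID
    after = pid_is_allocated(child.pid)
    return before, after
```

While the child is a zombie its PID is still reserved, so a signal sent to it cannot reach an unrelated process. Only after the status has been consumed does the PID become free.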
    
        * lib/remote.exp (close_wait_program): Only kill the pending
        force-kills if the PID list has a single entry.
        (local_exec): Set the channel about to be closed to the
        nonblocking mode if we didn't see an EOF.
        (standard_close): Likewise, unconditionally.
    
    Signed-off-by: Maciej W. Rozycki <macro@wdc.com>

commit 04668d6771f583c5b0a782e075acb71191c33b55
Author: Maciej W. Rozycki <macro@wdc.com>
Date:   Thu Jun 11 02:30:44 2020 +0100

    remote: Use `catch' in killing pending force-kills
    
    Address an execution race in `close_wait_program' and use `catch' in
    killing pending force-kills issued there in the recovery of a stuck test
    case, in case the force-kill sequence has completed before the command
    to kill the sequence had a chance to run, so that no error is thrown and
    a testsuite run does not get interrupted early like:
    
    PASS: gcc.c-torture/execute/postmod-1.c   -O0  (test for excess errors)
    Executing on remote-localhost: .../gcc/testsuite/gcc/postmod-1.exe (timeout = 15)
    spawn [open ...]
    WARNING: program timed out
    ERROR: tcl error sourcing .../gcc/testsuite/gcc.c-torture/execute/execute.exp.
    ERROR: child process exited abnormally
        while executing
    "exec sh -c "exec > /dev/null 2>&1 && kill -9 $exec_pid""
        (procedure "close_wait_program" line 57)
        invoked from within
    "close_wait_program $spawn_id $pid wres"
        (procedure "local_exec" line 104)
    [...]
    "uplevel #0 source .../gcc/testsuite/gcc.c-torture/execute/execute.exp"
        invoked from within
    "catch "uplevel #0 source $test_file_name""
    testcase .../gcc/testsuite/gcc.c-torture/execute/execute.exp completed in 196 seconds
    
                === gcc Summary ===
    
    # of expected passes                1
    
    -- therefore not letting `execute.exp' continue (here with the GCC `c'
    testsuite invoked with `execute.exp=postmod-1.c' for 8 compilation and 8
    execution tests).
    
    The completion of the force-kill sequence would have to happen in the
    window between the return of the `wait' command, which would at worst
    happen as a result of the final `kill -9' command in the sequence, and
    the `kill -9 $exec_pid' command issued here.  The `sleep 5' command
    issued at the end of the force-kill sequence makes the likelihood of
    such a scenario low, but this might still happen with a loaded host
    system and there is no drawback to using `catch' here, so let's do it.
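
The effect of wrapping the kill in `catch' can be illustrated with a Python analogy, in which `try`/`except` plays the role of Tcl's `catch`. The helper name `kill_if_running` is hypothetical, and the sketch signals the process directly instead of going through `exec sh -c` as the original does.

```python
import os
import signal


def kill_if_running(pid, sig=signal.SIGKILL):
    """Send `sig` to `pid`, tolerating a process that is already gone.

    Returns True if the signal was delivered and False if the process
    had already exited, instead of letting the error propagate and
    abort the caller.
    """
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False  # lost the race: the target finished first
    return True
```

Without the `except` clause, losing the race would raise an exception in the caller, much as the uncaught `exec' error interrupted `execute.exp' in the log above.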
    
        * lib/remote.exp (close_wait_program): Use `catch' in killing
        pending force-kills.
    
    Signed-off-by: Maciej W. Rozycki <macro@wdc.com>

-----------------------------------------------------------------------


hooks/post-receive
-- 
DejaGNU