bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#17561: Emacs can forget processes


From: Jorgen Schaefer
Subject: bug#17561: Emacs can forget processes
Date: Mon, 26 May 2014 20:49:52 +0200

On Mon, 26 May 2014 10:08:38 -0700
Paul Eggert <eggert@cs.ucla.edu> wrote:

> Jorgen Schaefer wrote:
> > The trace starts with a number of these triplets, which seem to be
> > "Emacs behaving normally".
> >
> > 16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [],
> > NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635> 16300
> > 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0
> > <0.000011> 16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH
> > IO], NULL, 8) = 0 <0.000010>
> 
> I don't observe this behavior when running Emacs on Fedora 20 (this
> is the latest emacs-24 version, running your little test).  I ran:
> 
> strace -o /tmp/tr -fttT src/bootstrap-emacs -nw -Q
> 
> I did see this:
> 
> 29589 09:44:10.979952 ioctl(4, FIONREAD, [0]) = 0 <0.000023>
> 29589 09:44:10.980031 ioctl(4, FIONREAD, [0]) = 0 <0.000024>
> 29589 09:44:10.980143 pselect6(6, [4 5], [], NULL, {0, 499810466}, 
> {NULL, 8}) = 0 (Timeout) <0.500499>
> 29589 09:44:11.480745 poll([{fd=5, events=POLLIN}], 1, 0) = 0
> (Timeout) <0.000030>
> 29589 09:44:11.480861 read(5, 0x7fff777b3820, 16) = -1 EAGAIN
> (Resource temporarily unavailable) <0.000029>
> 29589 09:44:11.481214 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8)
> = 0 <0.000030>
> 29589 09:44:11.481505 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL,
> 8) = 0 <0.000027>
> 29589 09:44:11.481594 ioctl(4, FIONREAD, [0]) = 0 <0.000030>
> 
> but that's not really the same. Could you please try running emacs
> -nw -Q with your test case, and see whether it behaves like the
> pattern on my platform?  If so, we might try to investigate why Emacs
> changes from this pattern to the pattern that you observe.

In "normal mode" (with no processes running), Emacs simply hangs in a
pselect6 with a "pretty long timeout". I have attached the full
traceback of the emacs -Q session.

> > Then there's a large bunch of syscalls related to my command to
> > restart the process, with the "\r" now being me sending the M-x
> > command:
> 
> Which M-x command was that?  M-x list-processes?

M-x elpy-rpc-restart

https://github.com/jorgenschaefer/elpy/blob/master/elpy.el#L1654-1662

The call to `kill-process' was added only recently as part of
debugging this problem, the bug showed up with only the `kill-buffer'
call, too.

The Elpy RPC processes are where I notice this bug the most, as there
are often quite a few of them, and I kill them regularly during development,
but I have seen this with e.g. openssl processes, too.

> > 16300 09:41:28.298391 read(3, "\r", 1)  = 1 <0.000012>
> > 16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
> > 16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
> > 16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021>
> > 16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235>
> > 16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL,
> > 8) = 0 <0.000017> 16300 09:41:28.321841 write(3, "\33[K\33[H\n\n",
> > 8) = 8 <0.000018> 16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK,
> > [WINCH IO], NULL, 8) = 0 <0.000012>
> 
> When I try your test, the child process is running in parallel with
> the parent, the 'kill' terminates the child, and the parent is
> signaled.  In contrast your trace shows no child, leading me to guess
> that the child has already exited (so the parent is killing a
> zombie), which means it's not the test case you sent but some other
> process (since the test case you sent waits on a pty so the child
> shouldn't exit).

That was for the already-existing process that exhibited the problem. I
wasn't able to reproduce it with the exact same code I sent this time.
Sorry, I should have been more clear about this.

> > Killing the process via the process list (hence the "d"):
> 
> You typed "d" in the window generated by M-x list-processes?  When I
> try that, it says "d is undefined".

"d runs the command process-menu-delete-process"

This is on a recent trunk version. It seems this was added after the
24.4 branch. (As I can't reliably reproduce this, I have to use
whichever Emacs I happen to be running at that point.)

I have also attached a file the-bug.txt.gz, which shows the strace of a
running Emacs session which starts to exhibit the bug, so we now have a
trace that includes the time when the bug actually happens. It shows me
starting Python subprocesses (elpy RPC processes) and killing them, in
an attempt to reproduce the problem. Eventually, this succeeds, and
Emacs goes into the loop reading from fd 15. I snipped a few megabytes
of logs there (marked "[...]"). Interestingly, I did not kill anything
there, I ran my "reproduction case", and Emacs cleaned up the broken
process by itself.

If I read this correctly, the "broken" process in question is 27012
(16300 is emacs). It gets killed at some point, right after a RET is
read (the elpy-rpc-restart), and goes straight into the loop:

16300 20:01:25.347174 kill(4294940284, SIGKILL <unfinished ...>
27012 20:01:25.349483 +++ killed by SIGKILL +++
16300 20:01:25.349515 <... kill resumed> ) = 0 <0.002304>
16300 20:01:25.349649 kill(4294940284, SIGHUP) = 0 <0.000019>
16300 20:01:25.354299 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) =
0 <0.0000 16300 20:01:25.354585 write(3, "\33[K\33[16;1H", 10) = 10
<0.000025> 16300 20:01:25.354687 rt_sigprocmask(SIG_UNBLOCK, [WINCH
IO], NULL, 8) = 0 <0.00 16300 20:01:25.354753 --- SIGIO (I/O possible)
@ 0 (0) --- 16300 20:01:25.354789 rt_sigreturn(0x1d) = 0 <0.000016>
16300 20:01:25.354862 ioctl(3, FIONREAD, [0]) = 0 <0.000017>
16300 20:01:25.355015 ioctl(3, FIONREAD, [0]) = 0 <0.000017>
16300 20:01:25.355098 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 670925
16300 20:01:25.355222 read(15, "", 4096) = 0 <0.000014>
16300 20:01:25.355455 ioctl(3, FIONREAD, [0]) = 0 <0.000017>
16300 20:01:25.355541 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 
626503 
16300 20:01:25.355642 read(15, "", 4096) = 0 <0.000015>
16300 20:01:25.355710 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 609628
16300 20:01:25.355804 read(15, "", 4096) = 0 <0.000015>
16300 20:01:25.355869 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 593577
16300 20:01:25.355962 read(15, "", 4096) = 0 <0.000014>
16300 20:01:25.356027 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 577782
16300 20:01:25.356156 read(15, "", 4096) = 0 <0.000015>

Comparing this with the "normal" code, I do not see a SIGCHLD. Emacs does a
SIGKILL (delete-process) followed by a SIGHUP (from killing the buffer?), but
there's no SIGCHLD. The last rt_sigprocmask with SIG_SETMASK before
that sets QUIT ALRM PROF, so that does not seem to be the culprit.

Strange stuff.

(And I see you noticed the discussion emacs-devel about epa :-))

Regards,
Jorgen

Attachment: emacs-Q-trace.txt
Description: Text document

Attachment: the-bug.txt.gz
Description: application/gzip


reply via email to

[Prev in Thread] Current Thread [Next in Thread]