qemu-devel

Re: [Qemu-devel] [PATCH] Handle terminating signals.


From: Ian Jackson
Subject: Re: [Qemu-devel] [PATCH] Handle terminating signals.
Date: Wed, 13 Aug 2008 14:29:17 +0100

Gerd Hoffmann writes ("Re: [Qemu-devel] [PATCH] Handle terminating signals."):
> Ian Jackson wrote:
> > No, because the program should not attempt to catch SEGV either.
> 
> Why not?  Can you change your attitude to say "no" without giving
> reasons please?

I guess this is another one of those pieces of `obvious' wisdom which
no-one previously bothered writing down.  I tried to find a clear
online reference which explains why but I couldn't find one, so I'll
try to explain it here:

If your program gets a SEGV, that means its memory may be corrupted.
Because the signal is asynchronous and may happen at any point, the
values of variables and data structures may be arbitrary and of course
inconsistent (since perhaps the program was halfway through modifying
a data structure).  The contents of the stack cannot be relied on.
Even the stack pointer may be corrupted (that might be the cause of
the crash).

Technically, _any_ attempt to resume the mainline program execution
(whether by returning or longjmping out of the SEGV handler, or by
running the same code inside the signal handler) has undefined
behaviour.  That includes _any attempt to access a global variable_
(other than a variable which is specially marked and protected, and
of course where the signal blocking is used appropriately - which is
not and cannot be the case here).
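
For contrast, here is a minimal sketch of what a "specially marked"
variable looks like for a signal that is safe to catch, such as SIGTERM.
It does not apply to SEGV. The names on_terminate,
install_terminate_handler and termination_was_requested are hypothetical
and not taken from the patch under discussion; the sketch assumes a
POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* The one kind of global a handler may portably write to:
 * a volatile sig_atomic_t flag.  (Hypothetical name.) */
static volatile sig_atomic_t termination_requested = 0;

/* Handler: record the request and return.  No cleanup, no I/O,
 * no access to other program state. */
static void on_terminate(int signo)
{
    (void)signo;
    termination_requested = 1;
}

/* Install the handler for a terminating signal such as SIGTERM.
 * Returns 0 on success, -1 on error (as sigaction does). */
int install_terminate_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Polled from the main loop, where cleanup can be done safely. */
int termination_was_requested(void)
{
    return termination_requested != 0;
}
```

The main loop would poll termination_was_requested() and shut down from
ordinary code.  This works only because SIGTERM does not imply that the
program's own state is corrupt, which is exactly what cannot be assumed
after a SEGV.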

This means that attempts to `recover' or `clean up' when you get SEGV
are at least as likely to make things worse as they are to make things
better.  Furthermore, bugs tend to happen in complex corner cases,
which is precisely the kind of situation where trying to continue or
tidy up after a SEGV is likely to do harm.  Even if the recovery code
doesn't actually worsen the data corruption and loss, it will make the
situation more confusing by making additional changes to the system
state.

That means it's much harder for a system administrator to recover or
repair, and also much harder for the operating system or application's
crash recovery code to cope with.  Post-crash data recovery is a
well-developed field; there are well-understood strategies adopted by
humans and computers under various circumstances.  These are likely to
be defeated by attempts to `clean up' between the detection of a fatal
corruption and actual death.

Both as a system administrator and as a programmer I curse programs
which try to trap fatal bugs like this.  It makes it harder to figure
out why the system is failing; it makes it harder to restore the
system to a working and correct state; it makes it harder to avoid
data corruption; and in the worst case it can also lead to a cavalier
attitude towards bugs.

Thus the best thing to do - usually the only sane thing to do - with
the coredumping signals (SEGV, BUS, ABRT, QUIT, as applicable and
available) is to leave them unblocked and set to SIG_DFL.
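
As a rough sketch of that recommendation, assuming a single-threaded
POSIX program: the hypothetical helper restore_coredump_defaults below
resets each coredumping signal to SIG_DFL and makes sure none of them is
blocked.  A program that never touches these signals needs no such code;
it only matters if something earlier (a library, or an inherited signal
mask) changed them.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* The coredumping signals named above.  SIGBUS is guarded because
 * the text says "as applicable and available". */
static const int coredump_signals[] = {
    SIGSEGV,
#ifdef SIGBUS
    SIGBUS,
#endif
    SIGABRT,
    SIGQUIT
};

/* Hypothetical helper: set every coredumping signal to SIG_DFL and
 * unblock it.  Returns 0 on success, -1 on the first failure.
 * Assumption: single-threaded; a threaded program would use
 * pthread_sigmask instead of sigprocmask. */
int restore_coredump_defaults(void)
{
    struct sigaction sa;
    sigset_t unblock;
    size_t i;
    size_t n = sizeof(coredump_signals) / sizeof(coredump_signals[0]);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_DFL;
    sigemptyset(&sa.sa_mask);
    sigemptyset(&unblock);

    for (i = 0; i < n; i++) {
        if (sigaction(coredump_signals[i], &sa, NULL) != 0)
            return -1;
        sigaddset(&unblock, coredump_signals[i]);
    }
    return sigprocmask(SIG_UNBLOCK, &unblock, NULL);
}
```

Calling restore_coredump_defaults() once at startup would be enough; the
kernel then handles the crash and produces the coredump, with no code of
the program's own running in between.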

(SIGQUIT is a different argument because it's only ever sent
explicitly.  But its purpose is to aid debugging and get an
instantaneous coredump, which would be defeated by catching it.)

Ian.



