Crash robustness (Was: Re: Dynamic modules: MODULE_HANDLE_SIGNALS etc.)

From: Daniel Colascione
Subject: Crash robustness (Was: Re: Dynamic modules: MODULE_HANDLE_SIGNALS etc.)
Date: Wed, 23 Dec 2015 08:25:51 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0

On 12/23/2015 08:07 AM, Eli Zaretskii wrote:
>> Cc: address@hidden, address@hidden,
>>  address@hidden, address@hidden, address@hidden
>> From: Daniel Colascione <address@hidden>
>> Date: Tue, 22 Dec 2015 13:18:21 -0800
>>>> Which is why you setjmp in places where you have a significant stack
>>>> reserve.
>>> There's no way of doing that portably, or even non-portably on many
>>> platforms.  You simply don't _know_ how much stack is left.
>> You can probe at program start and pre-allocate as much as is reasonable.
> Pre-allocate what?  Are you suggesting that Emacs allocates its own
> stack, instead of relying on the one provided by the linker and the
> OS?

We can alloca, say, 8MB, and write to the start and end of the allocated
region. Then we'll know we have at least that much stack space available.
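A minimal sketch of that probe, not Emacs code. `probe_stack` and the 4096-byte page size are assumptions made for illustration, and `alloca` is a non-standard extension (declared in `<alloca.h>` on glibc). The sketch touches every page, not only the two ends, so the access never skips past a guard page.

```c
#include <alloca.h>
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reserve `bytes` of stack with alloca and touch it.
   If it returns, at least that much stack was usable at the call site.
   If the stack is too small, the process dies with a fault instead. */
int probe_stack(size_t bytes)
{
    enum { PAGE = 4096 };   /* assumed page size */
    assert(bytes > 0);

    volatile char *region = alloca(bytes);

    /* On common platforms the stack grows downward, so the end of the
       region is nearest the caller.  Walk from the end toward the start,
       one page at a time, so each new page is faulted in order. */
    for (size_t off = bytes; off > 0; off = off > PAGE ? off - PAGE : 0)
        region[off - 1] = 0;
    region[0] = 0;          /* make sure the first byte is touched too */

    return 1;
}
```

Calling `probe_stack(1u << 20)` early in `main` checks for 1 MiB of headroom. The size has to stay below the process stack limit, since a failed probe ends in a fault, not an error return.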

>>>> Longjmp, by itself, is simple and clear. What's unreliable is longjmping
>>>> to Lisp at completely arbitrary points in the program, even ones marked
>>>> "GC can't happen here" and the like.
>>> We longjmp to a particular place, not arbitrary place.
>> But we longjmp _from_ anywhere, and "anywhere" might be in the middle of
>> any delicate code sequence, since the compiler can generate code to
>> write to new stack slots at any point.
> I simply don't see any trouble this could cause, except leaking some
> memory.  Can you describe in enough detail a single use case where
> this could have any other adverse effects that we should care about
> when recovering from stack overflow?

What happens if we overflow inside malloc? One possibility is that we'll
longjmp back to toplevel without releasing the heap lock, then deadlock
the next time we try to allocate.
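The hazard can be shown with a toy model. This is a sketch only: `heap_lock`, `toy_malloc_overflowing`, and the other names are hypothetical stand-ins, not the real allocator's internals, and the "overflow" is simulated by calling `longjmp` directly. A try-lock is used so the example reports the leaked lock instead of actually deadlocking.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag heap_lock = ATOMIC_FLAG_INIT; /* stand-in for malloc's lock */
static jmp_buf toplevel;                         /* stand-in for the recovery point */

static bool try_lock_heap(void) { return !atomic_flag_test_and_set(&heap_lock); }
static void unlock_heap(void)   { atomic_flag_clear(&heap_lock); }

/* Models an allocator call that "overflows" while holding the lock. */
static void toy_malloc_overflowing(void)
{
    (void) try_lock_heap();   /* allocator takes its lock */
    longjmp(toplevel, 1);     /* simulated overflow recovery: jump away */
    unlock_heap();            /* never reached, so the lock stays held */
}

/* Returns true if the lock is still held after "recovery". */
bool heap_lock_leaked_after_recovery(void)
{
    if (setjmp(toplevel) == 0)
        toy_malloc_overflowing();

    /* Back at top level: a blocking lock would hang here forever. */
    bool acquired = try_lock_heap();
    if (acquired)
        unlock_heap();
    return !acquired;
}
```

In this model `heap_lock_leaked_after_recovery()` returns true, which is the state in which the next blocking allocation would never return.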

>>>> You say Emacs shouldn't crash.  Fine. We can't make that guarantee
>>>> if the crash recovery code breaks program invariants.
>>> Crash recovery doesn't need to keep invariants.  Or maybe I
>>> misunderstand which invariants you have in mind.
>> Any stack allocation anywhere in the program can longjmp. It's
>> impossible to reason about safety in that situation.
> Emacs is not safety-critical software, so there's no requirement to
> reason about safety.  Since I think the recovery's only role is to
> allow the user to exit Emacs in a controlled way without losing work,
> I simply don't see any problem that could be caused by longjmping from
> an arbitrary stack allocation.  After all, stack allocation is just
> assignment of value to a register, and sometimes grafting a range of
> memory pages into the memory set.
>>>> Failing that, we should allocate guard pages, unprotect the guard
>>>> pages on overflow
>>> That's what the OS is for.  It would be wrong for us to start messing
>>> with page protection etc.  The exception caused by stack overflow
>>> removes protection from the guard page to let you do something simple,
>>> like run the exception handler -- are you suggesting we catch the
>>> exception and mess with protection bits as well, i.e. replace one of
>>> the core functions of a modern OS?  All that because what we have now
>>> is not elegant enough for us?  Doesn't sound right to me.
>> We have a program that has its own Lisp runtime, has its own memory
>> allocation system, uses its own virtual filesystem access layer, and
>> that brings itself back from the dead. We're well past replicating OS
>> functionality.
> Actually, most of the above is simply untrue: we use system allocators
> to allocate memory

We have internal allocators for strings and conses and use the system
allocator only for backing storage.

> use mundane C APIs like 'open' and 'read' to
> access files

We must.

> , and if by "bringing itself from the dead" you allude to
> unexec, then what it does is a subset of what every linker does,
> hardly OS stuff.

Granted, that's toolchain work, not "OS" work, but it's still outside
the domain of most text editors.

> I think we should strive to distance ourselves from the OS business,
> not the other way around.  There was a time when doing complex things
> sometimes required messing with low-level functionality like that, but
> that time is long past.  Allocating our own stack, setting up and
> managing our own guard pages and the related exceptions -- we
> shouldn't go back there.

If an OS provides a documented and supported facility, there's no shame
in using it. I'm not sure how worrying about whether that facility is
"OS business" is useful.

>> It's not a matter of elegance: it's a matter of correctness. The current
>> scheme is unsafe.
> Emacs is not safety-critical software.  It doesn't need to be "safe"
> by your definition, if I understand it correctly.

It's not safety-critical software, but undefined behavior is undefined.
What makes us confident that we can't corrupt buffer data by longjmping
from the wrong place? Anything can happen because we can longjmp from
anywhere.

It's admirable to avoid the loss of user data, but I think there's a way
that's both safer and more general. Instead of trying to catch stack
overflow, let's treat stack overflow as a normal fatal error and instead
think about how we can preserve buffer contents on fatal errors generally.

What if we just installed a SIGSEGV handler (or, on Windows, a vectored
exception handler) that wrote buffer contents to a special file on a
fatal signal, then allowed that fatal signal to propagate normally? The
next time Emacs starts, we can restore the buffers we've saved this way
and ask users to save them --- just like autosave, but done on-demand,
at crash time, in C code, on the alternate signal stack.
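A minimal POSIX sketch of that idea, under stated assumptions. `install_crash_saver`, `crash_demo`, the 64 KiB alternate stack, and the single flat text buffer are all hypothetical simplifications. Real Emacs buffers are not flat strings, and the Windows vectored-handler variant is not shown. The handler restricts itself to async-signal-safe calls (`open`, `write`, `close`, `raise`).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static char save_path[256];      /* where the crash-time copy goes */
static const char *buffer_text;  /* stand-in for buffer contents */
static size_t buffer_len;

/* Runs on the alternate stack: dump the text, then let the signal kill us. */
static void on_fatal(int sig)
{
    int fd = open(save_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd >= 0) {
        size_t done = 0;
        while (done < buffer_len) {
            ssize_t n = write(fd, buffer_text + done, buffer_len - done);
            if (n <= 0)
                break;
            done += (size_t) n;
        }
        close(fd);
    }
    /* SA_RESETHAND already restored the default action, so re-raising
       terminates the process with the original signal. */
    raise(sig);
}

int install_crash_saver(const char *path, const char *text, size_t len)
{
    static char altstack[64 * 1024];   /* assumed size for the signal stack */
    stack_t ss = { .ss_sp = altstack, .ss_size = sizeof altstack, .ss_flags = 0 };
    struct sigaction sa;

    snprintf(save_path, sizeof save_path, "%s", path);
    buffer_text = text;
    buffer_len = len;

    if (sigaltstack(&ss, NULL) != 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fatal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_ONSTACK | SA_RESETHAND;
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Demo: crash a child process, then check that it died from SIGSEGV
   and that the text reached the file.  Returns 1 on success. */
int crash_demo(const char *path, const char *text)
{
    unlink(path);                      /* ignore stale output from earlier runs */
    pid_t pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0) {
        if (install_crash_saver(path, text, strlen(text)) != 0)
            _exit(2);
        raise(SIGSEGV);                /* simulated fatal signal */
        _exit(3);                      /* not reached if the signal propagates */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return 0;

    char got[256] = { 0 };
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    size_t n = fread(got, 1, sizeof got - 1, f);
    fclose(f);
    got[n] = '\0';

    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV
           && strcmp(got, text) == 0;
}
```

Because the handler re-raises with the default action restored, the parent still sees an ordinary SIGSEGV death. The recovery file is a side effect, and no attempt is made to resume the crashed process.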

>>>> and call out_of_memory so that it's obvious Emacs is in a bad
>>>> state. This way, we don't have to longjmp out of arbitrary code
>>>> sequences.
>>> There's no problem longjmping out of arbitrary code sequences.  When
>>> you debug a program, you do that all the time.
>> In GDB, interrupting normal control flow is not part of standard
>> debugging practice.
> ??? Every time a debuggee hits a breakpoint, the normal control flow
> is interrupted, and you in effect have a huge longjmp -- from the
> debuggee to the debugger.

When a program hits a breakpoint, the OS sends it a signal. A debugger
that's ptraced its debuggee will receive that signal, suspend execution,
and give control to the user. If the user opts to continue execution,
the debugger restores the debuggee to the state it was in when it
received the signal, then allows it to resume execution.

At no point does the debugger force a debuggee to longjmp. Debuggers take
pains to make programs behave as if breakpoints weren't there at all. We
don't try to resume execution at the point of a stack overflow.

Attachment: signature.asc
Description: OpenPGP digital signature
