[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Thread safety of coroutine-sigaltstack
From: |
Max Reitz |
Subject: |
Thread safety of coroutine-sigaltstack |
Date: |
Wed, 20 Jan 2021 17:26:46 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0 |
Hi,
I’ve run into trouble with Vladimir’s async backup series on MacOS,
namely that iotest 256 fails with qemu exiting because of a SIGUSR2.
Turns out this is because MacOS (-xcode) uses coroutine-sigaltstack,
when I use this on Linux, I get the same error.
(You can find the series applied on my block branch e.g. here:
https://github.com/XanClic/qemu.git block
)
Some debugging later I found that the problem seems to be two threads
simultaneously creating a coroutine. It makes sense that this case
would appear with Vladimir’s series and iotest 256, because 256 runs two
backup jobs in two different threads in a transaction, i.e. they’re
launched simultaneously. The async backup series makes backup use many
concurrent coroutines and so by default launches 64+x coroutines when
the backup is started. Thus, the case of two coroutines created
concurrently in two threads is very likely to occur.
I think the problem is in coroutine-sigaltstack’s qemu_coroutine_new().
It sets up a SIGUSR2 handler, then changes the signal handling stack,
then raises SIGUSR2, then reverts the signal handling stack and the
SIGUSR2 handler. As far as I’m aware, setting up signal handlers and
changing the signal handling stack are both process-global operations,
and so if two threads do so concurrently, they will interfere with each
other. What usually happens is that one thread sets up everything,
while the other is already in the process of reverting its changes: So
the second thread reverts the SIGUSR2 handler to the default, and then
the first thread raises SIGUSR2, thus making qemu exit.
(Could be worse though. Both threads could set up the sigaltstack, then
both raise SIGUSR2, and then we get one coroutine_trampoline()
invocation in each thread, but both would use the same stack. But I
don’t think I’ve ever seen that happen, presumably because the race time
window is much shorter.)
Now, this all seems obvious to me, but I’m wondering... If
coroutine-sigaltstack really couldn’t create coroutines concurrently,
why wouldn’t we have noticed before? I mean, this new backup case is
kind of a stress test, yes, but surely we would have seen the problem
already, right? That’s why I’m not sure whether my analysis is correct.
Anyway, I’ve attached a patch that wraps the whole SIGUSR2 handling
section in a mutex, and that makes 256 pass reliably with Vladimir’s
async backup series. Besides being unsure whether the problem is really
in coroutine-sigaltstack, I also don’t know whether getting out the big
guns and wrapping everything in the mutex is the best solution. So,
it’s an RFC, I guess.
Max
0001-coroutine-sigaltstack-Add-SIGUSR2-mutex.patch
Description: Text Data
- Thread safety of coroutine-sigaltstack,
Max Reitz <=
- Re: Thread safety of coroutine-sigaltstack, Paolo Bonzini, 2021/01/20
- Re: Thread safety of coroutine-sigaltstack, Eric Blake, 2021/01/20
- Re: Thread safety of coroutine-sigaltstack, Laszlo Ersek, 2021/01/20
- Re: Thread safety of coroutine-sigaltstack, Paolo Bonzini, 2021/01/21
- Re: Thread safety of coroutine-sigaltstack, Daniel P . Berrangé, 2021/01/21