On 20.01.21 15:44, Max Reitz wrote:
> On 20.01.21 15:34, Max Reitz wrote:
> [...]
>> From a glance, it looks to me like two coroutines are created
>> simultaneously in two threads, and so one thread sets up a special
>> SIGUSR2 action, then another reverts SIGUSR2 to the default, and then
>> the first one kills itself with SIGUSR2.
>> Not sure what this has to do with backup, though it is interesting
>> that backup_loop() runs in two threads. So perhaps some AioContext
>> problem.
> Oh, 256 runs two backups concurrently. So it isn’t that interesting,
> but perhaps part of the problem still. (I have no idea, still looking.)
So this is what I found out:
coroutine-sigaltstack, when creating a new coroutine, sets up a signal
handler for SIGUSR2, then kills itself with SIGUSR2, then uses the
signal handler context (with a sigaltstack) for the new coroutine, and
then (once the signal handler has returned after a sigsetjmp()) restores
the old SIGUSR2 behavior.
What I fail to understand is how this is thread-safe. Setting up signal
handlers is a process-wide action. When one thread changes what SIGUSR2
does, this will affect all threads immediately, so when two threads run
coroutine-sigaltstack’s qemu_coroutine_new() concurrently, and one
thread reverts to the default action before the other has SIGUSR2’ed
itself, that later SIGUSR2 will kill the whole process.
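To make the "process-wide" part concrete, here is a small sketch (all
demo_* names are hypothetical, nothing here is QEMU code): a handler
installed with sigaction() in one thread is what another thread sees
when it queries SIGUSR2 afterwards. No signal is sent, so this only
demonstrates the shared state, not the crash itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* Dummy handler; it is never invoked in this sketch. */
static void demo_handler(int sig)
{
    (void)sig;
}

/* Thread body: installs demo_handler for SIGUSR2 and exits. */
static void *demo_install_in_thread(void *arg)
{
    struct sigaction sa;

    (void)arg;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);
    return NULL;
}

/* Returns 1 if the calling thread observes the handler that the other
 * thread installed, 0 if it does not, and -1 if no thread could be
 * created. */
int demo_disposition_is_shared(void)
{
    struct sigaction before, seen;
    pthread_t t;
    int shared;

    /* Remember the current action so it can be restored at the end. */
    sigaction(SIGUSR2, NULL, &before);

    if (pthread_create(&t, NULL, demo_install_in_thread, NULL) != 0) {
        return -1;
    }
    pthread_join(t, NULL);

    /* Query only (act == NULL): which action does this thread see now? */
    sigaction(SIGUSR2, NULL, &seen);
    shared = (seen.sa_handler == demo_handler);

    sigaction(SIGUSR2, &before, NULL);
    return shared;
}
```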
(I suppose it gets even more interesting when one thread has set up the
sigaltstack, then the other sets up its own sigaltstack, and then both
kill themselves with SIGUSR2, so both coroutines get the same stack...)
I have no idea why this has never been hit before, but it makes sense
that block-copy backup exposes it: It creates 64+x coroutines in a
very short time span, and 256 makes it do so in two threads concurrently
(thanks to launching two backups in two AioContexts in a transaction).
So... Looks to me like a bug in coroutine-sigaltstack. Not sure what
to do now, though. I don’t think we can use block-copy for backup
before that backend is fixed. (And that is assuming that it’s indeed
coroutine-sigaltstack’s fault.)
I’ll try to add some locking, see what it does, and send a mail
concerning coroutine-sigaltstack to qemu-devel.
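As a rough idea of what "some locking" could look like, here is a
sketch under my own assumptions (the demo_* names, the thread count and
the iteration count are all made up, and this is not a patch): a single
global mutex is held across "install handler, signal self, restore old
action", so no thread can restore the old action while another thread
is still about to signal itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

#define DEMO_THREADS 4     /* hypothetical number of concurrent threads */
#define DEMO_ITERS   1000  /* hypothetical iterations per thread */

/* Serializes every change to the process-wide SIGUSR2 action. */
static pthread_mutex_t demo_sigusr2_lock = PTHREAD_MUTEX_INITIALIZER;

/* Only touched while demo_sigusr2_lock is held (the handler runs inside
 * raise(), i.e. inside the critical section). */
static volatile sig_atomic_t demo_handler_runs;

static void demo_count_handler(int sig)
{
    (void)sig;
    demo_handler_runs++;
}

static void *demo_worker(void *arg)
{
    struct sigaction sa, old_sa;
    sigset_t usr2;

    (void)arg;
    /* Make sure SIGUSR2 is deliverable in this thread. */
    sigemptyset(&usr2);
    sigaddset(&usr2, SIGUSR2);
    pthread_sigmask(SIG_UNBLOCK, &usr2, NULL);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_count_handler;
    sigfillset(&sa.sa_mask);

    for (int i = 0; i < DEMO_ITERS; i++) {
        pthread_mutex_lock(&demo_sigusr2_lock);
        sigaction(SIGUSR2, &sa, &old_sa);   /* install handler */
        raise(SIGUSR2);                     /* handler runs in this thread */
        sigaction(SIGUSR2, &old_sa, NULL);  /* restore previous action */
        pthread_mutex_unlock(&demo_sigusr2_lock);
    }
    return NULL;
}

/* Returns the number of handler invocations, or -1 if a thread could
 * not be created. */
int demo_run_locked(void)
{
    pthread_t threads[DEMO_THREADS];
    int created = 0;

    demo_handler_runs = 0;
    while (created < DEMO_THREADS &&
           pthread_create(&threads[created], NULL, demo_worker, NULL) == 0) {
        created++;
    }
    for (int i = 0; i < created; i++) {
        pthread_join(threads[i], NULL);
    }
    return created == DEMO_THREADS ? (int)demo_handler_runs : -1;
}
```

With the lock in place, every raise() happens while the handler is
installed, so the handler should run exactly DEMO_THREADS * DEMO_ITERS
times. Without it, a thread can restore the default action right before
another thread signals itself, and the default action for SIGUSR2
terminates the process.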