[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH] AioContext: fix broken ctx->dispatching optimiz
From: |
Stefan Hajnoczi |
Subject: |
Re: [Qemu-devel] [PATCH] AioContext: fix broken ctx->dispatching optimization |
Date: |
Thu, 16 Jul 2015 11:17:35 +0100 |
User-agent: |
Mutt/1.5.23 (2014-03-12) |
On Wed, Jul 15, 2015 at 07:13:23PM +0200, Paolo Bonzini wrote:
> This patch rewrites the ctx->dispatching optimization, which was the cause
> of some mysterious hangs that could be reproduced on aarch64 KVM only.
> The hangs were indirectly caused by aio_poll() and in particular by
> flash memory updates's call to blk_write(), which invokes aio_poll().
> Fun stuff: they had an extremely short race window, so much that
> adding all kind of tracing to either the kernel or QEMU made it
> go away (a single printf made it half as reproducible).
>
> On the plus side, the failure mode (a hang until the next keypress)
> made it very easy to examine the state of the process with a debugger.
> And there was a very nice reproducer from Laszlo, which failed pretty
> often (more than half of the time) on any version of QEMU with a non-debug
> kernel; it also failed fast, while still in the firmware. So, it could
> have been worse.
>
> For some unknown reason they happened only with virtio-scsi, but
> that's not important. It's more interesting that they disappeared with
> io=native, making thread-pool.c a likely suspect for where the bug arose.
> thread-pool.c is also one of the few places which use bottom halves
> across threads, by the way.
>
> I hope that no other similar bugs exist, but just in case :) I am
> going to describe how the successful debugging went... Since the
> likely culprit was the ctx->dispatching optimization, which mostly
> affects bottom halves, the first observation was that there are two
> qemu_bh_schedule() invocations in the thread pool: the one in the aio
> worker and the one in thread_pool_completion_bh. The latter always
> causes the optimization to trigger, the former may or may not. In
> order to restrict the possibilities, I introduced new functions
> qemu_bh_schedule_slow() and qemu_bh_schedule_fast():
>
> /* qemu_bh_schedule_slow: */
> ctx = bh->ctx;
> bh->idle = 0;
> if (atomic_xchg(&bh->scheduled, 1) == 0) {
> event_notifier_set(&ctx->notifier);
> }
>
> /* qemu_bh_schedule_fast: */
> ctx = bh->ctx;
> bh->idle = 0;
> assert(ctx->dispatching);
> atomic_xchg(&bh->scheduled, 1);
>
> Notice how the atomic_xchg is still in qemu_bh_schedule_slow(). This
> was already debated a few months ago, so I assumed it to be correct.
> In retrospect this was a very good idea, as you'll see later.
>
> Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
> trigger the assertion (as expected). Changing the worker's invocation
> to qemu_bh_schedule_slow() didn't hide the bug (another assumption
> which luckily held). This already limited heavily the amount of
> interaction between the threads, hinting that the problematic events
> must have triggered around thread_pool_completion_bh().
>
> As mentioned early, invoking a debugger to examine the state of a
> hung process was pretty easy; the iothread was always waiting on a
> poll(..., -1) system call. Infinite timeouts are much rarer on x86,
> and this could be the reason why the bug was never observed there.
> With the buggy sequence more or less resolved to an interaction between
> thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was
> to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
> that the ordering of aio_ctx_prepare(), aio_ctx_dispatch, poll() and
> qemu_bh_schedule_fast() would provide some hint. The output was:
>
> (gdb) p last_prepare
> $3 = 103885451
> (gdb) p last_dispatch
> $4 = 103876492
> (gdb) p last_poll
> $5 = 115909333
> (gdb) p last_schedule
> $6 = 115925212
>
> Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
> This makes little sense unless there is an aio_poll() call involved,
> and indeed with a slightly different instrumentation you can see that
> there is one:
>
> (gdb) p last_prepare
> $3 = 107569679
> (gdb) p last_dispatch
> $4 = 107561600
> (gdb) p last_aio_poll
> $5 = 110671400
> (gdb) p last_schedule
> $6 = 110698917
>
> So the scenario becomes clearer:
>
> iothread VCPU thread
> --------------------------------------------------------------------------
> aio_ctx_prepare
> aio_ctx_check
> qemu_poll_ns(timeout=-1)
> aio_poll
> aio_dispatch
> thread_pool_completion_bh
> qemu_bh_schedule()
>
> At this point bh->scheduled = 1 and the iothread has not been woken up.
> The solution must be close, but this alone should not be a problem,
> because the bottom half is only rescheduled to account for rare situations
> (see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
> calls, 2014-07-15).
>
> Introducing a third thread---a thread pool worker thread, which
> also does qemu_bh_schedule()---does bring out the problematic case.
> The third thread must be awakened *after* the callback is complete and
> thread_pool_completion_bh has redone the whole loop, explaining the
> short race window. And then this is what happens:
>
> thread pool worker
> --------------------------------------------------------------------------
> <I/O completes>
> qemu_bh_schedule()
>
> Tada, bh->scheduled is already 1, so qemu_bh_schedule() does nothing
> and the iothread is never woken up. This is where the bh->scheduled
> optimization comes into play---it is correct, but removing it would
> have masked the bug.
>
> So, what is the bug?
>
> Well, the question asked by the ctx->dispatching optimization ("is any
> active aio_poll dispatching?") was wrong. The right question to ask
> instead is "is any active aio_poll *not* dispatching", i.e. in the prepare
> or poll phases? In that case, the aio_poll is sleeping or might go to
> sleep anytime soon, and the EventNotifier must be invoked to wake
> it up.
>
> In any other case (including if there is *no* active aio_poll at all!)
> we can just wait for the next prepare phase to pick up the event (e.g. a
> bottom half); the prepare phase will avoid the blocking and service the
> bottom half.
>
> Expressing the invariant with a logic formula, the broken one looked like:
>
> !(exists(thread): in_dispatching(thread)) => !optimize
>
> or equivalently:
>
> !(exists(thread):
> in_aio_poll(thread) && in_dispatching(thread)) => !optimize
>
> In the correct one, the negation is in a slightly different place:
>
> (exists(thread):
> in_aio_poll(thread) && !in_dispatching(thread)) => !optimize
>
> or equivalently:
>
> (exists(thread): in_prepare_or_poll(thread)) => !optimize
>
> Even if the difference boils down to moving an exclamation mark :)
> the implementation is quite different. However, I think the new
> one is simpler to understand.
>
> In the old implementation, the "exists" was implemented with a boolean
> value. This didn't really support well the case of multiple concurrent
> event loops, but I thought was that this was okay: aio_poll holds the
> AioContext lock so there cannot be concurrent aio_poll invocations, and
> I was just considering nested event loops. However, aio_poll _could_
> indeed be concurrent with the GSource. This is why I came up with the
> wrong invariant.
>
> In the new implementation, "exists" is computed simply by counting how many
> threads are in the prepare or poll phases. There are some interesting
> points to consider, but the gist of the idea remains:
>
> 1) AioContext can be used through GSource as well; as mentioned in the
> patch, bit 0 of the counter is reserved for the GSource.
>
> 2) the counter need not be updated for a non-blocking aio_poll, because
> it won't sleep forever anyway. This is just a matter of checking
> the "blocking" variable. This requires some changes to the win32
> implementation, but is otherwise not too complicated.
>
> 3) as mentioned above, the new implementation will not call aio_notify
> when there is *no* active aio_poll at all. The tests have to be
> adjusted for this change. The calls to aio_notify in async.c are fine;
> they only want to kick aio_poll out of a blocking wait, but need not
> do anything if aio_poll is not running.
>
> 4) nested aio_poll: these just work with the new implementation; when
> a nested event loop is invoked, the outer event loop is never in the
> prepare or poll phases. The outer event loop thus has already decremented
> the counter.
>
> Reported-by: Richard W. M. Jones <address@hidden>
> Reported-by: Laszlo Ersek <address@hidden>
> Signed-off-by: Paolo Bonzini <address@hidden>
> ---
> aio-posix.c | 19 ++++++------
> aio-win32.c | 33 ++++++++++-----------
> async.c | 21 +++++---------
> docs/aio_notify.promela | 77
> +++++++++++++++++++++----------------------------
> include/block/aio.h | 29 +++++++++++++++----
> tests/test-aio.c | 26 +++--------------
> 6 files changed, 91 insertions(+), 114 deletions(-)
Looks good for QEMU 2.4:
Reviewed-by: Stefan Hajnoczi <address@hidden>
I'll merge it for the next -rc unless there is any more discussion.
pgphFRJAbDKEl.pgp
Description: PGP signature