[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] blockjob: Fix crash with IOthread when block commit after sn
Re: [PATCH] blockjob: Fix crash with IOthread when block commit after snapshot
Fri, 12 Feb 2021 14:03:54 +0100
Am 03.02.2021 um 08:45 hat Vladimir Sementsov-Ogievskiy geschrieben:
> subject should start with [PATCH v5]
> 03.02.2021 05:40, firstname.lastname@example.org wrote:
> > From: Michael Qiu <email@example.com>
> > v5: reformat the commit log with backtrace of main thread
> > Add a boolean variable to make main thread could re-acquire
> > aio_context on success path.
> > v4: rebase to latest code
> > v3: reformat the commit log, remove duplicate content
> patch history shouldn't go into commit message. So you should place it
> under '---' [*], after calling git format-patch
> > Currently, if guest has workloads, IO thread will acquire aio_context
> > lock before do io_submit, it leads to segmentfault when do block commit
> > after snapshot. Just like below:
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0x7f7c7d91f700 (LWP 99907)]
> > 0x00005576d0f65aab in bdrv_mirror_top_pwritev at ../block/mirror.c:1437
> > 1437 ../block/mirror.c: No such file or directory.
> > (gdb) p s->job
> > $17 = (MirrorBlockJob *) 0x0
> > (gdb) p s->stop
> > $18 = false
> > Call trace of IO thread:
> > 0 0x00005576d0f65aab in bdrv_mirror_top_pwritev at ../block/mirror.c:1437
> > 1 0x00005576d0f7f3ab in bdrv_driver_pwritev at ../block/io.c:1174
> > 2 0x00005576d0f8139d in bdrv_aligned_pwritev at ../block/io.c:1988
> > 3 0x00005576d0f81b65 in bdrv_co_pwritev_part at ../block/io.c:2156
> > 4 0x00005576d0f8e6b7 in blk_do_pwritev_part at
> > ../block/block-backend.c:1260
> > 5 0x00005576d0f8e84d in blk_aio_write_entry at
> > ../block/block-backend.c:1476
> > ...
> > Switch to qemu main thread:
> > 0 0x00007f903be704ed in __lll_lock_wait at
> > /lib/../lib64/libpthread.so.0
> > 1 0x00007f903be6bde6 in _L_lock_941 at /lib/../lib64/libpthread.so.0
> > 2 0x00007f903be6bcdf in pthread_mutex_lock at
> > /lib/../lib64/libpthread.so.0
> > 3 0x0000564b21456889 in qemu_mutex_lock_impl at
> > ../util/qemu-thread-posix.c:79
> > 4 0x0000564b213af8a5 in block_job_add_bdrv at ../blockjob.c:224
> > 5 0x0000564b213b00ad in block_job_create at ../blockjob.c:440
> > 6 0x0000564b21357c0a in mirror_start_job at ../block/mirror.c:1622
> > 7 0x0000564b2135a9af in commit_active_start at ../block/mirror.c:1867
> > 8 0x0000564b2133d132 in qmp_block_commit at ../blockdev.c:2768
> > 9 0x0000564b2141fef3 in qmp_marshal_block_commit at
> > qapi/qapi-commands-block-core.c:346
> > 10 0x0000564b214503c9 in do_qmp_dispatch_bh at
> > ../qapi/qmp-dispatch.c:110
> > 11 0x0000564b21451996 in aio_bh_poll at ../util/async.c:164
> > 12 0x0000564b2146018e in aio_dispatch at ../util/aio-posix.c:381
> > 13 0x0000564b2145187e in aio_ctx_dispatch at ../util/async.c:306
> > 14 0x00007f9040239049 in g_main_context_dispatch at
> > /lib/../lib64/libglib-2.0.so.0
> > 15 0x0000564b21447368 in main_loop_wait at ../util/main-loop.c:232
> > 16 0x0000564b21447368 in main_loop_wait at ../util/main-loop.c:255
> > 17 0x0000564b21447368 in main_loop_wait at ../util/main-loop.c:531
> > 18 0x0000564b212304e1 in qemu_main_loop at ../softmmu/runstate.c:721
> > 19 0x0000564b20f7975e in main at ../softmmu/main.c:50
> > In IO thread when do bdrv_mirror_top_pwritev, the job is NULL, and stop
> > field
> > is false, this means the MirrorBDSOpaque "s" object has not been initialized
> > yet, and this object is initialized by block_job_create(), but the
> > initialize
> > process is stuck in acquiring the lock.
> > In this situation, IO thread come to bdrv_mirror_top_pwritev(),which means
> > that
> > mirror-top node is already inserted into block graph, but its
> > bs->opaque->job
> > is not initialized.
> > The root cause is that qemu main thread do release/acquire when hold the
> > lock,
> > at the same time, IO thread get the lock after release stage, and the crash
> > occured.
> > Actually, in this situation, job->job.aio_context will not equal to
> > qemu_get_aio_context(), and will be the same as bs->aio_context,
> > thus, no need to release the lock, becasue bdrv_root_attach_child()
> > will not change the context.
> > This patch fix this issue.
> > Fixes: 132ada80 "block: Adjust AioContexts when attaching nodes"
> > Signed-off-by: Michael Qiu <firstname.lastname@example.org>
> I feel like there may be more problems (like the fact that drained
> section should be expanded, and that expanding doesn't help as Michael
> said), but I think that temporary releasing locks is unsafe thing, and
> if we can avoid it for some cases it's good, especially if it fixes
> some bug:
Yeah, I don't like this patch much because it doesn't really fix the
bug, but it just restricts it to fewer cases. Whenever we add a node to
the job that is in a different AioContext than the job itself, we can
still run into similar problems.
Maybe we should actually make this an error case so that we never
release the lock.
In practice, I think all block jobs call block_job_create() with their
filter node, so the job will always be in the same AioContext and at
least things relating to new requests should never run into this case.
I also don't understand why draining doesn't work. This sounds a bit
concerning and probably deserved some more investigation.
Anyway, if all that remains is theoretical cases, I guess applying this
band-aid fix is better than not doing anything, so I'll apply it.