qemu-block


From: Hanna Reitz
Subject: Re: [PATCH] block: Fix BB.root changing across bdrv_next()
Date: Fri, 11 Mar 2022 16:20:09 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

On 01.03.22 18:39, Hanna Reitz wrote:
bdrv_next() has no guarantee that its caller has stopped all block graph
operations; for example, bdrv_flush_all() does not.

The latter can actually provoke such operations, because its
bdrv_flush() call, which runs a coroutine (bdrv_co_flush()), may run
this coroutine in a different AioContext than the main one, and then
when this coroutine is done and invokes aio_wait_kick(), the monitor may
get a chance to run and start executing some graph-modifying QMP
command.

One example of this is when the VM encounters an I/O error on a block
device and stops, triggering a bdrv_flush_all(), while a blockdev-mirror
is started simultaneously on a block node in an I/O thread.  When
bdrv_flush_all() comes to that node[1] and flushes it, the
aio_wait_kick() at the end of bdrv_co_flush_entry() may cause the
monitor to process the mirror request, and mirror_start_job() will then
replace the node by a mirror filter node, before bdrv_flush_all()
resumes and can invoke bdrv_next() again to continue iterating.

[1] Say there is a BlockBackend on top of the node in question, and so
bdrv_next() finds that BB and returns the node as the BB's blk_bs().
bdrv_next() will bdrv_ref() the node such that it remains valid through
bdrv_flush_all()'s iteration, and drop the reference when it is called
the next time.

The problem is that bdrv_next() does not store to which BDS it retains a
strong reference when the BDS is a BB's child, so on the subsequent
call, it will just invoke blk_bs() again and bdrv_unref() the returned
node -- but as the example above shows, this node might be a different
one than the one that was bdrv_ref()-ed before.  This can lead to a
use-after-free (for the mirror filter node in our example), because this
negligent bdrv_unref() would steal a strong reference from someone else.

We can solve this problem by always storing the returned (and strongly
referenced) BDS in BdrvNextIterator.bs.  When we want to drop the strong
reference of a BDS previously returned, always drop BdrvNextIterator.bs
instead of using other ways of trying to figure out what that BDS was
that we returned last time.

So a week ago, Kevin and I talked about this on IRC, and he was rather apprehensive of this approach, because (1) it fixes what is probably a high-level problem in one specific low-level place, and (2) it’s not even quite clear whether this specific problem is really fixed.

(For (2): If bdrv_next() can cope with graph changes, then if such a change occurs during bdrv_flush_all(), it isn’t entirely clear whether we’ve truly iterated over all nodes and flushed them all.)

I’ve put a more detailed description of what I think is happening step by step here: https://bugzilla.redhat.com/show_bug.cgi?id=2058457#c7

So, the question came up whether we shouldn’t put bdrv_flush_all() into a drained section (there is a bdrv_drain_all() before it, but it’s not a section), and ensure that no QMP commands can be executed in drained sections.  I fiddled around a bit, wanting to send an extremely rough RFC to see whether the direction I was going in made any sense at all, but I’m not really making progress:

I basically wanted to introduce an rwlock for QMP request processing and take it as a reader while we’re in a drained section.  That doesn’t work so well, though: when a QMP command (which holds the rwlock as a writer) uses drain (trying to take it as a reader), there’s a deadlock.  I don’t really have a good idea for how to resolve this, because if running QMP commands during drains is forbidden, then, well, a QMP command cannot use drain.  We’d need some way to identify that a drain originates from the currently running QMP command, and allow that one, but I really don’t know how.


While looking into the STOP behavior further, something else came up.  Summarizing part of the comment linked above: part of the problem is that vm_stop() runs, and concurrently the guest device requests another STOP; therefore, the next main loop iteration will try to STOP again.  That seems to be why the monitor makes some progress during one main loop iteration, and why an unfortunate sequence of polling in the next iteration can then lead to the monitor processing a QMP command.

So other things to consider could be whether we should ensure that the monitor is not in danger of processing QMP requests when main_loop_should_exit() runs, e.g. by polling until it’s back at the top of its main loop (in monitor_qmp_dispatcher_co()).

Or whether we could make qemu_system_vmstop_request() prevent double stop requests, by forbidding STOP requests while a previous STOP request has not yet been completely processed (i.e. accepting another one only once the VM has been stopped one time).

The simplest way to fix the specific issue I’m seeing would be to pull do_vm_stop()’s bdrv_drain_all()+bdrv_flush_all() calls into its conditional path, i.e. to only perform them if the VM had not already been stopped.  (I don’t think we need to flush here if the VM was already stopped before.  I might be wrong, though.)  I doubt that’s a great solution, but at least it would be simple.

Hanna
