qemu-devel

Re: [PATCH v2] block/stream: Drain subtree around graph change


From: Emanuele Giuseppe Esposito
Subject: Re: [PATCH v2] block/stream: Drain subtree around graph change
Date: Wed, 30 Mar 2022 11:40:42 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0


On 29/03/2022 at 14:15, Hanna Reitz wrote:
> On 29.03.22 11:55, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2022 11:54, Hanna Reitz wrote:
>>> On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
>>>> 28.03.2022 11:09, Hanna Reitz wrote:
>>>>> On 28.03.22 09:44, Hanna Reitz wrote:
>>>>>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>>>>>> When the stream block job cuts out the nodes between top and
>>>>>>>> base in
>>>>>>>> stream_prepare(), it does not drain the subtree manually; it
>>>>>>>> fetches the
>>>>>>>> base node, and tries to insert it as the top node's backing node
>>>>>>>> with
>>>>>>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will
>>>>>>>> drain, and so
>>>>>>>> the actual base node might change (because the base node is
>>>>>>>> actually not
>>>>>>>> part of the stream job) before the old base node passed to
>>>>>>>> bdrv_set_backing_hd() is installed.
>>>>>>>>
>>>>>>>> This has two implications:
>>>>>>>>
>>>>>>>> First, the stream job does not keep a strong reference to the
>>>>>>>> base node.
>>>>>>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>>>>>>> because some other block job is drained to finish), we will get a
>>>>>>>> use-after-free.  We should keep a strong reference to that node.
>>>>>>>>
>>>>>>>> Second, even with such a strong reference, the problem remains
>>>>>>>> that the
>>>>>>>> base node might change before bdrv_set_backing_hd() actually
>>>>>>>> runs and as
>>>>>>>> a result the wrong base node is installed.
>>>>>>>
>>>>>>> Hmm.
>>>>>>>
>>>>>>> So, we don't really need a strong reference: if it helps to
>>>>>>> avoid some use-after-free, it means that we'll end up with a
>>>>>>> wrong block graph.
>>>>>>
>>>>>> Sure.  But I found it better style to strongly reference a node
>>>>>> while it’s used.  I’d rather have an outdated block graph (as in:
>>>>>> A node that was supposed to disappear would still be in use) than
>>>>>> a use-after-free.
>>>>>>
>>>>>>> Graph modifying operations must be somehow isolated from each other.
>>>>>>>
>>>>>>>>
>>>>>>>> Both effects can be seen in 030's
>>>>>>>> TestParallelOps.test_overlapping_5()
>>>>>>>> case, which has five nodes, and simultaneously streams from the
>>>>>>>> middle
>>>>>>>> node to the top node, and commits the middle node down to the
>>>>>>>> base node.
>>>>>>>> As it is, this will sometimes crash, namely when we encounter the
>>>>>>>> above-described use-after-free.


Interesting. This test was badly crashing when I was applying the patch
"[PATCH 6/6] jobs: ensure sleep in job_sleep_ns is fully performed"

https://patchew.org/QEMU/20220208153655.1251658-1-eesposit@redhat.com/20220208153655.1251658-7-eesposit@redhat.com/

I could not find the root cause of this, but I remember that
test_overlapping_5 and test_stream_parallel were sometimes failing.
I even tried running the tests on the main branch and could not
reproduce the random failure you describe above.


>>>>>>>>
>>>>>>>> Taking a strong reference to the base node, we no longer get a
>>>>>>>> crash,
>>>>>>>> but the resulting block graph is less than ideal: The expected
>>>>>>>> result is
>>>>>>>> obviously that all middle nodes are cut out and the base node is
>>>>>>>> the
>>>>>>>> immediate backing child of the top node.  However, if
>>>>>>>> stream_prepare()
>>>>>>>> takes a strong reference to its base node (the middle node), and
>>>>>>>> then
>>>>>>>> the commit job finishes in bdrv_set_backing_hd(), supposedly
>>>>>>>> dropping
>>>>>>>> that middle node, the stream job will just reinstall it again.
>>>>>>>>
>>>>>>>> Therefore, we need to keep the whole subtree drained in
>>>>>>>> stream_prepare(), so that the graph modification it performs is
>>>>>>>> effectively atomic, i.e. that the base node it fetches is still
>>>>>>>> the base
>>>>>>>> node when bdrv_set_backing_hd() sets it as the top node's
>>>>>>>> backing node.
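
(For illustration, a rough sketch of the approach described above,
modelled on block/stream.c; error handling and the copy-on-read filter
handling are omitted, so this is not the verbatim patch:)

    /* Simplified sketch of stream_prepare() with the proposed subtree
     * drain; the COR filter and backing-file string handling are left
     * out. */
    static int stream_prepare(Job *job)
    {
        StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
        BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
        BlockDriverState *base;
        Error *local_err = NULL;
        int ret = 0;

        /* Quiesce everything below above_base so that no concurrently
         * finishing job can change the graph between fetching base and
         * installing it. */
        bdrv_subtree_drained_begin(s->above_base);

        base = bdrv_filter_or_cow_bs(s->above_base);
        if (base) {
            bdrv_ref(base);  /* strong reference: base must not be freed */
        }

        if (bdrv_cow_child(unfiltered_bs)) {
            bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
            if (local_err) {
                error_report_err(local_err);
                ret = -EPERM;
            }
        }

        bdrv_unref(base);
        bdrv_subtree_drained_end(s->above_base);
        return ret;
    }

The important point is that the drained section spans both the lookup
of base and the call to bdrv_set_backing_hd(), and that the
bdrv_ref()/bdrv_unref() pair keeps base alive even if a nested drain
finishes another job.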
>>>>>>>
>>>>>>> Emanuele has a similar idea of isolating graph changes from each
>>>>>>> other by subtree-drain.
>>>>>>>
>>>>>>> If I understand correctly, the idea is that we'll drain all other
>>>>>>> block jobs, so they wouldn't do their block-graph modifications
>>>>>>> during the drained section. So, we can safely modify the graph.
>>>>>>>
>>>>>>> I don't like this idea:
>>>>>>>
>>>>>>> 1. drained section = stop IO. But we don't need to stop IO in the
>>>>>>> whole subtree to do a needed block-graph modification.
>>>>>>
>>>>>> If you mean to say that draining just the single node should be
>>>>>> sufficient, I’ll be happy to change it.
>>>>>>
>>>>>> Not sure which node, though, because I’d think it would be `base`,
>>>>>> but to safely fetch it I’d need to drain it, which seems to bite
>>>>>> itself in the tail.  That’s why I went for a subtree drain from
>>>>>> `above_base`.
>>>>>>
>>>>>>> 2. A drained section is not a lock; several clients may drain the
>>>>>>> same set of nodes. So we exploit the fact that concurrent clients
>>>>>>> will be paused by the drained section and won't proceed to
>>>>>>> graph-modification code. But are we sure that block jobs are (and
>>>>>>> will be?) the only concurrent block-graph-modifying clients? Can
>>>>>>> QMP commands interleave somehow?
>>>>>>
>>>>>> They can under very specific circumstances and that’s a bug. See
>>>>>> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html
>>>>>> .
>>>>>>
>>>>>>> Can some job from another subtree start a block-graph modification
>>>>>>> that touches our subtree?
>>>>>>
>>>>>> That would be wrong.  A block job shouldn’t change nodes it
>>>>>> doesn’t own; stream doesn’t own the base, but it also doesn’t
>>>>>> change it, it only needs to have the top node point to it.
>>>>>>
>>>>>>> If we go this way, it would be safer to drain the whole block
>>>>>>> graph on any block-graph modification.
>>>>>>>
>>>>>>> I think we'd better have a separate global mechanism for
>>>>>>> isolating graph modifications. Something like a global co-mutex
>>>>>>> or queue, where clients wait for their turn in block graph
>>>>>>> modifications.
>>>>>>>
>>>>>>> Here is my old proposal on that topic:
>>>>>>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
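
(Roughly, such a mechanism could look like the following sketch; the
names below are invented for illustration and do not necessarily match
what the linked series actually does:)

    /* Hypothetical global graph-modification lock; the names are made
     * up for illustration and are not existing QEMU API. */
    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"

    static CoMutex graph_mod_mutex;

    static void graph_mod_lock_init(void)
    {
        qemu_co_mutex_init(&graph_mod_mutex);
    }

    /* Every client that wants to change the block graph takes the lock
     * around its modification, so concurrent changers wait for their
     * turn instead of interleaving at yield points. */
    static void coroutine_fn bdrv_graph_mod_lock(void)
    {
        qemu_co_mutex_lock(&graph_mod_mutex);
    }

    static void coroutine_fn bdrv_graph_mod_unlock(void)
    {
        qemu_co_mutex_unlock(&graph_mod_mutex);
    }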
>>>>>>>
>>>>>>
>>>>>> That would only solve the very specific issue in 030, right?
>>>>
>>>> It should solve the same issue as your patch. You don't add
>>>> subtree_drain around every graph modification... Or do we already have it?
>>>
>>> Well, I’m not saying it will solve every single bug, but draining in
>>> stream_prepare() will at least mean that that is safe from basically
>>> anything else, because it will prevent concurrent automatic graph
>>> changes (e.g. because of jobs finishing), and QMP commands shouldn’t
>>> be executed in drained sections either (when they are, it's wrong, but
>>> that seems to occur only extremely rarely).  Draining alone should
>>> make a place safe, it isn’t a lock that all sides need to take.
>>>
>>>>>>   The stream job isn’t protected from any graph modifications but
>>>>>> those coming from mirror.  Might be a solution going forward (I
>>>>>> didn’t look closer at it at the time, given I saw you had a
>>>>>> discussion with Kevin), if we lock every graph change operation
>>>>>> (though a global lock honestly doesn’t sound strictly better than
>>>>>> draining subsections of the graph, both have their drawbacks), but
>>>>>> that doesn’t look like it’d be something for 7.1.
>>>>
>>>> In the same way, with the draining solution you should do a
>>>> subtree-drain for every graph change operation.
>>>
>>> Since we don’t have any lock yet, draining is the de-facto way we use
>>> to forbid concurrent graph modifications.  I’m not saying we use it
>>> correctly and thoroughly, but it is what we do right now.
>>>
>>>>
>>>>>
>>>>> I wonder whether we could have a short-term version of
>>>>> `BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set,
>>>>> you just can’t change the graph, and you also can’t wait, so that’s
>>>>> just an error.  But if `.frozen_lock` is set, you can wait on it.
>>>>> Here, we’d keep `.frozen` set for all links between top and
>>>>> above_base, and then in prepare() take `.frozen_lock` on the link
>>>>> between above_base and base.
>>>>>
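
(In code, this could look roughly like the following; it is purely
hypothetical, since BdrvChild today only has the boolean .frozen and no
.frozen_lock:)

    /* Purely hypothetical sketch: BdrvChild.frozen exists in QEMU,
     * .frozen_lock does not.  A stand-in struct is used here instead of
     * modifying the real BdrvChild. */
    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"

    typedef struct HypotheticalChild {
        bool frozen;          /* set: any graph change is a hard error */
        CoMutex frozen_lock;  /* held: graph changes wait for their turn */
    } HypotheticalChild;

    static void coroutine_fn change_child_link(HypotheticalChild *c)
    {
        /* Links between top and above_base would stay frozen, as today. */
        assert(!c->frozen);

        /* The link between above_base and base would only be locked, so
         * a concurrent modifier waits here instead of failing. */
        qemu_co_mutex_lock(&c->frozen_lock);
        /* ... perform the actual graph modification ... */
        qemu_co_mutex_unlock(&c->frozen_lock);
    }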
>>>>
>>>> Yes, that seems to be an alternative to a global lock that doesn't
>>>> block the whole graph. Still, I don't think it is bad to lock the
>>>> whole graph for graph modification, as modifications should be rare
>>>> and fast.
>>>
>>> Fair enough.
>>>
>>>> Another thought: does subtree-drain really drain the whole
>>>> connectivity component of the graph?
>>>>
>>>> imagine something like this:
>>>>
>>>> [A]  [   C  ]
>>>>  |    |    |
>>>>  v    v    v
>>>> [ B    ]  [ D ]
>>>>
>>>>
>>>> If we do subtree drain at node A, this will drain B and C, but not D..
>>>>
>>>> Imagine another job is attached to node D, and it starts a drained
>>>> section too. So, for example, both jobs will share a drained section
>>>> on node C. That doesn't seem safe, and draining is not a lock.
>>>>
>>>> So, if we are going to rely on the drained section as a lock that
>>>> isolates graph modifications from each other, we should drain the
>>>> whole connectivity component of the graph.
>>>
>>> The drained section is not a lock, but if the other job is only
>>> attached to node D, it won’t change node C.  It might change the link
>>> from C to D, but that doesn’t concern the job that is concerned about
>>> A and B. Overlapping drains are fine.
>>
>> OK. Maybe it works. It's just not obvious to me that subtree_drain
>> works well in all cases. And a global graph-modification lock should
>> obviously work.
>>
>>>
>>>> Next, I'm not really sure that two jobs can simultaneously enter a
>>>> drained section and do graph modifications. What prevents this?
>>>> Assume two block-stream jobs reach their finish simultaneously and
>>>> go to subtree-drain. That just means that job_pause will be called
>>>> for both jobs. But what does that mean for the block-stream job that
>>>> is in the bdrv_subtree_drained_begin() call in stream_prepare()?
>>>> Seemingly nothing? Then both jobs will start the graph modification
>>>> process simultaneously and can interleave at any yield point (for
>>>> example, rewriting backing_file in qcow2 metadata).
>>>
>>> So I don’t think that scenario can really happen, because the stream
>>> job freezes the chain between above_base and top, so you can’t really
>>> have two simultaneous stream jobs that cause problems between each
>>> other.
>>
>> They cause problems on the boundary, as the base of one stream job may
>> be the top of another, and we also have a filter that should be
>> inserted/removed at some point. As I remember, that's the problematic
>> case in 030.
>>
>>>
>>> Furthermore, the prepare() functions are run in the main thread, so
>>> the only real danger is actually that draining around the actual
>>> graph modification (bdrv_set_backing_hd()) causes another block job
>>> to finish, modifying the block graph.  But then that job will also
>>> actually finish, because it’s all in the main thread.
>>>
>>> It is true that child_job_drained_poll() says that jobs that are about
>>> to prepare() are quiesced, but I don’t think that’s a problem, given
>>> that all jobs in that state run in the main thread.
>>>
>>>>
>>>> Another reason why I think that subtree drain is the wrong tool, as I
>>>> said, is the extra IO stop.
>>>
>>> I know and agree, but that’s an optimization question.
>>>
>>>> Imagine the following graph:
>>>>
>>>> [A]
>>>>  |
>>>>  v
>>>> [B] [C]
>>>>  |   |
>>>>  v   v
>>>> [base]
>>>>
>>>> If we want to rebase A onto base, we actually only need to stop IO
>>>> requests in A and B. Why should C suffer from this graph
>>>> modification? IO requests produced by C, which live in C and in
>>>> base, don't intersect with the process of rebasing A onto base in
>>>> any way.
>>>>
>>>> ====
>>>>
>>>> Actually, I'm not strictly against your patch, and I believe that it
>>>> fixes the problem in most cases. And it's probably OK in the short
>>>> term. The only real doubt about including it now is that drained
>>>> sections sometimes lead to deadlocks: is it possible that we are now
>>>> fixing a bug that happens only in iotest 030 (or is it reported
>>>> somewhere?) while risking the introduction of some deadlock?
>>>
>>> Saying that the example in 030 is contrived would mean we
>>> could/should re-include the base into the list of nodes that belong
>>> to the stream job, which would simply disallow the case in 030 that’s
>>> being tested and fails.
>>>
>>> Then we don’t need a subtree drain, and the plain drain in
>>> bdrv_set_backing_hd() would be fine.
>>>
>>>> It seems that if in some code it's safe to call drained_begin(), it
>>>> should be safe to call subtree_drained_begin(). And if that triggers
>>>> some deadlock, it just shows another bug. Is it worth fixing now,
>>>> close to the 7.0 release? We have lived with this bug for years. Or
>>>> has something changed that I missed?
>>>
>>> I mean...  I can understand your concern that adding a subtree drain
>>> has performance implications (when a stream job ends, which shouldn’t
>>> be often).  But I’m not sure whether I should share the deadlock
>>> concern. Sounds like a sad state of affairs if I can’t just drain
>>> something when I need it to be drained.
>>>
>>> I wasn’t aware of this bug, actually.  Now I am, and I feel rather
>>> uncomfortable living with a use-after-free bug, because that’s on the
>>> worse end of the bug spectrum.
>>>
>>
>> OK, I don't know. :) And others are silent. I agree that the
>> global-lock solution is not for 7.0 anyway. And this one is simple
>> enough. So, take my
>>
>> Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
> 
> Thanks!
> 
> A global lock solution sounds good to me for 7.1+, and it even provides
> a solution to the problem of QMP commands being executed while other
> main-thread code is running.  (I mean, the QMP command would still be
> executed, but at least concurrent graph changes would be impossible.)
> 

Discussion about subtree and this "global lock" is here:
https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/

@Hanna if you can give it a look and jump into the party as well, that
would be nice.
Maybe what Kevin proposed is what we need instead of subtree drains.

"Global lock" sounds like slightly wrong naming, by the way, as it seems
that we are just re-implementing the AioContext lock.

Emanuele



