From: Vladimir Sementsov-Ogievskiy
Subject: Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Date: Mon, 21 Mar 2022 18:44:53 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

09.03.2022 16:26, Emanuele Giuseppe Esposito wrote:

On 02/03/2022 at 12:07, Vladimir Sementsov-Ogievskiy wrote:
01.03.2022 17:21, Emanuele Giuseppe Esposito wrote:
This series tries to provide a proof of concept and a clear explanation
of why we need to use drains (and more precisely subtree_drains)
to replace the AioContext lock, especially to protect the BlockDriverState
->children and ->parents lists.

Just a small recap on the key concepts:
* We split block layer APIs in "global state" (GS), "I/O", and
"global state or I/O".
    GS APIs run in the main loop, under the BQL, and are the only
    ones allowed to modify the BlockDriverState graph.

    I/O APIs are thread safe and can run in any thread

    "global state or I/O" are essentially all APIs that use
    BDRV_POLL_WHILE. This is because there can be only 2 threads
    that can use BDRV_POLL_WHILE: the main loop and the iothread that
    runs the AioContext.

* Drains allow the caller (either the main loop or the iothread running
the context) to wait for all in-flight requests and operations
on a BDS: normal drains target a given node and its parents, while
subtree ones also include the subgraph of the node. Siblings are
not affected by either of these two kinds of drains.
After bdrv_drained_begin, no more requests are allowed to come
from the affected nodes. Therefore the only actor left working
on a drained part of the graph should be the main loop.
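
(Purely for illustration, a minimal sketch of how the two existing drain
flavours are used; the helper name and the little graph are made up, while
the bdrv_*drained_* functions are the real API:)

    #include "block/block.h"   /* bdrv_drained_begin() and friends */

    /* Given roughly: parent -> bs -> child (bs is backed by child). */
    static void example_quiesce(BlockDriverState *bs)
    {
        bdrv_drained_begin(bs);          /* quiesces bs and its parents */
        /* child may still submit and complete requests here */
        bdrv_drained_end(bs);

        bdrv_subtree_drained_begin(bs);  /* quiesces bs, its parents and its
                                          * whole subgraph of children */
        /* this is the quiet window the series wants to rely on */
        bdrv_subtree_drained_end(bs);
    }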

What do we intend to do
-----------------------
We want to remove the AioContext lock. It is not 100% clear how
many things we are protecting with it, and why.
As a starter, we want to protect the BlockDriverState ->parents and
->children lists, since they are read by the main loop and I/O code but
only written by the main loop under the BQL. The function that modifies
these lists is bdrv_replace_child_common().
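
(For reference, a simplified view of the data involved; the field names
here are from memory and may not match block_int.h exactly:)

    /* One edge of the graph: a parent node pointing at a child node. */
    struct BdrvChild {
        BlockDriverState *bs;               /* the child node               */
        QLIST_ENTRY(BdrvChild) next;        /* entry in parent's ->children */
        QLIST_ENTRY(BdrvChild) next_parent; /* entry in bs->parents         */
        /* ... */
    };

    struct BlockDriverState {
        /* ... */
        QLIST_HEAD(, BdrvChild) children;   /* edges to our children  */
        QLIST_HEAD(, BdrvChild) parents;    /* edges from our parents */
    };

bdrv_replace_child_common() essentially unlinks a BdrvChild from the old
child's ->parents list and inserts it into the new child's one; that is
the write this series wants to protect from concurrent readers.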

How do we want to do it
-----------------------
We identified the subtree_drain API as the ideal substitute for the
AioContext lock. The reason is simple: draining prevents the iothread
from reading or writing the nodes, so once the main loop finishes
I'm not sure it's ideal. Unfortunately I'm not really good in all that
BQL, AioContext locks and drains. So, I can't give good advice. But here
are my doubts:

Draining is a very restrictive measure: even if the drained section is very
short, at least in bdrv_drained_begin() we have to wait for all current
requests, and not start new ones. That slows down the guest.
I don't think we are in a critical path where performance is important here.

At the
same time there are operations that don't require stopping guest IO
requests. For example, manipulation of dirty bitmaps - the QMP commands
block-dirty-bitmap-add and block-dirty-bitmap-remove. Or various query
requests.

Maybe you misunderstood or I was not 100% clear, but I am talking about replacing the
AioContext lock only for the ->parents and ->children lists, not everywhere. This
is the first step, and then we will see whether the additional things it protects can
use drain or something else.

I see only two real cases where we do need drain:

1. When we need a consistent "point-in-time". For example, when we start
backup in transaction with some dirty-bitmap manipulation commands.

2. When we need to modify the block graph: if we are going to break the relation
A -> B, there must not be any in-flight request that wants to use this
relation.
That's the use case I am considering.
For all other operations for which we want some kind of lock (like the
AioContext lock or something), we actually don't want to stop guest IO.
Yes, they have to be analyzed case by case.

Next, I have a problem in mind that in the past led to a lot of iotest 30
failures. There have been different fixes and improvements since, but the core
problem (as far as I understand) is still here: nothing protects us when,
in the middle of one graph modification process (for example block-job
finalization), we yield, switch to another coroutine and enter another
graph modification process (for example, another block-job finalization).
That's another point to consider. I don't really have a solution for this.

(for details look at my old "[PATCH RFC 0/5] Fix accidental crash in
iotest 30"
https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html  ,
where I suggested adding a global graph_modify_mutex CoMutex, to be held
during graph-modifying processes that may yield).
Does your proposal solve this problem?


executing bdrv_drained_begin() on the interested subgraph, we are sure that
the iothread is not going to look at or interfere with that part of
the graph.
We are also sure that the only two actors that can look at a specific
BlockDriverState in any given context are the main loop and the
iothread running the AioContext (ensured by the "global state or I/O" logic).
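
(To illustrate the shape of the replacement being proposed; this is a
sketch only, not actual patch code, and the helper name is made up:)

    /* bs is the node whose ->children / ->parents edges will change. */
    static void update_graph_sketch(BlockDriverState *bs)
    {
        /* today: the writer runs in the main loop, typically with the
         * node's AioContext acquired to keep readers away */
        AioContext *ctx = bdrv_get_aio_context(bs);
        aio_context_acquire(ctx);
        /* ... update bs->children / bs->parents ... */
        aio_context_release(ctx);

        /* proposed: quiesce the relevant subgraph instead */
        bdrv_subtree_drained_begin(bs);
        /* ... update bs->children / bs->parents ... */
        bdrv_subtree_drained_end(bs);
    }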

Why use _subtree_ instead of normal drain
-----------------------------------------
A simple drain "blocks" a given node and all its parents,
but it doesn't touch the children.
This means that if we use a simple drain, a child can always
keep processing requests, and eventually end up itself calling
bdrv_drained_begin, reading the parent nodes while the main loop
is modifying them. Therefore a subtree drain is necessary.
I'm not sure I understand. Could you give a more concrete example of
what may happen if we use a simple drain?
Who will call the bdrv_drained_begin() you are talking about? Some IO path in
a parallel thread? Do we really have an IO path that calls
bdrv_drained_begin()?
It is already being done in mirror, where it drains while running the .run() 
Job callback.

I made a unit test for this:
https://gitlab.com/eesposit/qemu/-/commit/1ffd7193ed441020f51aeeb49c3b07edb6949760

Assume that you have the following graph:

blk (blk) -> overlay (bs)
overlay (bs) --- backed by ---> backing (bs)
job1 (blockjob) --> backing

Then the main loop calls bdrv_drained_begin(overlay).
This means overlay and its parent blk are under drain, but backing and most
importantly the job are not.
At this point, the job run() function decides to drain. It should not be
allowed to read overlay's parents list, for example, but there is nothing
to prevent it from doing that, and thus we see drains_done > 0.

On the other side, with a subtree_drain the job is also drained and thus
it goes into pause.
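
(As a purely illustrative trace of the two cases, simplified from the
linked test:)

    /*
     *   main loop                        job1 coroutine (not drained)
     *   ------------------------------   ----------------------------------
     *   bdrv_drained_begin(overlay)
     *     -> overlay and blk quiesced;
     *        backing and job1 are not
     *                                    bdrv_drained_begin(backing)
     *                                      -> recurses through the parent
     *                                         lists (including overlay's)
     *                                         while the main loop may be
     *                                         modifying them
     *
     * With bdrv_subtree_drained_begin(overlay), backing is drained as well,
     * so job1 is paused before it can walk overlay->parents.
     */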

Makes sense?


As I understand:

You want to make the drained section a kind of lock, so that if we take this
lock, we can modify the graph and be sure that no other client will modify
it in parallel.

But drained sections can be nested. So to solve the problem you try to drain
more nodes: include the subtree for example, or maybe we need to drain the whole
graph connectivity component, or (to be safer) the whole block layer (to be
sure that during a drained section in one connectivity component some not-drained
block job from another connectivity component will not try to attach some node
from our drained connectivity component).

I still feel that draining is the wrong tool for isolating graph-modifying
operations from each other:

1. Drained sections can be nested, and natively that's not a kind of lock.
That's just a requirement to have no IO requests. There may be several clients
that want this condition on the same set of nodes.

2. Blocking IO on the whole connected subgraph, or even on the whole block layer
graph, is not necessary, so that's extra blocking.


Could we instead do the following:

1. Keep draining as is - a mechanism to stop IO on some nodes

2. To isolate graph-modifying operations, implement another mechanism: something
like a global queue, where clients wait until they get access to modify the
block layer.


This way, any graph modifying process would look like this:

1. drained_begin() (only where necessary, not the whole subgraph in general)

2. wait in the global queue

3. Ok, now we can do all the modifications

4. Kick the global queue, so that the next client gets access

5. drained_end()
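
A minimal sketch of what that pattern could look like, assuming a CoMutex
stands in for the "global queue" (the lock name and the helper are made up;
a CoMutex already queues waiting coroutines, which matches the wait/kick
steps):

    #include "qemu/coroutine.h"
    #include "block/block.h"

    static CoMutex graph_modify_lock;   /* qemu_co_mutex_init() once at startup */

    static void coroutine_fn modify_graph_sketch(BlockDriverState *bs)
    {
        bdrv_drained_begin(bs);                    /* 1. only where necessary     */
        qemu_co_mutex_lock(&graph_modify_lock);    /* 2. wait in the global queue */

        /* 3. do all the graph modifications here */

        qemu_co_mutex_unlock(&graph_modify_lock);  /* 4. next client gets access  */
        bdrv_drained_end(bs);                      /* 5. */
    }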


--
Best regards,
Vladimir


