Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex

From: Vladimir Sementsov-Ogievskiy
Subject: Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex
Date: Tue, 13 Jul 2021 18:18:39 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

13.07.2021 16:10, Stefan Hajnoczi wrote:
On Mon, Jul 12, 2021 at 10:41:46AM +0200, Emanuele Giuseppe Esposito wrote:

On 08/07/2021 15:04, Stefan Hajnoczi wrote:
On Thu, Jul 08, 2021 at 01:32:12PM +0200, Paolo Bonzini wrote:
On 08/07/21 12:36, Stefan Hajnoczi wrote:
What is very clear from this patch is that it
is strictly related to the bdrv_* and lower-level calls, because
they also internally check or even use the AioContext lock.
Therefore, in order to make it work, I temporarily added some
aio_context_acquire/release pairs around the functions that
still assert for the lock or assume it is held, and temporarily
unlock (unlock() - lock()).

Sounds like the issue is that this patch series assumes AioContext locks
are no longer required for calling the blk_*()/bdrv_*() APIs? That is
not the case yet, so you had to then add those aio_context_lock() calls
back in elsewhere. This approach introduces unnecessary risk. I think we
should wait until blk_*()/bdrv_*() no longer requires the caller to hold
the AioContext lock before applying this series.

In general I'm in favor of pushing the lock further down into smaller and
smaller critical sections; it's a good approach to make further audits
easier until it's "obvious" that the lock is unnecessary.  I haven't yet
reviewed Emanuele's patches to see if this is what he's doing where he's
adding the acquire/release calls, but that's my understanding of both his
cover letter and your reply.

The problem is the unnecessary risk. We know what the goal is for
blk_*()/bdrv_*() but it's not quite there yet. Does making changes in
block jobs help solve the final issues with blk_*()/bdrv_*()?

Correct me if I am wrong, but it seems to me that the bdrv_*()/blk_*()
operations mostly take care of building, modifying and walking the BDS graph.
So since the graph's nodes can be in different AioContexts, it makes sense
that we have a lock when modifying the graph, right?

If so, we can simply try to replace the AioContext lock with a graph lock,
or something like that. But I am not sure of this.

Block graph manipulation (all_bdrv_states and friends) requires the BQL.
It has always been this way.

This raises the question: if block graph manipulation is already under
the BQL and BlockDriver callbacks don't need the AioContext anymore, why

I don't believe that block drivers are thread-safe now. They have some
mutexes, but who has done an audit of their thread-safety? I work mostly with
the nbd and qcow2 drivers, and they have never seemed thread-safe to me. For
example, the qcow2 driver has plenty of operations that are done from
non-coroutine context, where the qcow2 co-mutex is simply not locked.

are aio_context_acquire() calls still needed in block jobs?

AIO_WAIT_WHILE() requires that AioContext is acquired according to its
documentation, but I'm not sure that's true anymore. Thread-safe/atomic
primitives are used by AIO_WAIT_WHILE(), so as long as the condition
being waited for is thread-safe too it should work without the
AioContext lock.
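The condition-safety requirement can be sketched with a small model of the wait loop (plain Python, not the real `AIO_WAIT_WHILE()` macro; `aio_wait_while` here is an invented stand-in): the waiter polls a condition published through a thread-safe primitive, so no outer lock is needed around the wait.

```python
import threading
import time

def aio_wait_while(cond, poll_interval=0.001):
    # Model of AIO_WAIT_WHILE(ctx, cond): poll until cond() becomes false.
    # Safe without an outer lock only if cond() itself reads thread-safe state.
    while cond():
        time.sleep(poll_interval)

flag = threading.Event()  # thread-safe publication of "job finished"

def job():
    time.sleep(0.01)
    flag.set()

t = threading.Thread(target=job)
t.start()
aio_wait_while(lambda: not flag.is_set())
t.join()
print("wait completed")
```

If the condition instead read unsynchronized shared state, the loop would need the lock (or atomics) to be correct, which is the open question above.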

Back to my comment about unnecessary risk, pushing the lock down is a
strategy for exploring the problem, but I'm not sure those intermediate
commits need to be committed to qemu.git/master because of the time
required to review them and the risk of introducing (temporary) bugs.

I agree. Let me add my bit of criticism:

What I dislike about the whole thread-safety update you are doing:

1. There is no proof of concept: some good example of multiqueue, or something
that uses multiple threads and shows good performance. Something that works, and
shows where we are going and why it is good. That may be draft patches with a
lot of "FIXME" and "TODO", but working. For now I feel that I've spent my time
reviewing and proving to myself the thread-safety of the two previous
thread-safe series, but I have no hope of seeing a benefit from them in the
near future.

1.1 If we have a proof of concept, that also gives us a kind of plan: a list of
subsystems (patch series) to work through step by step, finally arriving at
what we want. Do you have a plan like that (for the whole feature) now?

2. There are no tests: something that doesn't work before the series and starts
to work after it. Why this is important:

All these thread-safety primitives are complicated enough. They are hard to
review and to prove correct, and very easy to break with new code. Tests run by
CI prove that we don't break subsystems that are already thread-safe. For
example, you've recently updated block-copy and several related things, but we
have no tests. So, assume that a year later you finish the work of updating all
the other subsystems to be thread-safe. You'll have no guarantee that
block-copy is still thread-safe, and you'll have to start from the beginning.
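A CI regression test of the kind argued for above could be as small as a stress test that hammers the subsystem's shared state from many threads and checks its invariants afterwards. A minimal sketch (Python, invented names; `CopyState` is only a stand-in for block-copy's shared state, not its real structure):

```python
import threading

class CopyState:
    """Stand-in for a subsystem's shared state: counters guarded by a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.in_flight = 0
        self.max_seen = 0

    def start(self):
        with self.lock:
            self.in_flight += 1
            self.max_seen = max(self.max_seen, self.in_flight)

    def finish(self):
        with self.lock:
            self.in_flight -= 1

def worker(st, n):
    for _ in range(n):
        st.start()
        st.finish()

st = CopyState()
threads = [threading.Thread(target=worker, args=(st, 10000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# The invariant CI would check on every commit: all operations balanced out.
assert st.in_flight == 0
print("thread-safety invariant held")
```

Run on every commit, such a test at least catches the case where a later series silently removes or bypasses the locking that an earlier series was reviewed for.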

3. As I said, I really doubt that the block drivers are already thread-safe. An
audit and/or tests are needed at the very least.

Best regards,
