
Re: [Qemu-devel] [PATCH 5/7] block/mirror: utilize job_exit shim


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH 5/7] block/mirror: utilize job_exit shim
Date: Wed, 29 Aug 2018 10:28:57 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 2018-08-28 22:25, John Snow wrote:
> 
> 
> On 08/25/2018 11:02 AM, Max Reitz wrote:
>> If you say so...  I have to admit I don't really understand.  The
>> comment doesn't explain why it's so important to keep src around until
>> job_completed(), so I don't know.  I thought AioContexts are recursive
>> so it doesn't matter whether you take them recursively or not.
> 
> bdrv_flush has troubles under a recursive lock if it is invoked from a
> different thread. It tries to poll for flush completion but the
> coroutine which gets scheduled (instead of entered) can't actually run
> if we hold the lock twice from, say, the main thread -- which is where
> we're doing the graph manipulation from.
> 
>>        co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
>>        bdrv_coroutine_enter(bs, co);
>>        BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
> 
> BDRV_POLL_WHILE there causes us the grief via AIO_WAIT_WHILE, which
> only drops the AioContext lock once before polling -- so if the lock
> was taken twice, the I/O thread still can't acquire it, and we deadlock
> in bdrv_flush.
> 
> Kevin helped me figure it out; I CC'd him off-list on a mail that you
> were also CC'd on ("hang in mirror_exit") that's probably pretty helpful:
> 
>> Took a little more than five minutes, but I think I've got it now. The
>> important thing is that the test case is for dataplane, i.e. the job
>> runs in an I/O thread AioContext. Job completion, however, happens in
>> the main loop thread.
>>
>> Therefore, when bdrv_delete() wants to flush the node first, it needs to
>> run bdrv_co_flush() in a foreign AioContext, so the coroutine is
>> scheduled. The I/O thread backtrace shows that it's waiting for the
>> AioContext lock before it can actually enter the bdrv_co_flush()
>> coroutine, so we must have deadlocked:

OK, that makes more sense.  I still would have thought that you should
always be allowed to take an AioContext lock more than once in a single
other AioContext, but on my way through the commit log, I found
bd6458e410c1e7, d79df2a2ceb3cb07711, and maybe most importantly
17e2a4a47d4.  So maybe not.

So at some point we decided that, yeah, you can take them multiple times
in a single context, and, yeah, that was how it was designed, but don't
do that if you expect a BDRV_POLL_WHILE().

OK.  Got it now.

Max
