qemu-devel
Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Date: Fri, 20 Nov 2020 21:19:34 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0

20.11.2020 20:22, Kevin Wolf wrote:
On 20.11.2020 at 17:43, Vladimir Sementsov-Ogievskiy wrote:
20.11.2020 19:36, Kevin Wolf wrote:
On 20.11.2020 at 17:16, Vladimir Sementsov-Ogievskiy wrote:
Hi all!

As Peter recently noted, iotest 30 fails intermittently.

I found that QEMU crashes due to interleaving of the graph-update
operations of the parallel mirror and stream block jobs.

I haven't found the time yet to properly look into this or your other
thread where you had a similar question, but there is one thing I'm
wondering: Why can the nested job even make progress and run its
completion handler?

When we modify the graph, we should have drained the subtree in
question, so in theory while one job finishes and modifies the graph,
there should be no way for the other job to make progress and get
interleaved - it shouldn't be able to start I/O requests and much less
to run its completion handler and modify the graph.

Are we missing drained sections somewhere or do they fail to achieve
what I think they should achieve?
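
Just to make sure we're talking about the same pattern: what I have in
mind is roughly the following (only a sketch, the wrapper function is
made up):

static void example_replace_node(BlockDriverState *from,
                                 BlockDriverState *to, Error **errp)
{
    /* quiesce in-flight requests and pause jobs on the subtree */
    bdrv_drained_begin(from);

    /* the graph modification itself happens while everything is drained */
    bdrv_replace_node(from, to, errp);

    bdrv_drained_end(from);
}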


It looks like both jobs reach their finish simultaneously. So all
progress is done in both jobs, and they proceed concurrently to their
completion procedures, which interleave. So there is no more I/O
through blk, which is what the drained sections restrict.

They can't be truly simultaneous because they run in the same thread.
During job completion, this is the main thread.

No, they are not truly simultaneous, but the completions may interleave
through nested aio_poll loops.
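
Something like this (a rough sketch, the function names are invented
just to illustrate the interleaving):

static void job_a_clean_up(void)
{
    /* ... first part of job A's graph update ... */

    /* somewhere inside, the clean-up waits for something and re-enters
     * the event loop, e.g. via AIO_WAIT_WHILE() or directly: */
    aio_poll(qemu_get_aio_context(), true);

    /* if job B's completion BH was pending, it has run by now and may
     * have modified the graph under us */

    /* ... rest of job A's graph update sees a changed graph ... */
}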


However as soon as job_is_completed() returns true, it seems we're not
pausing the job any more when one of its nodes gets drained.

Possibly also relevant: The job->busy = false in job_exit(). The comment
there says it's a lie, but we might deadlock otherwise.
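
I mean the check in child_job_drained_poll() in blockjob.c, which
(roughly, from memory, so this may not be the exact code) looks like
this:

static bool child_job_drained_poll(BdrvChild *c)
{
    BlockJob *bjob = c->opaque;
    Job *job = &bjob->job;

    /* a completed or non-busy job is treated as quiescent, so drain
     * doesn't wait for it - even though its completion code can still
     * run in the main loop and modify the graph */
    if (!job->busy || job_is_completed(job)) {
        return false;
    }

    return true;
}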

This problem will probably affect other callers, too, which drain a
subtree and then reasonably expect that nobody will modify the graph
until they end the drained section. So I think the problem that we need
to address is that jobs run their completion handlers even though they
are supposed to be paused by a drain.

Hmm. I have always thought of a drained section as something that stops
I/O requests, not other operations. And we do graph modifications in a
drained section to avoid in-flight I/O requests during the graph
modification.


I'm not saying that your graph modification locks are a bad idea, but
they are probably not a complete solution.


Hmm. What do you mean? It's of course not complete, as I didn't protect
every graph-modification procedure. But if we protect all such things
and always do graph modifications under this mutex, I think it should
work.
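
What I mean is roughly this (just a sketch with invented names; the
actual series may structure it differently):

/* assume qemu_co_mutex_init(&graph_modify_mutex) was called at startup */
static CoMutex graph_modify_mutex;

static void coroutine_fn example_update_graph(BlockDriverState *from,
                                              BlockDriverState *to,
                                              Error **errp)
{
    qemu_co_mutex_lock(&graph_modify_mutex);

    /* all graph modifications go under the same mutex, so two jobs
     * finishing at the same time can't interleave their updates */
    bdrv_replace_node(from, to, errp);

    qemu_co_mutex_unlock(&graph_modify_mutex);
}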

--
Best regards,
Vladimir


