qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30


From: Kevin Wolf
Subject: Re: [PATCH RFC 0/5] Fix accidental crash in iotest 30
Date: Fri, 20 Nov 2020 18:22:51 +0100

Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 20.11.2020 19:36, Kevin Wolf wrote:
> > Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > Hi all!
> > > 
> > > As Peter recently noted, iotest 30 accidentally fails.
> > > 
> > > I found that Qemu crashes due to interleaving of graph-update
> > > operations of parallel mirror and stream block-jobs.
> > 
> > I haven't found the time yet to properly look into this or your other
> > thread where you had a similar question, but there is one thing I'm
> > wondering: Why can the nested job even make progress and run its
> > completion handler?
> > 
> > When we modify the graph, we should have drained the subtree in
> > question, so in theory while one job finishes and modifies the graph,
> > there should be no way for the other job to make progress and get
> > interleaved - it shouldn't be able to start I/O requests and much less
> > to run its completion handler and modify the graph.
> > 
> > Are we missing drained sections somewhere or do they fail to achieve
> > what I think they should achieve?
> > 
> 
> It all looks like both jobs are reached their finish simultaneously.
> So, all progress is done in both jobs. And they go concurrently to
> completion procedures which interleaves. So, there no more io through
> blk, which is restricted by drained sections.

They can't be truly simultaneous because they run in the same thread.
During job completion, this is the main thread.

However as soon as job_is_completed() returns true, it seems we're not
pausing the job any more when one of its nodes gets drained.

Possibly also relevant: The job->busy = false in job_exit(). The comment
there says it's a lie, but we might deadlock otherwise.

This problem will probably affect other callers, too, which drain a
subtree and then resonably expect that nobody will modify the graph
until they end the drained section. So I think the problem that we need
to address is that jobs run their completion handlers even though they
are supposed to be paused by a drain.

I'm not saying that your graph modification locks are a bad idea, but
they are probably not a complete solution.

Kevin




reply via email to

[Prev in Thread] Current Thread [Next in Thread]