Re: [PATCH 0/4] colo: Introduce resource agent and high-level test


From: Dr. David Alan Gilbert
Subject: Re: [PATCH 0/4] colo: Introduce resource agent and high-level test
Date: Wed, 18 Dec 2019 19:46:25 +0000
User-agent: Mutt/1.13.0 (2019-11-30)

* Lukas Straub (address@hidden) wrote:
> On Wed, 27 Nov 2019 22:11:34 +0100
> Lukas Straub <address@hidden> wrote:
> 
> > On Fri, 22 Nov 2019 09:46:46 +0000
> > "Dr. David Alan Gilbert" <address@hidden> wrote:
> >
> > > * Lukas Straub (address@hidden) wrote:
> > > > Hello Everyone,
> > > > These patches introduce a resource agent for use with the Pacemaker
> > > > CRM and a high-level test utilizing it for testing qemu COLO.
> > > >
> > > > The resource agent manages qemu COLO including continuous replication.
> > > >
> > > > Currently the second test case (where the peer qemu is frozen)
> > > > fails on primary failover, because qemu hangs while removing the
> > > > replication-related block nodes. Note that this also happens in
> > > > real-world tests when cutting power to the peer host, so this
> > > > needs to be fixed.
> > >
> > > Do you understand why that happens? Is it that it's trying to
> > > finish a read/write to the dead partner?
> > >
> > > Dave
> >
> > I haven't looked into it too closely yet, but it's often hanging in
> > bdrv_flush() while removing the replication blockdev, and that's
> > probably because the NBD client waits for a reply. So I tried the
> > workaround below, which actively kills the TCP connection, and with
> > it the test passes, though I haven't tested it in the real world yet.
> >
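For context on why actively killing the connection helps: shutdown() on a
socket forces a thread blocked in read() on it to return straight away,
which is presumably what a hung NBD client needs once a frozen peer will
never reply. A minimal self-contained sketch of that mechanism, in plain C
rather than QEMU code, with a socketpair standing in for the NBD
connection:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *reader(void *arg)
    {
        int fd = *(int *)arg;
        char c;
        /* Blocks forever: the "peer" end never writes, like a frozen host. */
        ssize_t n = read(fd, &c, 1);
        printf("read returned %zd\n", n);   /* 0 (EOF) after shutdown() */
        return NULL;
    }

    int main(void)
    {
        int sv[2];
        pthread_t tid;

        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        pthread_create(&tid, NULL, reader, &sv[0]);
        sleep(1);                        /* let the reader block in read() */
        shutdown(sv[0], SHUT_RDWR);      /* "actively kill" the connection */
        pthread_join(tid, NULL);
        close(sv[0]);
        close(sv[1]);
        return 0;
    }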
> 
> In the real cluster, sometimes qemu even hangs while connecting to QMP
> (after remote poweroff). But I currently don't have the time to look
> into it.

That doesn't surprise me too much; QMP is mostly handled in the main
thread, as are a lot of other things, which is why my assumption for a
while has been that COLO failures hang it.  However, there's a way to
fix it.

A while ago, Peter Xu added a feature called 'out of band' to QMP; you
can open a QMP connection, enable the OOB capability, and then commands
that are marked as OOB are executed off the main thread on that
connection.

At the moment we've just got the one real OOB command, 'migrate-recover',
which is used for recovering a postcopy migration from a failure similar
to the COLO case.
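For reference, the wire flow looks roughly like this: the client enables
the "oob" capability right after the greeting, then issues OOB commands
with "exec-oob" instead of "execute". A minimal client sketch in C (the
socket path /tmp/qmp.sock and the recovery URI are made-up values):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    static void send_line(int fd, const char *json)
    {
        write(fd, json, strlen(json));
        write(fd, "\n", 1);
    }

    static void read_reply(int fd)
    {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = 0;
            printf("%s", buf);
        }
    }

    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        strncpy(addr.sun_path, "/tmp/qmp.sock", sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        read_reply(fd);   /* greeting; advertises "oob" in capabilities */
        send_line(fd, "{\"execute\": \"qmp_capabilities\","
                      " \"arguments\": {\"enable\": [\"oob\"]}}");
        read_reply(fd);   /* {"return": {}} */
        /* exec-oob commands run off the main thread, so they still work
         * even when the main loop is wedged. */
        send_line(fd, "{\"exec-oob\": \"migrate-recover\","
                      " \"arguments\": {\"uri\": \"tcp:0:4444\"},"
                      " \"id\": \"recover-1\"}");
        read_reply(fd);
        close(fd);
        return 0;
    }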

To fix this you'd have to convert colo-lost-heartbeat to be an OOB
command; note it's not that trivial, because you have to make sure the
code that runs as part of the OOB command doesn't take any locks that
could block on something in the main thread. It can set flags, start
new threads, perhaps call shutdown() on a socket, but it takes some
thinking about; a sketch of the rough shape follows.
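A hypothetical sketch in plain C (not actual QEMU code; failover_requested,
nbd_fd and failover_worker are made-up names) of what such a handler is
limited to:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>
    #include <sys/socket.h>

    static atomic_bool failover_requested;
    static int nbd_fd = -1;                /* connection to the dead peer */

    static void *failover_worker(void *arg)
    {
        (void)arg;
        /* The slow work - tearing down the replication block nodes and
         * promoting the secondary - happens here, not in the handler. */
        return NULL;
    }

    /* What an OOB-safe colo-lost-heartbeat handler may do: set flags,
     * unblock I/O, spawn a thread.  Crucially, it takes no lock that the
     * stuck main thread might hold. */
    void oob_colo_lost_heartbeat(void)
    {
        pthread_t tid;

        atomic_store(&failover_requested, true);
        if (nbd_fd >= 0) {
            shutdown(nbd_fd, SHUT_RDWR);   /* wake anything stuck in a read */
        }
        pthread_create(&tid, NULL, failover_worker, NULL);
        pthread_detach(tid);
    }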


> Still, a failing test is better than no test. Could we mark this test
> as known-bad and fix the issue later? How should I mark it as
> known-bad? By tag? Or warn in the log?

Not sure of that; cc'ing Thomas - maybe thuth knows?

Dave

> Regards,
> Lukas Straub
> 
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK



