qemu-devel

Re: [Qemu-devel] [RFC 0/6] monitor: allow per-monitor thread


From: Fam Zheng
Subject: Re: [Qemu-devel] [RFC 0/6] monitor: allow per-monitor thread
Date: Tue, 22 Aug 2017 10:15:56 +0800
User-agent: Mutt/1.8.3 (2017-05-23)

On Mon, 08/21 18:28, Dr. David Alan Gilbert wrote:
> > It's not much more than asserting qemu_mutex_iothread_locked(); the problem
> > is that the new monitor thread breaks certain assumptions that used to be
> > true.
> > 
> > What is interesting in this is that the block layer's nested aio_poll() now
> > runs not only in the main thread but also in the monitor thread. Bugs may
> > hide there. :)
> > 
> > That's why I suggested a "safe by default" strategy.
> 
> OK, that's going to need some more flags somewhere; we've now
> effectively got three types of command:
>    a) Commands that can only run in the main thread
>    b) Commands that can run in other monitor threads, but must have the bql
>    c) Commands that can run in other monitor threads but don't take the
>    bql
> 
>    The class (a) that you point out are a pain; arguably if we have to
> split them up then perhaps we should initially only allow (c).
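
To make sure we mean the same thing by the three classes, here is a minimal
sketch of how such flags might look. All names (CMD_ANY_THREAD, CMD_NO_BQL,
cmd_classify, and so on) are hypothetical and do not exist in QEMU; the only
point is that a command with no flags set lands in class (a), so anything not
explicitly audited stays on the main thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags, not existing QEMU API.  A command with no flags
 * set is class (a): main thread only, which is the safe default. */
enum cmd_flags {
    CMD_ANY_THREAD = 1 << 0, /* may run in a per-monitor thread */
    CMD_NO_BQL     = 1 << 1, /* does not take the BQL; only meaningful
                              * together with CMD_ANY_THREAD */
};

enum cmd_class { CLASS_A, CLASS_B, CLASS_C };

struct cmd {
    const char *name;
    unsigned flags;
};

/* Map a command's flags onto the three classes from the thread. */
static enum cmd_class cmd_classify(const struct cmd *cmd)
{
    assert(cmd != NULL);
    if (!(cmd->flags & CMD_ANY_THREAD)) {
        return CLASS_A; /* unmarked commands never leave the main thread */
    }
    return (cmd->flags & CMD_NO_BQL) ? CLASS_C : CLASS_B;
}

/* Decide whether a monitor thread may run the command itself.  With
 * allow_class_b == false only class (c) is accepted, matching the
 * "initially only allow (c)" suggestion. */
static bool cmd_runs_in_monitor_thread(const struct cmd *cmd,
                                       bool allow_class_b)
{
    enum cmd_class c = cmd_classify(cmd);
    return c == CLASS_C || (c == CLASS_B && allow_class_b);
}
```

A command flagged CMD_NO_BQL without CMD_ANY_THREAD is still treated as class
(a) here, so a half-annotated command errs on the safe side.
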
> 
> > One step back, is it possible to "unblock" main thread even upon network
> > issue? What is the scenario that causes main thread hang? Is there a
> > backtrace?
> 
> There are at least 3 scenarios I know of:
> 
>   a) Postcopy: An IO operation takes the lock and accesses guest memory;
>      the guest memory is missing due to userfault'd memory.
>      Unfortunately the network connection to the source happens to fail,
>      so we never receive that page and the thread stays stuck in the
>      userfault.
>      We can't issue a recovery command to reopen a network connection
>      because the monitor is blocked.
>   b) Postcopy: A monitor command either accesses guest memory or has
>      to wait on another thread that is doing so; e.g. info cpu waits
>      for the CPU threads to exit the loop, but they might be blocked
>      waiting on userfault.
>   c) COLO or migration: The network fails during the critical bit
>      at the end of migration when we have the bql held.  You can't
>      issue a migration_cancel or a colo-failover via the monitor
>      because it's blocked.

Thanks for explaining!

What commands are in class (c)? From the cover letter it seems migrate-incoming
is the only one in mind, and I'm not sure how it resolves any of the three
scenarios.

> 
> There are other advantages of being able to do bql'less commands;
> things like an 'info status' or the like should be doable without bql,
> so just avoiding taking the bql when the management layer is doing
> stuff (or alternatively getting faster replies on management)
> are both useful.

Agreed. It is very useful not just for migration.
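
As a rough illustration of a bql-less query, assuming the run state could be
kept in a single atomic variable (the names below are made up for this sketch
and are not QEMU's actual runstate code), a reader in any monitor thread could
do something like:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical run states; this is not QEMU's real RunState enum. */
enum run_state { STATE_PAUSED, STATE_RUNNING, STATE_MAX };

/* Single word of shared state, read and written atomically. */
static _Atomic int current_state = STATE_PAUSED;

/* Writer side: state transitions are assumed to be serialized by
 * whatever lock the caller already holds (e.g. the bql). */
static void run_state_set(enum run_state s)
{
    assert(s >= STATE_PAUSED && s < STATE_MAX);
    atomic_store(&current_state, (int)s);
}

/* Reader side: takes no lock, so it cannot get stuck behind a thread
 * that holds the bql while blocked on the network or a userfault. */
static enum run_state run_state_get(void)
{
    return (enum run_state)atomic_load(&current_state);
}

/* Human-readable name for an 'info status'-style reply. */
static const char *run_state_name(enum run_state s)
{
    switch (s) {
    case STATE_PAUSED:
        return "paused";
    case STATE_RUNNING:
        return "running";
    default:
        return "unknown";
    }
}
```

This only works for state that fits in one atomically readable value; anything
that needs a consistent view of several fields would still need a lock of its
own.
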

Fam


