From: Daniel P . Berrangé
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Wed, 1 Dec 2021 14:09:56 +0000
User-agent: Mutt/2.1.3 (2021-09-10)

On Wed, Dec 01, 2021 at 02:42:04PM +0100, Li Zhang wrote:
> 
> On 12/1/21 1:22 PM, Daniel P. Berrangé wrote:
> > On Wed, Dec 01, 2021 at 01:11:13PM +0100, Li Zhang wrote:
> > > On 11/29/21 3:50 PM, Dr. David Alan Gilbert wrote:
> > > > * Li Zhang (lizhang@suse.de) wrote:
> > > > > On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
> > > > > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > > > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > > > > > > When doing live migration with multifd channels 8, 16 or larger number,
> > > > > > > > > the guest hangs in the presence of the network errors such as
> > > > > > > > > missing TCP ACKs.
> > > > > > > > 
> > > > > > > > At sender's side:
> > > > > > > > > The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> > > > > > > > > is called because one thread fails on qio_channel_write_all when
> > > > > > > > > the network problem happens and other send threads are blocked on sendmsg.
> > > > > > > > > They could not be terminated. So the main thread is blocked on qemu_thread_join
> > > > > > > > > to wait for the threads terminated.
> > > > > > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > > > > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > > > > > trying to join the thread ?
> > > > > > > I agree a timeout is wrong here; there is no way to get a good timeout
> > > > > > > value.
> > > > > > > However, I'm a bit confused - we should be able to try a shutdown on the
> > > > > > > receive side using the 'yank' command. - that's what it's there for; Li
> > > > > > > does this solve your problem?
> > > > > > No, I tried to register 'yank' on the receive side, the receive threads are
> > > > > > still waiting there.
> > > > > > 
> > > > > > It seems that on send side, 'yank' doesn't work either when the send threads
> > > > > > are blocked.
> > > > > 
> > > > > This may be not the case to call yank. I am not quite sure about it.
> > > > We need to fix that; 'yank' should be able to recover from any network
> > > > issue.  If it's not working we need to understand why.
> > > Hi Dr. David,
> > > 
> > > > On the receive side, I register 'yank' and it is called. But it is just to
> > > > shut down the channels,
> > > > it couldn't fix the problem of the receive threads which are waiting for the
> > > > semaphore.
> > > > 
> > > > So the receive threads are still waiting there.
> > > > 
> > > > On the send side,  the main process is blocked on qemu_thread_join(), when I
> > > > tried the 'yank'
> > > > command with QMP,  it is not handled. So the QMP doesn't work and yank
> > > > doesn't work.
> > IOW, there is a bug in QEMU on the send side. It should not be calling
> > qemu_thread_join() from the main thread, unless it is extremely
> > confident that the thread in question has already finished.
> > 
> > You seem to be showing that the thread(s) are still running, so we
> > need to understand why that is the case, and why the main thread
> > still decided to try to join these threads which haven't finished.
> 
> Some threads are running. But there is one thread fails to
> qio_channel_write_all.
> 
> In migration_thread(), it detects an error here:
> 
>     thr_error = migration_detect_error(s);
>     if (thr_error == MIG_THR_ERR_FATAL) {
>         /* Stop migration */
>         break;
> 
> It will stop migration and cleanup.

Those threads which are still running need to be made to
terminate before trying to join them.

A quick glance at multifd_send_terminate_threads() makes me
suspect multifd shutdown is not reliable.

It is merely setting some boolean flags and posting to a
semaphore. It is doing nothing to shut down the socket
associated with each thread, so the threads can still be
waiting in an I/O call. IMHO multifd_send_terminate_threads
needs to call qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH)
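
To illustrate the ordering being suggested (shut the socket down first, join
the thread second), here is a minimal sketch. It is an assumption-laden
stand-in, not QEMU code: it uses plain POSIX sockets and pthreads rather than
the QIOChannel API, and the names blocked_io_thread and
shutdown_then_join_demo are hypothetical. It relies on the common behaviour
(e.g. on Linux) that shutdown() on a socket wakes a thread blocked in recv()
on that same socket.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a multifd channel thread: blocks in recv(). */
static void *blocked_io_thread(void *opaque)
{
    int fd = *(int *)opaque;
    char buf[64];

    assert(fd >= 0);
    /* Blocks until data arrives, the peer closes, or the socket is shut down. */
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    return (void *)(intptr_t)n;
}

/* Returns 0 if the thread left recv() with end-of-stream and was joined. */
int shutdown_then_join_demo(void)
{
    int sv[2];
    pthread_t tid;
    void *ret = NULL;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pthread_create(&tid, NULL, blocked_io_thread, &sv[0]) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Give the thread time to block in recv(); not required for correctness. */
    usleep(100 * 1000);

    /* Kick the thread out of its I/O call first ... */
    shutdown(sv[0], SHUT_RDWR);
    /* ... and only then join it, so the join cannot wait forever. */
    pthread_join(tid, &ret);

    close(sv[0]);
    close(sv[1]);
    return (intptr_t)ret == 0 ? 0 : -1;
}
```

Without the shutdown() call the peer never sends anything, so the
pthread_join() in this sketch would block indefinitely, which is the same
shape as the hang described on the send side.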


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



