From: Dr. David Alan Gilbert
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Mon, 29 Nov 2021 14:50:21 +0000
User-agent: Mutt/2.1.3 (2021-09-10)
* Li Zhang (lizhang@suse.de) wrote:
>
> On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > When doing live migration with multifd channels 8, 16 or larger number,
> > > > the guest hangs in the presence of the network errors such as missing
> > > > TCP ACKs.
> > > >
> > > > At sender's side:
> > > > The main thread is blocked on qemu_thread_join, migration_fd_cleanup
> > > > is called because one thread fails on qio_channel_write_all when
> > > > the network problem happens and other send threads are blocked on
> > > > sendmsg.
> > > > They cannot be terminated, so the main thread stays blocked in
> > > > qemu_thread_join, waiting for those threads to terminate.
> > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > trying to join the thread ?
> > I agree a timeout is wrong here; there is no way to get a good timeout
> > value.
> > However, I'm a bit confused - we should be able to try a shutdown on the
> > receive side using the 'yank' command. - that's what it's there for; Li
> > does this solve your problem?
>
> No, I tried to register 'yank' on the receive side, the receive threads are
> still waiting there.
>
> It seems that on send side, 'yank' doesn't work either when the send threads
> are blocked.
>
> This may not be a case for calling yank. I am not quite sure about it.
We need to fix that; 'yank' should be able to recover from any network
issue. If it's not working we need to understand why.
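
For reference, a minimal sketch of the shutdown-before-join idea Daniel
raises above: a worker blocked in a socket send is kicked out by calling
shutdown() on the fd, after which the join returns. This is plain POSIX
sockets and pthreads on Linux, not QEMU code; `struct sender`,
`sender_thread` and `shutdown_then_join_demo` are hypothetical names, and a
local socketpair whose peer never reads stands in for a stalled TCP
connection.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical worker state standing in for one multifd send thread. */
struct sender {
    int fd;          /* socket the worker writes to */
    int last_errno;  /* errno from the send() call that failed */
};

static void *sender_thread(void *opaque)
{
    struct sender *s = opaque;
    char buf[4096];

    assert(s->fd >= 0);
    memset(buf, 0, sizeof(buf));
    /* The peer never reads, so send() eventually blocks on a full buffer. */
    for (;;) {
        if (send(s->fd, buf, sizeof(buf), MSG_NOSIGNAL) < 0) {
            s->last_errno = errno;
            break;
        }
    }
    return NULL;
}

/* Returns 0 if the sender was kicked out of send() and joined, else -1. */
int shutdown_then_join_demo(void)
{
    int sv[2];
    pthread_t tid;
    struct sender s = { .fd = -1, .last_errno = 0 };
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 200 * 1000 * 1000 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    s.fd = sv[0];
    if (pthread_create(&tid, NULL, sender_thread, &s) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    nanosleep(&pause, NULL);     /* give the worker time to block in send() */
    shutdown(sv[0], SHUT_RDWR);  /* kick the worker out of send() ... */
    pthread_join(tid, NULL);     /* ... and only then wait for it to exit */

    close(sv[0]);
    close(sv[1]);
    return s.last_errno == EPIPE ? 0 : -1;
}
```

Calling shutdown_then_join_demo() should return 0 on Linux. If the join is
done without the preceding shutdown(), the worker stays in send() and the
caller waits indefinitely, which is the sender-side hang described in the
patch.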
> >
> > multifd_load_cleanup already kicks sem_sync before trying to do a
> > thread_join - so have we managed to trigger that on the receive side?
>
> There is no problem with sem_sync in function multifd_load_cleanup.
>
> But it is not called in my case, because no errors are detected on the
> receive side.
If you're getting TCP errors why aren't you seeing any errors on the
receive side?
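
The kick-then-join ordering mentioned above for multifd_load_cleanup can be
sketched in isolation as follows. This is a generic POSIX semaphore and
pthread example, not the QEMU implementation; `struct recv_worker`, its
fields, and `kick_then_join_demo` are hypothetical names chosen to mirror
the discussion.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one multifd receive channel. */
struct recv_worker {
    sem_t sem_sync;    /* the worker sleeps on this semaphore */
    atomic_bool quit;  /* set by cleanup to ask the worker to exit */
    pthread_t thread;
};

static void *recv_worker_thread(void *opaque)
{
    struct recv_worker *w = opaque;

    /* Sleep until posted, then re-check whether we were asked to quit. */
    while (!atomic_load(&w->quit)) {
        sem_wait(&w->sem_sync);
    }
    return NULL;
}

/* Returns 0 if the worker was woken and joined, -1 on setup failure. */
int kick_then_join_demo(void)
{
    struct recv_worker w;
    int rc;

    atomic_init(&w.quit, false);
    if (sem_init(&w.sem_sync, 0, 0) != 0) {
        return -1;
    }
    if (pthread_create(&w.thread, NULL, recv_worker_thread, &w) != 0) {
        sem_destroy(&w.sem_sync);
        return -1;
    }

    atomic_store(&w.quit, true);        /* 1. tell the worker to stop */
    sem_post(&w.sem_sync);              /* 2. kick it out of sem_wait() */
    rc = pthread_join(w.thread, NULL);  /* 3. the join can now complete */
    assert(rc == 0);

    sem_destroy(&w.sem_sync);
    return rc == 0 ? 0 : -1;
}
```

The order matters: the quit flag is set before the post, so a worker woken
by the post always observes the flag and exits instead of going back to
sleep. This only helps if the cleanup path actually runs, which is the open
question in this thread.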
> The problem is here:
>
> void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
> {
>     MigrationIncomingState *mis = migration_incoming_get_current();
>     Error *local_err = NULL;
>     bool start_migration;
>
>     ...
>
>     if (!mis->from_src_file) {
>
>         ...
>
>     } else {
>         /* Multiple connections */
>         assert(migrate_use_multifd());
>         start_migration = multifd_recv_new_channel(ioc, &local_err);
>         if (local_err) {
>             error_propagate(errp, local_err);
>             return;
>         }
>     }
>     if (start_migration) {
>         migration_incoming_process();
>     }
> }
>
> start_migration is always 0, and migration is not started because some
> receive threads are not created.
>
> No errors are detected here and the main process works well but receive
> threads are all waiting for semaphore.
>
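
The behaviour Li describes (start_migration staying 0 while some channels
have not connected) is what one would expect if the new-channel hook only
reports true once every expected channel has arrived. A minimal sketch of
that counting logic follows, under that assumption. It is not taken from
QEMU: `struct recv_channels`, `recv_new_channel` and `recv_channels_missing`
are hypothetical names, and the code assumes all calls come from a single
main-loop thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for incoming multifd channels. */
struct recv_channels {
    int expected;  /* number of channels the source is supposed to open */
    int created;   /* number of channels accepted so far */
};

/* Called once per accepted connection; true only when the last one arrives. */
bool recv_new_channel(struct recv_channels *rc)
{
    assert(rc->created < rc->expected);
    rc->created++;
    return rc->created == rc->expected;
}

/* How many channels (and therefore receive threads) are still missing. */
int recv_channels_missing(const struct recv_channels *rc)
{
    return rc->expected - rc->created;
}
```

With expected set to 3, the first two calls to recv_new_channel() return
false and recv_channels_missing() reports 1. A check of this kind is what a
cleanup path could consult to notice that some receive threads were never
created.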
> It's hard to know if the receive threads are not created. If we can find a
> way to check if any receive threads are not created, we can kick the
> sem_sync and do cleanup.
So is this only a problem for network issues that happen during startup,
before all the threads have been created?
Dave
>
> From the source code, the thread will be created when QIO channel detects
> something by GIO watch if I understand correctly.
>
> If nothing is detected, socket_accept_incoming_migration won't be called and
> the thread will not be created.
>
> socket_start_incoming_migration_internal ->
>
>     qio_net_listener_set_client_func_full(listener,
>                                           socket_accept_incoming_migration,
>                                           NULL, NULL,
>                                           g_main_context_get_thread_default());
>
> qio_net_listener_set_client_func_full ->
>
>     qio_channel_add_watch_source(
>         QIO_CHANNEL(listener->sioc[i]), G_IO_IN,
>         qio_net_listener_channel_func,
>         listener, (GDestroyNotify)object_unref, context);
>
> socket_accept_incoming_migration ->
>
> migration_channel_process_incoming ->
>
> migration_ioc_process_incoming ->
>
> multifd_recv_new_channel ->
>
>     qemu_thread_create(&p->thread, p->name,
>                        multifd_recv_thread, p,
>                        QEMU_THREAD_JOINABLE);
>
> >
> > Dave
> >
> > > Regards,
> > > Daniel
> > > --
> > > |: https://berrange.com -o-
> > > https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org -o-
> > > https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org -o-
> > > https://www.instagram.com/dberrange :|
> > >
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK