From: Wang, Wei W
Subject: RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN
Date: Tue, 2 Apr 2024 07:25:51 +0000
On Tuesday, April 2, 2024 2:56 PM, Wang, Lei4 wrote:
> On 4/2/2024 0:13, Peter Xu wrote:
> > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote:
> >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> >>> When using the post-copy preemption feature to perform post-copy
> >>> live migration, the following scenario can lead to a deadlock, and
> >>> the migration will never finish:
> >>>
> >>> - The source connect()s the preemption channel in postcopy_start().
> >>> - The source and destination TCP stacks finish the 3-way handshake,
> >>>   so the connection is successful.
> >>> - The destination main thread is handling the loading of the bulk RAM
> >>>   pages, so it hasn't started handling the pending connection event in
> >>>   the event loop, and hasn't posted the semaphore
> >>>   postcopy_qemufile_dst_done for the preemption thread.
> >>> - The source side sends non-iterative device states, such as the
> >>>   virtio states.
> >>> - The destination main thread starts to receive the virtio states;
> >>>   this process may lead to a page fault (e.g.,
> >>>   virtio_load()->vring_avail_idx() may trigger a page fault since the
> >>>   avail ring page may not have been received yet).
> >
> > Ouch. Yeah I think this part got overlooked when working on the
> > preempt channel.
> >
> >>> - The page request is sent back to the source side. The source sends
> >>>   the page content to the destination side's preemption thread.
> >>> - Since the connection event has not arrived and the semaphore
> >>>   postcopy_qemufile_dst_done has not been posted, the preemption
> >>>   thread on the destination side is blocked and cannot receive the page.
> >>> - The QEMU main load thread on the destination side is stuck at the
> >>>   page fault and cannot yield to handle the connect() event for the
> >>>   preemption channel to unblock the preemption thread.
> >>> - The postcopy migration is stuck there forever since this is a deadlock.
> >>>
> >>> The key to reproducing this bug is that the source side sends pages
> >>> faster than the destination can handle them; otherwise,
> >>> qemu_get_be64() in ram_load_precopy() gets a chance to yield, since
> >>> at that point there is no pending data in the buffer to read, which
> >>> makes the bug harder to reproduce.
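For context on why a fast source defeats that yield: the destination's read
path only yields when the socket has no pending data. A simplified sketch of
the fill path in migration/qemu-file.c (variable names abbreviated; not the
verbatim code):

    /* A non-blocking read that would block returns QIO_CHANNEL_ERR_BLOCK;
     * a coroutine caller then yields to the event loop until the channel
     * is readable, which is what gives the pending accept() a chance to
     * be handled.  If data is always pending, this branch never runs. */
    do {
        len = qio_channel_read(f->ioc, (char *)buf, pending, &local_err);
        if (len == QIO_CHANNEL_ERR_BLOCK) {
            if (qemu_in_coroutine()) {
                qio_channel_yield(f->ioc, G_IO_IN);
            } else {
                qio_channel_wait(f->ioc, G_IO_IN);
            }
        }
    } while (len == QIO_CHANNEL_ERR_BLOCK);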
> >
> > How hard is this to reproduce?
>
> We can make this easier to reproduce by adding the following code to
> make the destination busier when loading the pages:
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 0ad9fbba48..0b42877e1f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>      int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
> +    volatile unsigned long long a;
>
>      if (!migrate_compress()) {
>          invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE;
> @@ -4347,6 +4348,7 @@ static int ram_load_precopy(QEMUFile *f)
>              break;
>
>          case RAM_SAVE_FLAG_PAGE:
> +            for (a = 0; a < 100000000; a++);
>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>              break;
>
Which version of QEMU are you using?
I tried with the latest upstream QEMU (e.g. the v8.2.0 release, commit
1600b9f46b1bd); it's always reproducible without any changes (with local
migration tests).
> >
> > I'm wondering whether this should be 9.0 material or 9.1. It's pretty
> > late for 9.0, though we can still discuss.
> >
> >>>
> >>> Fix this by yielding the load coroutine when receiving
> >>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> >>> connection event before loading the non-iterative device state,
> >>> avoiding the deadlock condition.
> >>>
> >>> Signed-off-by: Lei Wang <lei4.wang@intel.com>
> >>
> >> This seems to be a regression caused by commit 737840e2c6ea
> >> ("migration: Use the number of transferred bytes directly").
> >>
> >> Adding qemu_fflush back to migration_rate_exceeded() or
> >> ram_save_iterate() seems to work (might not be a good fix though).
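For reference, a minimal sketch of that flush workaround, assuming it goes
into migration_rate_exceeded() (illustrative only, not proposed as a proper
fix; the existing function body is elided):

    bool migration_rate_exceeded(QEMUFile *f)
    {
        /* Workaround sketch: flush buffered pages to the wire on every
         * rate check, restoring the flushing behaviour from before commit
         * 737840e2c6ea so the destination keeps receiving data and the
         * load coroutine gets chances to yield. */
        qemu_fflush(f);

        /* ... existing rate-limit check ... */
    }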
> >>
> >>> ---
> >>>  migration/savevm.c | 5 +++++
> >>>  1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/migration/savevm.c b/migration/savevm.c
> >>> index e386c5267f..8fd4dc92f2 100644
> >>> --- a/migration/savevm.c
> >>> +++ b/migration/savevm.c
> >>> @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
> >>>          return loadvm_postcopy_handle_advise(mis, len);
> >>>
> >>>      case MIG_CMD_POSTCOPY_LISTEN:
> >>> +        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> >>> +            aio_co_schedule(qemu_get_current_aio_context(),
> >>> +                            qemu_coroutine_self());
> >>> +            qemu_coroutine_yield();
> >>> +        }
> >>
> >> The above could be moved to loadvm_postcopy_handle_listen().
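Concretely, that relocation might look like the sketch below (the rest of
the function body is elided; this is illustrative, not a tested patch):

    static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
    {
        /* Yield once so the main event loop can accept() the preempt
         * channel before we continue into the packed device-state load. */
        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
            aio_co_schedule(qemu_get_current_aio_context(),
                            qemu_coroutine_self());
            qemu_coroutine_yield();
        }

        /* ... existing listen handling ... */
    }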
> >
> > I'm not 100% sure such a thing (whether here or moved into that
> > function, which does look cleaner) would work for us.
> >
> > The problem is I still don't see an ordering restriction between (1)
> > accept() happening, and (2) receiving the LISTEN cmd here. What happens
> > if the accept() request has not yet arrived when we reach LISTEN? Or is
> > it always guaranteed that the accept(fd) will have been polled by then?
> >
> > For example, the source QEMU (whether pre-7.2 or later) always sets up
> > the preempt channel asynchronously, so IIUC it can connect() after
> > sending the whole chunk of packed data, which should include this
> > LISTEN. I think that means this is not guaranteed to work 100% of the
> > time, but it may further reduce the likelihood of the race.
>
> I think the following code:
>
> postcopy_start() ->
> postcopy_preempt_establish_channel() ->
> qemu_sem_wait(&s->postcopy_qemufile_src_sem);
>
> can guarantee that the connect() syscall succeeded, so the destination
> side receives the connection request before it loads the LISTEN command;
> otherwise the source won't post the sem:
>
> postcopy_preempt_send_channel_new() ->
> postcopy_preempt_send_channel_done() ->
> qemu_sem_post(&s->postcopy_qemufile_src_sem);
>
Yes. But as mentioned in another thread, connect() and accept() are
asynchronous, so in theory the accept() could still complete after the
LISTEN command arrives.
> >
> > One proper fix I can think of is moving the sem_wait(&done) into the
> > main thread too, so we wait for the sem _before_ reading the packed
> > data and there's no chance of a fault. However, I don't think
> > sem_wait() will be smart enough to yield when in a coroutine. In the
> > long run I think we should really make migration loadvm do its work in
> > a thread rather than the main thread. This gives us one more example
> > to list in this todo, so that's preferred:
> >
> > https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination
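For illustration only (and distinct from Peter's attached draft, which is
not reproduced here): a coroutine-friendly wait could poll the semaphore
with a zero timeout and yield between attempts, e.g.:

    /* Sketch: wait for postcopy_qemufile_dst_done without blocking the
     * main-loop coroutine.  qemu_sem_timedwait() returns 0 on success and
     * nonzero on timeout; yielding between polls lets pending events
     * (such as the preempt channel's accept()) be dispatched.  A small
     * nonzero timeout may be preferable to avoid spinning. */
    while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 0)) {
        aio_co_schedule(qemu_get_current_aio_context(),
                        qemu_coroutine_self());
        qemu_coroutine_yield();
    }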
> >
> > I attached such a draft patch below, but I'm not sure it'll work. Let
> > me know what you both think of it.
>
> Sadly it doesn't work; there is an unknown segfault.
>
> >
> >>
> >> Another option is to follow the old (i.e. pre-7.2) way of doing
> >> postcopy_preempt_setup in migrate_fd_connect. This saves the above
> >> overhead of switching to the main thread during the downtime. It seems
> >> Peter's previous patch already solved the channel-disordering issue.
> >> Let's see Peter's and others' opinions.
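A rough sketch of that option, reusing the helper named earlier (whether it
can be called this early, and its error handling, are open questions):

    void migrate_fd_connect(MigrationState *s, Error *error_in)
    {
        /* ... existing setup ... */

        /* Pre-7.2-style option: bring up the preempt channel here, off
         * the downtime path, instead of inside postcopy_start(). */
        if (migrate_postcopy_preempt()) {
            postcopy_preempt_establish_channel(s);
        }

        /* ... spawn the migration thread ... */
    }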
> >
> > IIUC we still need that pre-7.2 stuff and keep the postponed connect()
> > to make sure the ordering is done properly. Wei, could you elaborate on
> > the patch you mentioned? Maybe I missed some spots.
> >
> > You raised a good point that this may introduce higher downtime. Did
> > you or Lei try to measure how large it is? If it is too high, we may
> > need to think of another solution, e.g., waiting for the channel
> > connection before the VM stop happens.
>
> Per my very simple test, using post-copy preemption to live migrate an 8G VM:
>
> w/o this patch: 121ms on average over 5 tries
> w/  this patch: 115ms on average over 5 tries
>
> So it seems the overhead introduced is not too high (maybe negligible?).
You could just measure the time for the added qemu_coroutine_yield() part.
That time will depend on how many events happen to be waiting for dispatch
at that moment.
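For example, the added yield could be bracketed with a clock read (a sketch;
the fprintf is just for quick measurement):

    /* Time only the yield/reschedule round trip. */
    int64_t start_us = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
    aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
    qemu_coroutine_yield();
    fprintf(stderr, "LISTEN yield took %" PRId64 " us\n",
            qemu_clock_get_us(QEMU_CLOCK_REALTIME) - start_us);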