[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v4 2/5] migration: Allow network to fail even during recovery
|
From: |
Juan Quintela |
|
Subject: |
Re: [PATCH v4 2/5] migration: Allow network to fail even during recovery |
|
Date: |
Tue, 31 Oct 2023 15:26:42 +0100 |
|
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/28.3 (gnu/linux) |
Peter Xu <peterx@redhat.com> wrote:
> Normally the postcopy recover phase should only exist for a super short
> period, that's the duration when QEMU is trying to recover from an
> interrupted postcopy migration, during which handshake will be carried out
> for continuing the procedure with state changes from PAUSED -> RECOVER ->
> POSTCOPY_ACTIVE again.
>
> Here RECOVER phase should be super small, that happens right after the
> admin specified a new but working network link for QEMU to reconnect to
> dest QEMU.
>
> However there can still be case where the channel is broken in this small
> RECOVER window.
>
> If it happens, with current code there's no way the src QEMU can got kicked
> out of RECOVER stage. No way either to retry the recover in another channel
> when established.
>
> This patch allows the RECOVER phase to fail itself too - we're mostly
> ready, just some small things missing, e.g. properly kick the main
> migration thread out when sleeping on rp_sem when we found that we're at
> RECOVER stage. When this happens, it fails the RECOVER itself, and
> rollback to PAUSED stage. Then the user can retry another round of
> recovery.
>
> To make it even stronger, teach QMP command migrate-pause to explicitly
> kick src/dst QEMU out when needed, so even if for some reason the migration
> thread didn't got kicked out already by a failing rethrn-path thread, the
> admin can also kick it out.
>
> This will be an super, super corner case, but still try to cover that.
>
> One can try to test this with two proxy channels for migration:
>
> (a) socat unix-listen:/tmp/src.sock,reuseaddr,fork tcp:localhost:10000
> (b) socat tcp-listen:10000,reuseaddr,fork unix:/tmp/dst.sock
>
> So the migration channel will be:
>
> (a) (b)
> src -> /tmp/src.sock -> tcp:10000 -> /tmp/dst.sock -> dst
>
> Then to make QEMU hang at RECOVER stage, one can do below:
>
> (1) stop the postcopy using QMP command postcopy-pause
> (2) kill the 2nd proxy (b)
> (3) try to recover the postcopy using /tmp/src.sock on src
> (4) src QEMU will go into RECOVER stage but won't be able to continue
> from there, because the channel is actually broken at (b)
>
> Before this patch, step (4) will make src QEMU stuck in RECOVER stage,
> without a way to kick the QEMU out or continue the postcopy again. After
> this patch, (4) will quickly fail qemu and bounce back to PAUSED stage.
>
> Admin can also kick QEMU from (4) into PAUSED when needed using
> migrate-pause when needed.
>
> After bouncing back to PAUSED stage, one can recover again.
>
> Reported-by: Xiaohui Li <xiaohli@redhat.com>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2111332
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
> -void migration_rp_wait(MigrationState *s)
> +int migration_rp_wait(MigrationState *s)
> {
> + /* If migration has failure already, ignore the wait */
> + if (migrate_has_error(s)) {
> + return -1;
> + }
> +
> qemu_sem_wait(&s->rp_state.rp_sem);
> +
> + /* After wait, double check that there's no failure */
> + if (migrate_has_error(s)) {
> + return -1;
> + }
> +
> + return 0;
> }
Shouldn't this be bool?
We have (too many) functions in migration that returns 0/-1 and set an
error, I think we should change them to return bool. Or even just test
if err is set.
Later, Juan.
- [PATCH v4 0/5] migration: Better error handling in rp thread, allow failures in recover, Peter Xu, 2023/10/17
- [PATCH v4 1/5] migration: Refactor error handling in source return path, Peter Xu, 2023/10/17
- [PATCH v4 2/5] migration: Allow network to fail even during recovery, Peter Xu, 2023/10/17
- Re: [PATCH v4 2/5] migration: Allow network to fail even during recovery,
Juan Quintela <=
- [PATCH v4 3/5] tests/migration-test: Add a test for postcopy hangs during RECOVER, Peter Xu, 2023/10/17
- [PATCH v4 4/5] migration: Change ram_dirty_bitmap_reload() retval to bool, Peter Xu, 2023/10/17
- [PATCH v4 5/5] migration: Change ram_save_queue_pages() retval to bool, Peter Xu, 2023/10/17