qemu-devel

Re: [PATCH v2 00/25] migration: Postcopy Preemption


From: Dr. David Alan Gilbert
Subject: Re: [PATCH v2 00/25] migration: Postcopy Preemption
Date: Wed, 2 Mar 2022 12:14:30 +0000
User-agent: Mutt/2.1.5 (2021-12-30)

* Peter Xu (peterx@redhat.com) wrote:
> This is v2 of postcopy preempt series.  It can also be found here:
> 
>   https://github.com/xzpeter/qemu/tree/postcopy-preempt
> 
> RFC: https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com
> V1:  https://lore.kernel.org/qemu-devel/20220216062809.57179-1-peterx@redhat.com

I've queued some of this:

tests: Pass in MigrateStart** into test_migrate_start()
migration: Add migration_incoming_transport_cleanup()
migration: postcopy_pause_fault_thread() never fails
migration: Enlarge postcopy recovery to capture !-EIO too
migration: Move static var in ram_block_from_stream() into global
migration: Add postcopy_thread_create()
migration: Dump ramblock and offset too when non-same-page detected
migration: Introduce postcopy channels on dest node
migration: Tracepoint change in postcopy-run bottom half
migration: Finer grained tracepoints for POSTCOPY_LISTEN
migration: Dump sub-cmd name in loadvm_process_command tp

> v1->v2 changelog:
> - Picked up more r-bs from Dave
> - Rename both fault threads to drop "qemu/" prefix [Dave]
> - Further rework on postcopy recovery, to be able to detect qemufile errors
>   from either main channel or postcopy one [Dave]
> - shutdown() qemufile before close on src postcopy channel when postcopy is
>   paused [Dave]
> - In postcopy_preempt_new_channel(), explicitly set the new channel in
>   blocking state, even if it's the default [Dave]
> - Make RAMState.postcopy_channel unsigned int [Dave]
> - Added patches:
>   - "migration: Create the postcopy preempt channel asynchronously"
>   - "migration: Parameter x-postcopy-preempt-break-huge"
>   - "migration: Add helpers to detect TLS capability"
>   - "migration: Fail postcopy preempt with TLS for now"
>   - "tests: Pass in MigrateStart** into test_migrate_start()"
> 
> Abstract
> ========
> 
> This series adds a new migration capability called "postcopy-preempt".  It
> can be enabled when postcopy is enabled, and it simply (but greatly) speeds
> up the handling of postcopy page requests.
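
For anyone wanting to try this, a minimal sketch of the QMP message that
would turn the capability on.  migrate-set-capabilities is the existing QMP
command; "postcopy-preempt" is the name from this cover letter.  Treating
"postcopy is enabled" as the existing "postcopy-ram" capability, and
needing it on both source and destination, are my assumptions.

```python
import json


def enable_postcopy_preempt_cmds():
    """Build the QMP command enabling postcopy plus postcopy-preempt.

    Assumption: "postcopy-ram" is the prerequisite capability.
    """
    caps = [
        {"capability": "postcopy-ram", "state": True},
        {"capability": "postcopy-preempt", "state": True},
    ]
    return [{"execute": "migrate-set-capabilities",
             "arguments": {"capabilities": caps}}]


# Print the JSON that would be written to a QMP monitor socket.
for cmd in enable_postcopy_preempt_cmds():
    print(json.dumps(cmd))
```

The same message would presumably be sent to both QEMU instances before
starting the migration.
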
> 
> Below are some initial postcopy page request latency measurements with the
> new series applied.
> 
> For each page size, I measured page request latency for three cases:
> 
>   (a) Vanilla:                the old postcopy
>   (b) Preempt no-break-huge:  preempt enabled, x-postcopy-preempt-break-huge=off
>   (c) Preempt full:           preempt enabled, x-postcopy-preempt-break-huge=on
>                               (this is the default option when preempt enabled)
> 
> Here the x-postcopy-preempt-break-huge parameter is newly added in v2 so
> that breaking up the sending of a precopy huge page can be conditionally
> disabled for debugging purposes.  When it's off, postcopy will not preempt
> precopy while it is sending a huge page, but postcopy will still use its
> own channel.
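
To make the on/off difference concrete, here is a toy model (not QEMU
code).  It assumes the single source sender only looks for urgent requests
at "unit" boundaries: every 4K when break-huge is on, and only at the end of
the huge page when it is off.  The function name and the 4K/2M sizes are
illustrative assumptions.

```python
SMALL = 4 * 1024            # assumed small-page granularity
HUGE = 2 * 1024 * 1024      # assumed huge page size


def bytes_before_urgent(request_at, break_huge, huge=HUGE, small=SMALL):
    """Precopy bytes still sent after an urgent request arrives.

    request_at is the byte offset inside the in-flight huge page at which
    the urgent page request shows up.
    """
    unit = small if break_huge else huge
    # Round up to the next boundary where the sender re-checks the queue.
    next_boundary = -(-request_at // unit) * unit
    return min(next_boundary, huge) - request_at


for flag in (True, False):
    print(f"break_huge={flag}: {bytes_before_urgent(1000, flag)} bytes")
```

In this model a request arriving 1000 bytes into a 2M page waits for about
3K of precopy data with break-huge on, versus nearly the whole 2M with it
off.
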
> 
> I tested them separately to give a rough idea of how much each part of the
> change helps.  The overall benefit is the comparison between cases (a)
> and (c).
> 
>   |-----------+---------+-----------------------+--------------|
>   | Page size | Vanilla | Preempt no-break-huge | Preempt full |
>   |-----------+---------+-----------------------+--------------|
>   | 4K        |   10.68 |               N/A [*] |         0.57 |
>   | 2M        |   10.58 |                  5.49 |         5.02 |
>   | 1G        | 2046.65 |               933.185 |      649.445 |
>   |-----------+---------+-----------------------+--------------|
>   [*]: This case is N/A because with 4K pages there are no huge pages at all
> 
> [1] https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf
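
The ratios implied by the table can be worked out directly.  A small
sketch using only the numbers quoted above (the table gives no unit, but
the ratios do not depend on one); the dictionary keys are my own labels.

```python
# Latencies copied from the table above; unit as in the original table.
latency = {
    "4K": {"vanilla": 10.68, "full": 0.57},
    "2M": {"vanilla": 10.58, "no_break_huge": 5.49, "full": 5.02},
    "1G": {"vanilla": 2046.65, "no_break_huge": 933.185, "full": 649.445},
}


def speedup(row, case="full"):
    """How many times lower the latency is compared with vanilla."""
    return row["vanilla"] / row[case]


for size, row in latency.items():
    print(f"{size}: {speedup(row):.1f}x lower latency with preempt full")
```

That works out to roughly 18.7x for 4K, 2.1x for 2M and 3.2x for 1G when
comparing (a) against (c).
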
> 
> TODO List
> =========
> 
> TLS support
> -----------
> 
> I only noticed it's missing very recently.  Since soft freeze is coming, and
> obviously I'm still growing this series, I'd like to have the existing
> material discussed first.  Let's see if it can still catch the train for the
> QEMU 7.0 release (soft freeze on 2022-03-08).
> 
> Avoid precopy write() blocking postcopy
> ---------------------------------------
> 
> I haven't proven this, but I suspect that write() syscalls blocking on
> precopy pages can affect postcopy servicing.  If we can solve this
> problem then my wild guess is that we can further reduce the average page
> latency.
> 
> At least two solutions come to mind: (1) we could make the write side of
> the migration channel NON_BLOCK too, or (2) use multiple threads on the send
> side, just like multifd, but with a lock to protect the choice of which page
> to send (e.g., the core idea is we should _never_ rely on the main thread
> for anything; multifd has that dependency, since it queues pages only on
> the main thread).
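
A rough illustration of option (2), purely as a sketch and not how QEMU or
multifd is structured: a lock-protected queue that any sender thread can
pull from, with urgent (faulted) pages always ahead of background precopy
pages.  PageQueue, URGENT and BACKGROUND are hypothetical names.

```python
import heapq
import itertools
import threading

URGENT, BACKGROUND = 0, 1   # lower value is served first


class PageQueue:
    """Lock-protected priority queue shared by all sender threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._heap = []
        self._seq = itertools.count()   # keeps FIFO order within a priority

    def put(self, page, priority=BACKGROUND):
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._seq), page))

    def get(self):
        """Return the next page to send, or None if nothing is queued."""
        with self._lock:
            if not self._heap:
                return None
            return heapq.heappop(self._heap)[2]


q = PageQueue()
q.put("bg-0")
q.put("bg-1")
q.put("fault-7", URGENT)    # a page a vCPU is waiting on
print(q.get())              # the urgent page jumps the queue
```

Because the lock guards the choice of the next page, no single thread has
to own the queue, which is the dependency the paragraph above wants to
avoid.
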
> 
> That can definitely be done and thought about later.
> 
> Multi-channel for preemption threads
> ------------------------------------
> 
> Currently the postcopy preempt feature uses only one extra channel and one
> extra thread on dest (no new thread on src QEMU).  It should be good enough
> for most use cases, but when the postcopy queue is long enough
> (e.g. hundreds of vCPUs faulted on different pages) logically we could
> still observe more delay on average.  Whether growing threads/channels can
> solve it is debatable, but it sounds worth a try.  That's yet another
> thing we can think about after this patchset lands.
> 
> Logically the design leaves room for that - the receiving postcopy
> preempt thread understands the whole ram-layer migration protocol, and for
> multiple channels and threads we could simply grow that into multiple
> threads handling the same protocol (with multiple PostcopyTmpPage).  The
> source needs more thought on synchronization, though, but that shouldn't
> affect the protocol layer as a whole, so it should be easy to keep
> compatible.
> 
> Patch Layout
> ============
> 
> Patch 1-3: Three leftover patches from patchset "[PATCH v3 0/8] migration:
> Postcopy cleanup on ram disgard" that I picked up here too.
> 
>   https://lore.kernel.org/qemu-devel/20211224065000.97572-1-peterx@redhat.com/
> 
>   migration: Dump sub-cmd name in loadvm_process_command tp
>   migration: Finer grained tracepoints for POSTCOPY_LISTEN
>   migration: Tracepoint change in postcopy-run bottom half
> 
> Patch 4-9: Original postcopy preempt RFC preparation patches (with slight
> modifications).
> 
>   migration: Introduce postcopy channels on dest node
>   migration: Dump ramblock and offset too when non-same-page detected
>   migration: Add postcopy_thread_create()
>   migration: Move static var in ram_block_from_stream() into global
>   migration: Add pss.postcopy_requested status
>   migration: Move migrate_allow_multifd and helpers into migration.c
> 
> Patch 10-15: Some newly added patches from working on postcopy recovery
> support.  After these patches the migrate-recover command allows re-entrance,
> which is a very nice side effect.
> 
>   migration: Enlarge postcopy recovery to capture !-EIO too
>   migration: postcopy_pause_fault_thread() never fails
>   migration: Export ram_load_postcopy()
>   migration: Move channel setup out of postcopy_try_recover()
>   migration: Add migration_incoming_transport_cleanup()
>   migration: Allow migrate-recover to run multiple times
> 
> Patch 16-19: The major work of postcopy preemption implementation is split
> into four patches as suggested by Dave.
> 
>   migration: Add postcopy-preempt capability
>   migration: Postcopy preemption preparation on channel creation
>   migration: Postcopy preemption enablement
>   migration: Postcopy recover with preempt enabled
> 
> Patch 20-23: Newly added patches in this v2 for different purposes.
>              Mostly amendments to the existing postcopy preempt work.
> 
>   migration: Create the postcopy preempt channel asynchronously
>   migration: Parameter x-postcopy-preempt-break-huge
>   migration: Add helpers to detect TLS capability
>   migration: Fail postcopy preempt with TLS for now
> 
> Patch 24-25: Test cases (including one more patch for cleanup)
> 
>   tests: Add postcopy preempt test
>   tests: Pass in MigrateStart** into test_migrate_start()
> 
> Please review, thanks.
> 
> Peter Xu (25):
>   migration: Dump sub-cmd name in loadvm_process_command tp
>   migration: Finer grained tracepoints for POSTCOPY_LISTEN
>   migration: Tracepoint change in postcopy-run bottom half
>   migration: Introduce postcopy channels on dest node
>   migration: Dump ramblock and offset too when non-same-page detected
>   migration: Add postcopy_thread_create()
>   migration: Move static var in ram_block_from_stream() into global
>   migration: Add pss.postcopy_requested status
>   migration: Move migrate_allow_multifd and helpers into migration.c
>   migration: Enlarge postcopy recovery to capture !-EIO too
>   migration: postcopy_pause_fault_thread() never fails
>   migration: Export ram_load_postcopy()
>   migration: Move channel setup out of postcopy_try_recover()
>   migration: Add migration_incoming_transport_cleanup()
>   migration: Allow migrate-recover to run multiple times
>   migration: Add postcopy-preempt capability
>   migration: Postcopy preemption preparation on channel creation
>   migration: Postcopy preemption enablement
>   migration: Postcopy recover with preempt enabled
>   migration: Create the postcopy preempt channel asynchronously
>   migration: Parameter x-postcopy-preempt-break-huge
>   migration: Add helpers to detect TLS capability
>   migration: Fail postcopy preempt with TLS for now
>   tests: Add postcopy preempt test
>   tests: Pass in MigrateStart** into test_migrate_start()
> 
>  migration/channel.c          |  10 +-
>  migration/migration.c        | 235 ++++++++++++++++++++-----
>  migration/migration.h        |  98 ++++++++++-
>  migration/multifd.c          |  26 +--
>  migration/multifd.h          |   2 -
>  migration/postcopy-ram.c     | 244 +++++++++++++++++++++-----
>  migration/postcopy-ram.h     |  15 ++
>  migration/qemu-file.c        |  27 +++
>  migration/qemu-file.h        |   1 +
>  migration/ram.c              | 330 +++++++++++++++++++++++++++++++----
>  migration/ram.h              |   3 +
>  migration/savevm.c           |  70 ++++++--
>  migration/socket.c           |  22 ++-
>  migration/socket.h           |   1 +
>  migration/trace-events       |  19 +-
>  qapi/migration.json          |   8 +-
>  tests/qtest/migration-test.c |  68 ++++++--
>  17 files changed, 983 insertions(+), 196 deletions(-)
> 
> -- 
> 2.32.0
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



