From: Dr. David Alan Gilbert
Subject: Re: [PATCH 18/20] migration: Postcopy preemption enablement
Date: Tue, 22 Feb 2022 10:52:23 +0000
User-agent: Mutt/2.1.5 (2021-12-30)

* Peter Xu (peterx@redhat.com) wrote:
> This patch enables the postcopy-preempt feature.
> 
> It contains two major changes to the migration logic:
> 
> (1) Postcopy requests are now sent via a different socket from the precopy
>     background migration stream, so that they are isolated from very high
>     page request delays.
> 
> (2) For huge-page-enabled hosts: when there are postcopy requests, they can
>     now interrupt a partially sent huge host page on the src QEMU.
> 
> After this patch, we'll live migrate a VM with two channels for postcopy: (1)
> PRECOPY channel, which is the default channel that transfers background pages;
> and (2) POSTCOPY channel, which only transfers requested pages.
> 
> There's no strict rule about which channel to use; e.g., if a requested page
> is already being transferred on the precopy channel, we keep using that
> channel to transfer the page even though it was explicitly requested.  In 99%
> of cases, though, we prioritize the channels so that requested pages are sent
> via the postcopy channel whenever possible.
> 
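To make the channel rule concrete, here is a tiny standalone toy model of the
priority scheme (all names are hypothetical stand-ins, not the patch's code;
the real selection logic is postcopy_preempt_choose_channel() in the diff
below):

#include <stdbool.h>
#include <stdio.h>

enum { PRECOPY_CHANNEL, POSTCOPY_CHANNEL };

static int current_channel = PRECOPY_CHANNEL;

/* Switch channels only when the page's priority demands it. */
static void use_channel(int ch)
{
    if (ch != current_channel) {
        current_channel = ch;
        printf("  [switch to %s channel]\n",
               ch == POSTCOPY_CHANNEL ? "postcopy" : "precopy");
    }
}

int main(void)
{
    /* Background pages 0..7; page 5 gets an urgent request mid-stream. */
    for (int page = 0; page < 8; page++) {
        bool urgent = (page == 5);

        use_channel(urgent ? POSTCOPY_CHANNEL : PRECOPY_CHANNEL);
        printf("send page %d\n", page);
    }
    return 0;
}
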
> On the source QEMU, when we find a postcopy request, we'll interrupt the
> PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
> After we have serviced all the high-priority postcopy pages, we'll switch back
> to the PRECOPY channel and continue sending the interrupted huge page.
> There's no new thread introduced on the src QEMU.
> 
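In other words, the source caches where it was cut off and later resumes from
there. Below is a minimal self-contained sketch of that save/restore dance
(hypothetical types and helper names; the real state is the patch's
PostcopyPreemptState further down):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t block;     /* huge page region we were cut off in */
    uint64_t page;      /* offset reached within that region   */
    bool preempted;
} PreemptState;

static PreemptState st;

/* Cache the current position before servicing the urgent pages. */
static void preempt_save(uint64_t block, uint64_t page)
{
    st = (PreemptState) { .block = block, .page = page, .preempted = true };
}

/* One-shot restore: resume the interrupted huge page, then clear the flag. */
static bool preempt_restore(uint64_t *block, uint64_t *page)
{
    if (!st.preempted) {
        return false;
    }
    *block = st.block;
    *page = st.page;
    st.preempted = false;
    return true;
}

int main(void)
{
    uint64_t b = 0, p = 0;

    preempt_save(3, 128);    /* urgent request arrives mid huge page */
    /* ...urgent pages go out on the postcopy channel here... */
    if (preempt_restore(&b, &p)) {
        printf("resume block %" PRIu64 " page %" PRIu64 "\n", b, p);
    }
    return 0;
}
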
> On the destination QEMU, one new thread is introduced to receive page data
> from the postcopy specific socket (done in the preparation patch).
> 
> This patch has a side effect: previously, after sending a postcopy-requested
> page we would assume the guest would access the following pages, so we kept
> sending from there.  Now, instead of carrying on from a postcopy-requested
> page, we go back and continue sending the precopy huge page (which may have
> been interrupted by a postcopy request and therefore partially sent before).
> 
> Whether that's a problem is debatable, because "assuming the guest will
> continue to access the next page" may not really hold when huge pages are
> used, especially if the huge page is large (e.g. 1GB pages).  So the locality
> hint carries little meaning when huge pages are used.
> 
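For a sense of scale, a single 1GB huge page covers 262,144 4KiB target pages.
Here is a standalone illustration of the align-down test used to decide whether
two offsets share a host huge page (the same idea as the patch's
offset_on_same_huge_page() helper below; the sizes are just examples):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Two offsets share a host huge page iff they align down to the same base. */
static bool same_huge_page(uint64_t a, uint64_t b, uint64_t huge)
{
    return (a / huge) == (b / huge);
}

int main(void)
{
    uint64_t target = 4096;                    /* 4KiB target page    */
    uint64_t huge   = 1024ULL * 1024 * 1024;   /* 1GiB host huge page */

    printf("target pages per huge page: %" PRIu64 "\n", huge / target);

    /* Two faults 512MiB apart still land on the same huge page. */
    printf("same huge page: %d\n",
           same_huge_page(0, 512ULL * 1024 * 1024, huge));
    return 0;
}
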
> Signed-off-by: Peter Xu <peterx@redhat.com>


This does get a bit complicated, which worries me; the code here is already
quite complex.
(If you repost, there are a few 'channel' variables that could probably
be 'unsigned' rather than int)

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c  |   2 +
>  migration/migration.h  |   2 +-
>  migration/ram.c        | 247 +++++++++++++++++++++++++++++++++++++++--
>  migration/trace-events |   7 ++
>  4 files changed, 249 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 3d7f897b72..d20db04097 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3153,6 +3153,8 @@ static int postcopy_start(MigrationState *ms)
>                                MIGRATION_STATUS_FAILED);
>      }
>  
> +    trace_postcopy_preempt_enabled(migrate_postcopy_preempt());
> +
>      return ret;
>  
>  fail_closefb:
> diff --git a/migration/migration.h b/migration/migration.h
> index caa910d956..b8aacfe3af 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -68,7 +68,7 @@ typedef struct {
>  struct MigrationIncomingState {
>      QEMUFile *from_src_file;
>      /* Previously received RAM's RAMBlock pointer */
> -    RAMBlock *last_recv_block;
> +    RAMBlock *last_recv_block[RAM_CHANNEL_MAX];
>      /* A hook to allow cleanup at the end of incoming migration */
>      void *transport_data;
>      void (*transport_cleanup)(void *data);
> diff --git a/migration/ram.c b/migration/ram.c
> index 36e3da6bb0..ffbe9a9a50 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -294,6 +294,20 @@ struct RAMSrcPageRequest {
>      QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
>  };
>  
> +typedef struct {
> +    /*
> +     * Cached ramblock/offset values if preempted.  They're only meaningful
> +     * if preempted==true below.
> +     */
> +    RAMBlock *ram_block;
> +    unsigned long ram_page;
> +    /*
> +     * Whether a postcopy preemption just happened.  Will be reset after
> +     * precopy recovers to background migration.
> +     */
> +    bool preempted;
> +} PostcopyPreemptState;
> +
>  /* State of RAM for migration */
>  struct RAMState {
>      /* QEMUFile used for this migration */
> @@ -348,6 +362,14 @@ struct RAMState {
>      /* Queue of outstanding page requests from the destination */
>      QemuMutex src_page_req_mutex;
>      QSIMPLEQ_HEAD(, RAMSrcPageRequest) src_page_requests;
> +
> +    /* Postcopy preemption information */
> +    PostcopyPreemptState postcopy_preempt_state;
> +    /*
> +     * Current channel we're using on src VM.  Only valid if postcopy-preempt
> +     * is enabled.
> +     */
> +    int postcopy_channel;
>  };
>  typedef struct RAMState RAMState;
>  
> @@ -355,6 +377,11 @@ static RAMState *ram_state;
>  
>  static NotifierWithReturnList precopy_notifier_list;
>  
> +static void postcopy_preempt_reset(RAMState *rs)
> +{
> +    memset(&rs->postcopy_preempt_state, 0, sizeof(PostcopyPreemptState));
> +}
> +
>  /* Whether postcopy has queued requests? */
>  static bool postcopy_has_request(RAMState *rs)
>  {
> @@ -1946,6 +1973,55 @@ void ram_write_tracking_stop(void)
>  }
>  #endif /* defined(__linux__) */
>  
> +/*
> + * Check whether two addresses/offsets of the ramblock fall onto the same
> + * host huge page.  Returns true if so, false otherwise.
> + */
> +static bool offset_on_same_huge_page(RAMBlock *rb, uint64_t addr1,
> +                                     uint64_t addr2)
> +{
> +    size_t page_size = qemu_ram_pagesize(rb);
> +
> +    addr1 = ROUND_DOWN(addr1, page_size);
> +    addr2 = ROUND_DOWN(addr2, page_size);
> +
> +    return addr1 == addr2;
> +}
> +
> +/*
> + * Does a previously preempted precopy huge page contain the currently
> + * requested page?  Returns true if so, false otherwise.
> + *
> + * This should happen very rarely, because it means that while we were
> + * sending background pages for postcopy, we were sending exactly the page
> + * that some vcpu faulted on on the dest node.  When it happens, we probably
> + * don't need to do much except drop the request, because we know that right
> + * after we restore the precopy stream it'll be serviced.  It slightly
> + * affects the order in which postcopy requests are serviced (it's the same
> + * as if we moved the current request to the end of the queue), but that
> + * shouldn't be a big deal.  The most important thing is that we can _never_
> + * try to send a partially-sent huge page on the POSTCOPY channel again,
> + * otherwise that huge page will get "split brain" on the two channels
> + * (PRECOPY, POSTCOPY).
> + */
> +static bool postcopy_preempted_contains(RAMState *rs, RAMBlock *block,
> +                                        ram_addr_t offset)
> +{
> +    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
> +
> +    /* No preemption at all? */
> +    if (!state->preempted) {
> +        return false;
> +    }
> +
> +    /* Not even the same ramblock? */
> +    if (state->ram_block != block) {
> +        return false;
> +    }
> +
> +    return offset_on_same_huge_page(block, offset,
> +                                    state->ram_page << TARGET_PAGE_BITS);
> +}
> +
>  /**
>   * get_queued_page: unqueue a page from the postcopy requests
>   *
> @@ -1961,9 +2037,17 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>      RAMBlock  *block;
>      ram_addr_t offset;
>  
> +again:
>      block = unqueue_page(rs, &offset);
>  
> -    if (!block) {
> +    if (block) {
> +        /* See comment above postcopy_preempted_contains() */
> +        if (postcopy_preempted_contains(rs, block, offset)) {
> +            trace_postcopy_preempt_hit(block->idstr, offset);
> +            /* This request is dropped */
> +            goto again;
> +        }
> +    } else {
>          /*
>           * Poll write faults too if background snapshot is enabled; that's
>           * when we have vcpus got blocked by the write protected pages.
> @@ -2179,6 +2263,114 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
>      return ram_save_page(rs, pss);
>  }
>  
> +static bool postcopy_needs_preempt(RAMState *rs, PageSearchStatus *pss)
> +{
> +    /* Eager preempt not enabled?  Then never do that. */
> +    if (!migrate_postcopy_preempt()) {
> +        return false;
> +    }
> +
> +    /* If the ramblock we're sending uses small pages, never bother. */
> +    if (qemu_ram_pagesize(pss->block) == TARGET_PAGE_SIZE) {
> +        return false;
> +    }
> +
> +    /* Not in postcopy at all? */
> +    if (!migration_in_postcopy()) {
> +        return false;
> +    }
> +
> +    /*
> +     * If we're already handling a postcopy request, don't preempt as this
> +     * page has the same high priority.
> +     */
> +    if (pss->postcopy_requested) {
> +        return false;
> +    }
> +
> +    /* If there are postcopy requests, then go check them! */
> +    return postcopy_has_request(rs);
> +}
> +
> +/* Preempt precopy: cache the current PSS so that we can resume it later */
> +static void postcopy_do_preempt(RAMState *rs, PageSearchStatus *pss)
> +{
> +    PostcopyPreemptState *p_state = &rs->postcopy_preempt_state;
> +
> +    trace_postcopy_preempt_triggered(pss->block->idstr, pss->page);
> +
> +    /*
> +     * Time to preempt precopy. Cache current PSS into preempt state, so that
> +     * after handling the postcopy pages we can recover to it.  We need to do
> +     * so because the dest VM will have part of the precopy huge page kept
> +     * over in its tmp huge page cache; better to move on with it when we can.
> +     */
> +    p_state->ram_block = pss->block;
> +    p_state->ram_page = pss->page;
> +    p_state->preempted = true;
> +}
> +
> +/* Whether we were preempted by a postcopy request while sending a huge page */
> +static bool postcopy_preempt_triggered(RAMState *rs)
> +{
> +    return rs->postcopy_preempt_state.preempted;
> +}
> +
> +static void postcopy_preempt_restore(RAMState *rs, PageSearchStatus *pss)
> +{
> +    PostcopyPreemptState *state = &rs->postcopy_preempt_state;
> +
> +    assert(state->preempted);
> +
> +    pss->block = state->ram_block;
> +    pss->page = state->ram_page;
> +    /* This is not a postcopy request but restoring previous precopy */
> +    pss->postcopy_requested = false;
> +
> +    trace_postcopy_preempt_restored(pss->block->idstr, pss->page);
> +
> +    /* Reset preempt state, most importantly, set preempted==false */
> +    postcopy_preempt_reset(rs);
> +}
> +
> +static void postcopy_preempt_choose_channel(RAMState *rs, PageSearchStatus *pss)
> +{
> +    int channel = pss->postcopy_requested ?
> +        RAM_CHANNEL_POSTCOPY : RAM_CHANNEL_PRECOPY;
> +    MigrationState *s = migrate_get_current();
> +    QEMUFile *next;
> +
> +    if (channel != rs->postcopy_channel) {
> +        if (channel == RAM_CHANNEL_PRECOPY) {
> +            next = s->to_dst_file;
> +        } else {
> +            next = s->postcopy_qemufile_src;
> +        }
> +        /* Update and cache the current channel */
> +        rs->f = next;
> +        rs->postcopy_channel = channel;
> +
> +        /*
> +         * If channel switched, reset last_sent_block since the old sent
> +         * block may not be on the same channel.
> +         */
> +        rs->last_sent_block = NULL;
> +
> +        trace_postcopy_preempt_switch_channel(channel);
> +    }
> +
> +    trace_postcopy_preempt_send_host_page(pss->block->idstr, pss->page);
> +}
> +
> +/* We need to make sure rs->f always points to the default channel elsewhere */
> +static void postcopy_preempt_reset_channel(RAMState *rs)
> +{
> +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
> +        rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
> +        rs->f = migrate_get_current()->to_dst_file;
> +        trace_postcopy_preempt_reset_channel();
> +    }
> +}
> +
>  /**
>   * ram_save_host_page: save a whole host page
>   *
> @@ -2210,7 +2402,16 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
>          return 0;
>      }
>  
> +    if (migrate_postcopy_preempt() && migration_in_postcopy()) {
> +        postcopy_preempt_choose_channel(rs, pss);
> +    }
> +
>      do {
> +        if (postcopy_needs_preempt(rs, pss)) {
> +            postcopy_do_preempt(rs, pss);
> +            break;
> +        }
> +
>          /* Check if the page is dirty and if it is send it */
>          if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
>              tmppages = ram_save_target_page(rs, pss);
> @@ -2234,6 +2435,19 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
>      /* The offset we leave with is the min boundary of host page and block */
>      pss->page = MIN(pss->page, hostpage_boundary);
>  
> +    /*
> +     * When in postcopy preempt mode, flush the data as soon as possible for
> +     * postcopy requests, because we've already sent a whole huge page, so the
> +     * dst node should already have enough resources to atomically fill in
> +     * the current missing page.
> +     *
> +     * More importantly, when using a separate postcopy channel, we must do
> +     * an explicit flush or it won't flush until the buffer is full.
> +     */
> +    if (migrate_postcopy_preempt() && pss->postcopy_requested) {
> +        qemu_fflush(rs->f);
> +    }
> +
>      res = ram_save_release_protection(rs, pss, start_page);
>      return (res < 0 ? res : pages);
>  }
> @@ -2275,8 +2489,17 @@ static int ram_find_and_save_block(RAMState *rs)
>          found = get_queued_page(rs, &pss);
>  
>          if (!found) {
> -            /* priority queue empty, so just search for something dirty */
> -            found = find_dirty_block(rs, &pss, &again);
> +            /*
> +             * Recover previous precopy ramblock/offset if postcopy has
> +             * preempted precopy.  Otherwise find the next dirty bit.
> +             */
> +            if (postcopy_preempt_triggered(rs)) {
> +                postcopy_preempt_restore(rs, &pss);
> +                found = true;
> +            } else {
> +                /* priority queue empty, so just search for something dirty */
> +                found = find_dirty_block(rs, &pss, &again);
> +            }
>          }
>  
>          if (found) {
> @@ -2404,6 +2627,8 @@ static void ram_state_reset(RAMState *rs)
>      rs->last_page = 0;
>      rs->last_version = ram_list.version;
>      rs->xbzrle_enabled = false;
> +    postcopy_preempt_reset(rs);
> +    rs->postcopy_channel = RAM_CHANNEL_PRECOPY;
>  }
>  
>  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> @@ -3042,6 +3267,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>      }
>      qemu_mutex_unlock(&rs->bitmap_mutex);
>  
> +    postcopy_preempt_reset_channel(rs);
> +
>      /*
>       * Must occur before EOS (or any QEMUFile operation)
>       * because of RDMA protocol.
> @@ -3111,6 +3338,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>          ram_control_after_iterate(f, RAM_CONTROL_FINISH);
>      }
>  
> +    postcopy_preempt_reset_channel(rs);
> +
>      if (ret >= 0) {
>          multifd_send_sync_main(rs->f);
>          qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> @@ -3193,11 +3422,13 @@ static int load_xbzrle(QEMUFile *f, ram_addr_t addr, void *host)
>   * @mis: the migration incoming state pointer
>   * @f: QEMUFile where to read the data from
>   * @flags: Page flags (mostly to see if it's a continuation of previous block)
> + * @channel: the channel we're using
>   */
>  static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
> -                                              QEMUFile *f, int flags)
> +                                              QEMUFile *f, int flags,
> +                                              int channel)
>  {
> -    RAMBlock *block = mis->last_recv_block;
> +    RAMBlock *block = mis->last_recv_block[channel];
>      char id[256];
>      uint8_t len;
>  
> @@ -3224,7 +3455,7 @@ static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
>          return NULL;
>      }
>  
> -    mis->last_recv_block = block;
> +    mis->last_recv_block[channel] = block;
>  
>      return block;
>  }
> @@ -3678,7 +3909,7 @@ int ram_load_postcopy(QEMUFile *f, int channel)
>          trace_ram_load_postcopy_loop((uint64_t)addr, flags);
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE)) {
> -            block = ram_block_from_stream(mis, f, flags);
> +            block = ram_block_from_stream(mis, f, flags, channel);
>              if (!block) {
>                  ret = -EINVAL;
>                  break;
> @@ -3929,7 +4160,7 @@ static int ram_load_precopy(QEMUFile *f)
>  
>          if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE |
>                       RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE)) {
> -            RAMBlock *block = ram_block_from_stream(mis, f, flags);
> +            RAMBlock *block = ram_block_from_stream(mis, f, flags,
> +                                                    RAM_CHANNEL_PRECOPY);
>  
>              host = host_from_ram_block_offset(block, addr);
>              /*
> diff --git a/migration/trace-events b/migration/trace-events
> index 4474a76614..b1155d9da0 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -111,6 +111,12 @@ ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRI
>  ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>  ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>  unqueue_page(char *block, uint64_t offset, bool dirty) "ramblock '%s' offset 0x%"PRIx64" dirty %d"
> +postcopy_preempt_triggered(char *str, unsigned long page) "during sending ramblock %s offset 0x%lx"
> +postcopy_preempt_restored(char *str, unsigned long page) "ramblock %s offset 0x%lx"
> +postcopy_preempt_hit(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
> +postcopy_preempt_send_host_page(char *str, uint64_t offset) "ramblock %s offset 0x%"PRIx64
> +postcopy_preempt_switch_channel(int channel) "%d"
> +postcopy_preempt_reset_channel(void) ""
>  
>  # multifd.c
>  multifd_new_send_channel_async(uint8_t id) "channel %u"
> @@ -176,6 +182,7 @@ migration_thread_low_pending(uint64_t pending) "%" PRIu64
>  migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>  process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>  process_incoming_migration_co_postcopy_end_main(void) ""
> +postcopy_preempt_enabled(bool value) "%d"
>  
>  # channel.c
>  migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK