From: Lei Li
Subject: Re: [Qemu-devel] [PATCH 13/18] arch_init: adjust ram_save_setup() for migrate_is_localhost
Date: Fri, 23 Aug 2013 17:00:19 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0

On 08/23/2013 03:48 PM, Paolo Bonzini wrote:
> On 23/08/2013 08:25, Lei Li wrote:
>> On 08/21/2013 06:48 PM, Paolo Bonzini wrote:
>>> On 21/08/2013 09:18, Lei Li wrote:
>>>> Send all the ram blocks hooked by save_page, which will copy the
>>>> ram page and MADV_DONTNEED the page just copied.
>>> You should implement this entirely in the hook.
>>>
>>> It will be a little less efficient because of the dirty bitmap
>>> overhead, but you should aim at having *zero* changes in arch_init.c
>>> and migration.c.
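
Just to make the mechanism concrete, per page the hook essentially has to
do the following. This is a standalone sketch using plain POSIX calls; the
function name and parameters are made up for illustration, it is not the
actual QEMU hook signature:

/* Illustration of "copy the page, then MADV_DONTNEED it". Not QEMU code;
 * the function name and parameters are invented for this sketch. */
#include <string.h>
#include <sys/mman.h>

/* Copy one guest page out of 'host_addr' into 'dst', then tell the kernel
 * the source page is no longer needed so its memory can be reclaimed. */
int copy_and_discard_page(void *dst, void *host_addr, size_t page_size)
{
    memcpy(dst, host_addr, page_size);
    /* After this, reads of host_addr return zero-filled pages (for
     * anonymous memory), so it must only happen once the copy is done. */
    return madvise(host_addr, page_size, MADV_DONTNEED);
}
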
>> Yes, the reason I modified migration_thread() with a new flow that sends
>> all the ram pages in the adjusted qemu_savevm_state_begin stage and the
>> device state in the qemu_savevm_device_state stage for localhost migration
>> is to avoid the dirty bitmap, which is a little less efficient, just as
>> you mentioned above.
>>
>> Performance is very important for this feature; our goal is 100ms of
>> downtime for a 1TB guest.
> Do not _start_ by introducing encapsulation violations all over the place.
>
> Juan has been working on optimizing the dirty bitmap code.  His patches
> could provide a speedup of up to a factor of 64, so it is possible that
> his work will help you enough to keep working with the dirty bitmap.
>
> Also, this feature (not looking at the dirty bitmap if the machine is
> stopped) is not limited to localhost migration; add it later, once the
> basic vmsplice plumbing is in place.  This will also let you profile the
> code and understand whether the goal is attainable.
>
> I honestly doubt that 100ms of downtime is possible while the machine is
> stopped.  A 1TB guest has 2^28 = 268*10^6 4KB pages, which you want to
> process in 100*10^6 nanoseconds.  Thus, your approach would require 0.4
> nanoseconds per page, or roughly 2 clock cycles per page.  This is
> impossible without _massive_ parallelization at all levels, starting
> from the kernel.
>
> As a matter of fact, 2^28 madvise system calls will take much, much
> longer than 100ms.
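
To put numbers on this for a given host, here is a rough standalone
microbenchmark (not QEMU code) that measures the per-page cost of
madvise(MADV_DONTNEED) and compares it with the ~0.4 ns/page budget above;
results will of course vary by kernel and hardware:

/* Rough microbenchmark: per-4K-page cost of madvise(MADV_DONTNEED).
 * Illustrative only; sizes are arbitrary. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
    const size_t page = 4096;
    const size_t npages = 1 << 18;              /* 1 GiB worth of pages */
    const size_t len = page * npages;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(buf, 1, len);                        /* fault the pages in */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < npages; i++) {
        madvise(buf + i * page, page, MADV_DONTNEED);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per madvise call; the budget above is ~0.4 ns per page\n",
           ns / npages);
    return 0;
}
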

> Have you thought of using shared memory (with -mem-path) instead of vmsplice?

Precisely!

Well, as Anthony mentioned in version 1[1], there has been some work on the
kernel side by Robert Jennings to improve vmsplice()[2].
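
The sender-side interface itself is just vmsplice() with SPLICE_F_GIFT,
roughly as in the sketch below; the function and its parameters are made up
for illustration, and whether the receiver really gets the page zero-copy
depends on the kernel work referenced in [2]:

/* Sender side of page gifting over a pipe: hand a whole, page-aligned page
 * to the kernel with SPLICE_F_GIFT, then drop our copy. Illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>

int gift_page(int pipe_wr_fd, void *page, size_t page_size)
{
    struct iovec iov = { .iov_base = page, .iov_len = page_size };

    /* SPLICE_F_GIFT tells the kernel it may take ownership of the page. */
    ssize_t n = vmsplice(pipe_wr_fd, &iov, 1, SPLICE_F_GIFT);
    if (n != (ssize_t)page_size) {
        return -1;
    }
    /* Once gifted, discard our mapping so the page is not kept around. */
    return madvise(page, page_size, MADV_DONTNEED);
}
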

And yes, shared memory is an alternative; I think the problem with shared
memory is that it can't share anonymous memory. Maybe Anthony can chime in
on this, as the original idea was his.  :-)
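
For comparison, the model Paolo is suggesting looks roughly like this on the
receiving side: memory backed by a file (tmpfs/hugetlbfs via -mem-path, or
POSIX shared memory) can simply be mapped with MAP_SHARED by another process,
which is exactly what anonymous guest RAM cannot offer after the fact. This
is a generic illustration, not QEMU's -mem-path implementation, and the
object name is made up:

/* Generic POSIX shared-memory mapping: a second process maps the same
 * file-backed region the first one created. Not QEMU code; the object
 * name "/guest-ram-demo" is invented for this sketch. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    const char *name = "/guest-ram-demo";       /* created by the other side */
    int fd = shm_open(name, O_RDWR, 0600);
    if (fd < 0) {
        perror("shm_open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /* MAP_SHARED means both processes see the same physical pages, with no
     * per-page copy, gift, or madvise needed. */
    void *ram = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (ram == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("mapped %lld bytes of shared guest RAM at %p\n",
           (long long)st.st_size, ram);
    return 0;
}
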


Reference links:

[1] Anthony's comments:
  https://lists.gnu.org/archive/html/qemu-devel/2013-06/msg02577.html

[2] vmsplice support for zero-copy gifting of pages:
  http://comments.gmane.org/gmane.linux.kernel.mm/103998


> Paolo



--
Lei



