Re: [Qemu-devel] Fwd: [PATCH v2 00/41] postcopy live migration

From: Chegu Vinod
Subject: Re: [Qemu-devel] Fwd: [PATCH v2 00/41] postcopy live migration
Date: Mon, 04 Jun 2012 07:27:25 -0700
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20120428 Thunderbird/12.0.1

On 6/4/2012 6:13 AM, Isaku Yamahata wrote:
On Mon, Jun 04, 2012 at 05:01:30AM -0700, Chegu Vinod wrote:
Hello Isaku Yamahata,

I just saw your patches. Would it be possible to email me a tar bundle of these
patches (makes it easier to apply the patches to a copy of the upstream 
I uploaded them to github for those who are interested in it.

git://github.com/yamahata/qemu.git qemu-postcopy-june-04-2012
git://github.com/yamahata/linux-umem.git  linux-umem-june-04-2012

Thanks for the pointer...
BTW, I am also curious whether you have considered using any kind of RDMA features
for optimizing the page faults during postcopy?
Yes, RDMA is an interesting topic. Can you share your use case/concerns/issues?

I am looking at large guests (256GB and higher) running CPU/memory-intensive enterprise workloads. The concerns are the same, i.e. having a predictable total migration time, minimal downtime/freeze time and, of course, minimal service degradation to the workload(s) in the VM or the co-located VMs.

How large a guest have you tested your changes with, and what kind of workloads have you used so far?

Thus we can collaborate.
You may want to see Benoit's results.

Yes, I have already seen some of Benoit's results.

Hence the question about the use of RDMA techniques for postcopy.

As far as I know, he has not published
his code yet.





After a long time, we have v2. This is the qemu part.
The linux kernel part is sent separately.

Changes v1 -> v2:
- split up patches for review
- buffered file refactored
- many bug fixes
   Especially, PV drivers can now work with postcopy
- optimization/heuristic

1 - 30: refactoring existing code and preparation
31 - 37: implement postcopy itself (essential part)
38 - 41: some optimization/heuristic for postcopy

This patch series implements postcopy live migration.[1]
As discussed at KVM forum 2011, a dedicated character device is used for
distributed shared memory between migration source and destination.
Now we can discuss/benchmark/compare with precopy. I believe there is
much room for improvement.

[1] http://wiki.qemu.org/Features/PostCopyLiveMigration

You need to load the umem character device on the host before starting migration.
Postcopy can be used with both the tcg and kvm accelerators. The implementation
depends only on the linux umem character device, but the driver-dependent code
is split into a file.
I tested only the host page size == guest page size case, but the implementation
allows the host page size != guest page size case.
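
To illustrate what the page size mismatch means in practice, here is a small
Python sketch of the arithmetic only. The function name
`target_pages_for_fault` and the idea of mapping one faulting host page to the
guest (target) page indexes it overlaps are assumptions for illustration, not
taken from the patches.

```python
def target_pages_for_fault(addr, host_page_size, target_page_size):
    """Return the guest (target) page indexes overlapped by the host page
    that contains byte offset `addr`.

    Hypothetical helper: it only demonstrates the size arithmetic.
    """
    # Round the faulting address down to the start of its host page.
    host_start = (addr // host_page_size) * host_page_size
    host_end = host_start + host_page_size  # exclusive end
    first = host_start // target_page_size
    last = (host_end - 1) // target_page_size
    return list(range(first, last + 1))
```

With equal page sizes this returns exactly one index; with a 64KB host page
and 4KB guest pages, one host fault covers 16 guest pages.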

The following options are added with this patch series.
- incoming part
   command line options
   -postcopy [-postcopy-flags <flags>]
   where flags is for changing behavior for benchmark/debugging
   Currently the following flags are available
   0: default
   1: enable touching page request

   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm

- outgoing part
   options for migrate command
   migrate [-p [-n] [-m]] URI [<prefault forward> [<prefault backward>]]
   -p: indicate postcopy migration
   -n: disable background transfer of pages; this is for benchmark/debugging
   -m: move background transfer of postcopy mode
   <prefault forward>: The number of forward pages which are sent with on-demand
   <prefault backward>: The number of backward pages which are sent with
                        on-demand

   migrate -p -n tcp:<dest ip address>:4444
   migrate -p -n -m tcp:<dest ip address>:4444 32 0
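
As a rough illustration of the two prefault parameters, the following Python
sketch computes which page indexes might accompany one faulted page. The
function name `prefault_window`, the ordering of the result, and the clamping
to the RAM block bounds are assumptions for illustration; the actual selection
logic is in the postcopy/outgoing patches and may differ.

```python
def prefault_window(fault_page, forward, backward, nr_pages):
    """Return page indexes to send for one on-demand request.

    Hypothetical helper: the faulted page comes first, then up to
    `forward` following pages, then up to `backward` preceding pages,
    all clamped to the range [0, nr_pages).
    """
    if not 0 <= fault_page < nr_pages:
        raise ValueError("fault_page out of range")
    pages = [fault_page]
    # Pages after the faulted one (prefault forward).
    last = min(fault_page + forward, nr_pages - 1)
    pages.extend(range(fault_page + 1, last + 1))
    # Pages before the faulted one (prefault backward), nearest first.
    first = max(fault_page - backward, 0)
    pages.extend(range(fault_page - 1, first - 1, -1))
    return pages
```

Under these assumptions, values of 32 forward and 0 backward with a fault on
page 10 would select pages 10 through 42.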

TODO
- benchmark/evaluation, especially how async page fault affects the result.
- improve/optimization
   At the moment at least what I'm aware of is
   - making incoming socket non-blocking with thread
     As page compression is coming, it is impractical to do non-blocking reads
     and check whether the necessary data has been read.
   - touching pages in incoming qemu process by fd handler seems suboptimal.
     creating dedicated thread?
   - outgoing handler seems suboptimal causing latency.
- consider the possibility of using FUSE/CUSE
- don't fork umemd, but create thread?

basic postcopy work flow
         qemu on the destination
         Here we have two file descriptors to
         umem device and shmem file
               |                                  umemd
               |                                  daemon on the destination
               V    create pipe to communicate
               |                                      |
               V                                      |
         close(socket)                                V
         close(shmem)                              mmap(shmem file)
               |                                      |
               V                                      V
         mmap(umem device) for guest RAM           close(shmem file)
               |                                      |
         close(umem device)                           |
               |                                      |
               V                                      |
         wait for ready from daemon<----pipe-----send ready message
               |                                      |
               |                                 Here the daemon takes over
         send ok------------pipe--------------->   the owner of the socket
               |                                        to the source
               V                                      |
         entering post copy stage                     |
         start guest execution                        |
               |                                      |
               V                                      V
         access guest RAM                          read() to get faulted pages
               |                                      |
               V                                      V
         page fault ------------------------------>page offset is returned
         block                                        |
                                                   pull page from the source
                                                   write the page contents
                                                   to the shmem.
         unblock<-----------------------------write() to tell served pages
         the fault handler returns the page
         page fault is resolved
               |                                   pages can be sent
               |                                   in the background
               |                                      |
               |                                      V
               |                                   write()
               |                                      |
               V                                      V
         The specified pages<-----pipe------------request to touch pages
         are made present by                          |
         touching guest RAM.                          |
               |                                      |
               V                                      V
              reply-------------pipe------------->   release the cached page
               |                                   madvise(MADV_REMOVE)
               |                                      |
               V                                      V

                  all the pages are pulled from the source

               |                                      |
               V                                      V
         the vma becomes anonymous<----------------UMEM_MAKE_VMA_ANONYMOUS
        (note: I'm not sure if this can be implemented or not)
               |                                      |
               V                                      V
         migration completes                        exit()
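
The on-demand part of this work flow can be modelled with a toy Python sketch.
This is a simulation under stated assumptions, not the real mechanism: queues
stand in for read()/write() on the umem device, dictionaries stand in for
source RAM and the shmem file, and the names `umemd`, `guest_access`,
`run_demo` and `PAGE_SIZE` are hypothetical.

```python
import queue
import threading

PAGE_SIZE = 4096  # assumed page size for the toy model


def umemd(source_ram, shmem, faults, served):
    """Toy stand-in for the umemd daemon on the destination."""
    while True:
        offset = faults.get()        # models read() to get faulted pages
        if offset is None:           # sentinel: stop the daemon
            break
        page = source_ram[offset]    # models pulling the page from the source
        shmem[offset] = page         # models writing the contents to shmem
        served.put(offset)           # models write() to tell served pages


def guest_access(offset, shmem, faults, served):
    """Toy stand-in for a guest RAM access that may fault."""
    if offset not in shmem:          # page not present yet: page fault
        faults.put(offset)
        done = served.get()          # block until the daemon has served it
        assert done == offset
    return shmem[offset]


def run_demo():
    source_ram = {i * PAGE_SIZE: bytes([i]) * PAGE_SIZE for i in range(4)}
    shmem = {}
    faults, served = queue.Queue(), queue.Queue()
    daemon = threading.Thread(target=umemd,
                              args=(source_ram, shmem, faults, served))
    daemon.start()
    data = guest_access(2 * PAGE_SIZE, shmem, faults, served)   # faults once
    again = guest_access(2 * PAGE_SIZE, shmem, faults, served)  # already present
    faults.put(None)
    daemon.join()
    return data == again == source_ram[2 * PAGE_SIZE], sorted(shmem)
```

Only the page that was actually touched ends up in the shmem model, which is
the point of postcopy: pages are pulled when the guest needs them, and the
rest can follow in the background.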

Isaku Yamahata (41):
   arch_init: export sort_ram_list() and ram_save_block()
   arch_init: export RAM_SAVE_xxx flags for postcopy
   arch_init/ram_save: introduce constant for ram save version = 4
   arch_init: refactor host_from_stream_offset()
   arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
   arch_init: refactor ram_save_block()
   arch_init/ram_save_live: factor out ram_save_limit
   arch_init/ram_load: refactor ram_load
   arch_init: introduce helper function to find ram block with id string
   arch_init: simplify a bit by ram_find_block()
   arch_init: factor out counting transferred bytes
   arch_init: factor out setting last_block, last_offset
   exec.c: factor out qemu_get_ram_ptr()
   exec.c: export last_ram_offset()
   savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
   savevm: qemu_pending_size() to return pending buffered size
   savevm, buffered_file: introduce method to drain buffer of buffered
   QEMUFile: add qemu_file_fd() for later use
   savevm/QEMUFile: drop qemu_stdio_fd
   savevm/QEMUFileSocket: drop duplicated member fd
   savevm: rename QEMUFileSocket to QEMUFileFD, socket_close to fd_close
   savevm/QEMUFile: introduce qemu_fopen_fd
   migration.c: remove redundant line in migrate_init()
   migration: export migrate_fd_completed() and migrate_fd_cleanup()
   migration: factor out parameters into MigrationParams
   buffered_file: factor out buffer management logic
   buffered_file: Introduce QEMUFileNonblock for nonblock write
   buffered_file: add qemu_file to read/write to buffer in memory
   umem.h: import Linux umem.h
   update-linux-headers.sh: teach umem.h to update-linux-headers.sh
   configure: add CONFIG_POSTCOPY option
   savevm: add new section that is used by postcopy
   postcopy: introduce -postcopy and -postcopy-flags option
   postcopy outgoing: add -p and -n option to migrate command
   postcopy: introduce helper functions for postcopy
   postcopy: implement incoming part of postcopy live migration
   postcopy: implement outgoing part of postcopy live migration
   postcopy/outgoing: add forward, backward option to specify the size
     of prefault
   postcopy/outgoing: implement prefault
   migrate: add -m (movebg) option to migrate command
   migration/postcopy: add movebg mode

