
[Qemu-devel] [RFC] Kemari for KVM: updates

From: Yoshiaki Tamura
Subject: [Qemu-devel] [RFC] Kemari for KVM: updates
Date: Thu, 04 Feb 2010 14:38:17 +0900
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv: Gecko/20100111 Thunderbird/3.0.1

Hi all,

It has been a while again, and sorry for being less informative recently.  We
have been surveying KVM/QEMU in detail and implementing the prototype of Kemari
for KVM.  We are sending this message to share our status and TODO lists, to get
early feedback and, hopefully, to confirm we're heading in the right direction.

This is a pretty long write-up, so please take a look at the components you're
interested in.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.


We feel pretty confident that policy, the transmission/transaction protocol, and
most of the control logic can be implemented in user-space as suggested by Avi.
That said, some plumbing on the kernel side is likely to be needed too: to
guarantee replayability of certain events and instructions, to integrate the RAS
capabilities of newer x86 hardware with the HA stack, and for optimization
purposes.


== overall status ==

We first implemented Kemari with minimum impact on existing KVM/QEMU by
exploiting the existing save/load framework almost as is, which we call v0.
Unfortunately, we faced some implementation issues in finding dirty pages
quickly, avoiding local buffering of the transfer data, and extending the
save/load framework to repeat transactionally.  Although we took some
workarounds and confirmed that some of the concepts seemed to work, it didn't
meet our expectations.  We're currently working on the revised version as
described below, but for those who're interested in v0, we have prepared a git
tree at the following address.  Please keep in mind that v0 isn't what we're
proposing for review, and much of the implementation has been and will be
changed.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari kemari-v0

The rest of this message describes our status and TODO lists grouped by
topic.  Items marked as DONE have been implemented or surveyed, DOING are what
we're currently working on and planning to post for review by March, and TODO
are not yet tackled or assigned.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which event the
primary should synchronize with the secondary.  The basic assumption here is
that outgoing I/O operations are idempotent, which is usually true for disk I/O
and reliable network protocols such as TCP.  We have implemented this mechanism
in userland for the following items for now.

 - PIO

The items still left are:

 - virtio polling
 - support for asynchronous I/O methods (eventfd)
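To make the ordering concrete, here is a toy sketch of the event tapping idea in
Python.  It is not the QEMU implementation; the class name EventTap and the two
callbacks are hypothetical and exist only for this illustration.  The point is
that the synchronization completes before the outgoing I/O is released, so a
failover after the sync can only cause the same operation to be issued again,
which is harmless under the idempotency assumption above.

```python
from typing import Callable, List


class EventTap:
    """Toy model: synchronize with the secondary before releasing outgoing I/O."""

    def __init__(self,
                 sync_to_secondary: Callable[[], None],
                 emit: Callable[[str], None]) -> None:
        # Both callbacks are hypothetical stand-ins for this sketch:
        #   sync_to_secondary: ships dirty pages and VCPU/device state
        #   emit:              performs the real outgoing I/O
        self._sync = sync_to_secondary
        self._emit = emit

    def on_outgoing_io(self, op: str) -> None:
        # Synchronize first, so the secondary already holds the state that
        # produced `op`.  If the primary fails after the sync, the secondary
        # resumes from that state and issues `op` itself; the operation may
        # therefore be seen twice, which is why it must be idempotent.
        self._sync()
        self._emit(op)


trace: List[str] = []
tap = EventTap(sync_to_secondary=lambda: trace.append("sync"),
               emit=lambda op: trace.append("emit " + op))
tap.on_outgoing_io("pio-write")
tap.on_outgoing_io("pio-write-2")
```

After the two calls, `trace` holds a "sync" entry before each emitted operation.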

=== sender / receiver ===

To synchronize virtual machines, all the pages dirtied since the last
synchronization point and the state of the VCPU and the virtual devices are sent
to the fallback node from the user-space qemu process.

Although we're exploiting the existing savevm/loadvm infrastructure and KVM's
dirty page tracking capabilities, we need some enhancements to implement Kemari.
In particular, we would like to discuss the Kemari transfer protocol/format with
the community here, so please take a look at the document attached to this
message; we would like to get some feedback before we start implementing the
protocol/format.  We also implemented the items below to achieve fast
synchronization.

 - Kemari transfer protocol/format (see attached file)
 - dirty_bitmap scan speed up
 - Bypassing buffered_file
 - Using writev for page transfer

 - Implementing Kemari transfer protocol/format
 - Asynchronous VM transfer / pipelining (needed for SMP)
 - Zero copy VM transfer
 - VM transfer w/ RDMA
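As a rough illustration of two of the items above (a faster dirty bitmap scan
and writev-based page transfer), here is a minimal Python sketch.  It is not the
code in our tree: the bitmap is modeled as a list of 64-bit integers, the guest
RAM as a bytes object, and PAGE_SIZE, dirty_pages() and send_pages() are names
made up for this example.  The scan skips clean words with a single comparison,
and the transfer hands all dirty pages to the kernel in one gather write instead
of one write per page.

```python
import os
from typing import Iterator, List

PAGE_SIZE = 4096   # assumed page size for this sketch
WORD_BITS = 64     # the bitmap is modeled as a list of 64-bit words


def dirty_pages(bitmap: List[int]) -> Iterator[int]:
    """Yield dirty page numbers in ascending order."""
    for word_idx, word in enumerate(bitmap):
        while word:                    # an all-zero word costs one comparison
            low = word & -word         # isolate the lowest set bit
            yield word_idx * WORD_BITS + low.bit_length() - 1
            word ^= low                # clear that bit and continue


def send_pages(fd: int, ram: bytes, pages: List[int]) -> int:
    """Write the given pages to fd with a single writev() call.

    Returns the number of bytes written.  A real sender would also have to
    handle short writes and the per-call limit on the number of buffers.
    """
    view = memoryview(ram)             # slices of a memoryview do not copy
    bufs = [view[p * PAGE_SIZE:(p + 1) * PAGE_SIZE] for p in pages]
    return os.writev(fd, bufs)
```

For example, with a bitmap of `[0b101, 0, 0b10]`, dirty_pages() yields pages 0,
2 and 129, and send_pages() pushes those three pages out in one call.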

=== instruction level replayability ===

We're also investigating the emulation path of KVM/QEMU so that the
synchronization procedure described above keeps the state of the primary and
the secondary consistent even when the guest is executing delicate instructions.
Because this is critical for fault tolerance, we expect to finish this step
ASAP.

 - String PIO
 - rep prefix

 - exceptions, interrupts
 - 16byte MMIO, 32byte MMIO
 - Memory accesses across MMIO boundaries

=== clock ===

Since synchronizing the virtual machines every time the TSC is accessed would be
prohibitive, the transmission of the TSC will be done lazily, which means
delaying it until a non-TSC synchronization point arrives.

 - Synchronization of clock sources (need to intercept TSC reads, etc).
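The lazy transmission described above could look roughly like the following toy
model in Python.  This is only a sketch of the idea, not a design: the class
LazyTscSync and its method names are hypothetical, and how TSC reads are
actually intercepted is left out entirely.

```python
from typing import List, Optional, Tuple


class LazyTscSync:
    """Toy model: remember TSC values, send them only at other sync points."""

    def __init__(self) -> None:
        self.pending_tsc: Optional[int] = None          # observed, not yet sent
        self.sent: List[Tuple[str, Optional[int]]] = []  # what the secondary got

    def on_tsc_read(self, value: int) -> None:
        # No synchronization here; just keep the most recent value.
        self.pending_tsc = value

    def on_sync_point(self, reason: str) -> None:
        # A non-TSC event triggered synchronization, so the pending TSC value
        # (if any) rides along with it.
        self.sent.append((reason, self.pending_tsc))
        self.pending_tsc = None


clock = LazyTscSync()
clock.on_tsc_read(1000)
clock.on_tsc_read(1500)          # still nothing transmitted
clock.on_sync_point("pio")       # the latest TSC value is sent with this sync
```

Here two TSC reads result in a single transmission, attached to the PIO-driven
synchronization.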

=== usability ===

These are items that define how users interact with Kemari.

 - Qemu monitor command to enable/disable Kemari.

 - Kemarid daemon that takes care of the cluster management/monitoring
   side of things.
 - Some device emulators might need minor modifications to work well
   with Kemari.  Use white(black)-listing to take the burden of
   choosing the right device model off the users.

=== integration with HA stack (Pacemaker/Corosync) ===

The failover process kicks in whenever a failure in the primary node is
detected.  For Kemari for Xen, we have already finished the RA for Heartbeat,
and we are planning to integrate Kemari for KVM with the new HA stacks
(Pacemaker, RHCS, etc).

Ideally, we would like to leverage the hardware failure detection
capabilities of newish x86 hardware to trigger failover, the idea
being that transferring control to the fallback node proactively
when a problem is detected is much faster than relying on the polling
mechanisms used by most HA software.

 - RA for Pacemaker.
 - Consider both HW failure and SW failure scenarios (failover
   between Kemari clusters).
 - Make the necessary changes to Pacemaker/Corosync to support
   event(HW failure, etc)-driven failover.
 - Take advantage of the RAS capabilities of newer CPUs/motherboards
   such as MCE to trigger failover.
   * KVM: need to make sure that MCE and other errors of that ilk are
     not injected to the guest and that all the CPUs are in a quiesced
     state before notifying user-space.
   * RA: there are two implementation alternatives; either
     sys_poll()'ing on /dev/mcelog or leveraging mced.
 - Detect failures in I/O devices (block I/O errors, etc).
   * KVM/qemu: extend QMP's asynchronous events infrastructure to trap
     I/O errors (storage, network, etc).
   * RA: use QMP to connect to qemu and register an event notification
     callback for the I/O errors above (this can be done using the
     current libvirt API).
 - Integration with Kemarid (see usability above).
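For the first RA alternative in the MCE item above (poll()'ing on /dev/mcelog),
the waiting part might be sketched as below.  This is a minimal Python example
under the assumption that the device becomes readable when a new record is
logged; the helper name wait_for_event() is made up for this sketch, and
parsing the record and actually triggering failover are not shown.

```python
import os
import select


def wait_for_event(fd: int, timeout_ms: int) -> bool:
    """Return True if fd becomes readable within timeout_ms milliseconds."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # poll() returns a list of (fd, event) pairs; it is empty on timeout.
    return bool(poller.poll(timeout_ms))


def watch(path: str, timeout_ms: int) -> bool:
    """Open `path` read-only and wait once for it to become readable."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return wait_for_event(fd, timeout_ms)
    finally:
        os.close(fd)
```

An RA could call something like `watch("/dev/mcelog", 1000)` in a loop and start
failover when it returns True, instead of relying on periodic health checks.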

=== storage ===

Although Kemari needs some kind of shared storage, many users don't like shared
storage and expect to use Kemari in conjunction with software storage
replication.

 - Integration with other non-shared disk cluster storage solutions
   such as DRBD (might need changes to guarantee storage data
   consistency at Kemari synchronization points).
 - Integration with qemu's block live migration functionality for
   non-shared disk configurations.

=== optimizations ===

Although the big picture can be realized by completing the TODO list above, we
need some optimizations/enhancements to make Kemari useful in the real world,
and these are the items that need to be done for that.

 - SMP (for the sake of performance might need to implement a
   synchronization protocol that can maintain two or more
   synchronization points active at any given moment)
 - VGA (leverage VNC's subtiling mechanism to identify fb pages that
   are really dirty).


Any comments and suggestions would be greatly appreciated.



Attachment: kemari_sender_receiver-0.2a.pdf
Description: Adobe PDF document
