It has been a while coming, but we have finally started work on
Kemari's port to KVM. For those not familiar with it, Kemari provides
the basic building block to create a virtualization-based fault
tolerant machine: a virtual machine synchronization mechanism.
Traditional high availability solutions can be classified into two
groups: fault tolerant servers and software clustering.
Broadly speaking, fault tolerant servers protect us against hardware
failures and generally rely on redundant hardware (often proprietary)
and hardware failure detection to trigger fail-over.
Software clustering, on the other hand, takes care of software
failures, as its name indicates. It usually requires a standby server
whose software configuration, for the part we are trying to make fault
tolerant, must be identical to that of the active server.
Both solutions may be applied to virtualized environments. Indeed,
the current incarnation of Kemari (Xen-based) brings fault tolerant
server-like capabilities to virtual machines and integration with
existing HA stacks (Heartbeat, RHCS, etc.) is under consideration.
After some time on the drawing board we completed the basic design of
Kemari for KVM, so we are sending an RFC at this point to get early
feedback and, hopefully, get things right from the start. Those
already familiar with Kemari and/or fault tolerance may want to skip
the "Background" section and go directly to the design and
implementation sections.
This is a pretty long write-up, but please bear with me.
== Background ==
We started to play around with continuous virtual synchronization
technology about 3 years ago. As development progressed and, most
importantly, we got the first Xen-based working prototypes, it became
clear that we needed a proper name for our toy: Kemari.
The goal of Kemari is to provide a fault tolerant platform for
virtualization environments, so that in the event of a hardware
failure the virtual machine fails over from compromised to properly
operating hardware (a physical machine) in a way that is completely
transparent to the guest operating system.
Although hardware-based fault tolerant servers and HA servers
(software clustering) have been around for a long while, they
typically require specifically designed hardware and/or modifications
to applications. In contrast, by abstracting hardware using
virtualization, Kemari can be used on off-the-shelf hardware and no
application modifications are needed.
After a period of in-house development the first version of Kemari for
Xen was released in Nov 2008 as open source. However, by then it was
already pretty clear that a KVM port would have several
advantages. First, KVM is integrated into the Linux kernel, which
means one gets support for a wide variety of hardware for
free. Second, and in the same vein, KVM can also benefit from Linux's
low latency networking capabilities, including RDMA, which is of
paramount importance for an extremely latency-sensitive functionality
like Kemari. Last but not least, KVM and its community are growing
rapidly, and there is increasing demand for Kemari-like functionality.
Although the basic design principles will remain the same, our plan is
to write Kemari for KVM from scratch, since there does not seem to be
much opportunity for sharing between Xen and KVM.
== Design outline ==
The basic premise of fault tolerant servers is that when things go
awry with the hardware the running system should transparently
continue execution on an alternate physical host. For this to be
possible the state of the fallback host has to be identical to that of
the primary.
Kemari runs paired virtual machines in an active-passive configuration
and achieves whole-system replication by continuously copying the
state of the system (dirty pages and the state of the virtual devices)
from the active node to the passive node. An interesting implication
of this is that during normal operation only the active node is
actually executing code.
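
The active-passive replication described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kemari or KVM code: the `VirtualMachine` class, the `checkpoint()` function, and the dictionary-based memory and device state are all hypothetical names invented for illustration. It shows that only the active side runs and dirties pages, while the passive side merely receives copies.

```python
import copy


class VirtualMachine:
    """Toy VM (hypothetical): memory maps page number -> bytes."""

    def __init__(self):
        self.pages = {}
        self.devices = {}
        self.dirty = set()  # pages written since the last checkpoint

    def write_page(self, pfn, data):
        # Only the active VM executes code, so only it dirties pages.
        self.pages[pfn] = data
        self.dirty.add(pfn)


def checkpoint(active, passive):
    """Copy dirty pages and device state from the active to the passive VM.

    Returns the number of pages that were sent.
    """
    for pfn in sorted(active.dirty):
        passive.pages[pfn] = active.pages[pfn]
    # Device state is small, so this sketch copies it whole every time.
    passive.devices = copy.deepcopy(active.devices)
    sent = len(active.dirty)
    active.dirty.clear()
    return sent


active, passive = VirtualMachine(), VirtualMachine()
active.write_page(0, b"boot")
active.write_page(7, b"data")
active.devices["nic"] = {"tx_head": 3}
first = checkpoint(active, passive)   # sends pages 0 and 7

active.write_page(7, b"DATA")
second = checkpoint(active, passive)  # sends only page 7
```

After each call to `checkpoint()` the passive copy matches the active one, and the second call transfers a single page because only that page was modified in between.
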
Another possible approach is to run a pair of systems in lock-step
(à la VMware FT). Since both the primary and fallback virtual machines
are active, keeping them synchronized is a complex task, which usually
involves carefully injecting external events into both virtual
machines so that they result in identical states.
The latter approach is extremely architecture specific and not SMP
friendly. This spurred us to try the design that became Kemari, which
we believe lends itself to further optimizations.
== Implementation ==
The first step is to encapsulate the machine to be protected within a
virtual machine. Then the live migration functionality is leveraged to
keep the virtual machines synchronized.
During live migration dirty pages can be sent asynchronously from the
primary to the fallback server until the ratio of dirty pages is low
enough to guarantee very short downtimes. In fault tolerance
solutions, by contrast, whenever a synchronization point is reached
the changes made to the virtual machine since the previous one have to
be sent to the fallback server.
Since the virtual machine has to be stopped until the data reaches and
is acknowledged by the fallback server, the synchronization model is
of critical importance for performance (both in terms of raw
throughput and latencies). The model chosen for Kemari along with
other implementation details is described below.
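
To get a feel for why the synchronization model matters, here is a back-of-the-envelope estimate of how long the guest stays stopped at one synchronization point. It is a first-order sketch only: the formula (transfer time plus one round trip for the acknowledgement), the function name `sync_pause_seconds`, and all input figures are assumptions made for illustration, not measurements of Kemari.

```python
PAGE_SIZE = 4096  # bytes; the usual x86 page size


def sync_pause_seconds(dirty_pages, device_state_bytes,
                       bandwidth_bytes_per_s, rtt_s):
    """Estimate the guest pause for one synchronization point.

    Assumed model: the pause is the time needed to push the payload
    over the wire plus one round trip for the acknowledgement.
    """
    payload = dirty_pages * PAGE_SIZE + device_state_bytes
    return payload / bandwidth_bytes_per_s + rtt_s


# Illustrative inputs: 256 dirty pages (1 MiB), a 1 Gbit/s link
# (125,000,000 bytes/s) and a 200 microsecond round trip.
pause = sync_pause_seconds(dirty_pages=256,
                           device_state_bytes=0,
                           bandwidth_bytes_per_s=125_000_000,
                           rtt_s=0.0002)
```

With these made-up inputs the pause comes out at roughly 8.6 ms, dominated by the transfer time. Under this model, either sending fewer dirty pages per synchronization point or using a faster, lower latency link shortens the pause.
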
* Synchronization model
The synchronization points were carefully chosen to minimize the
amount of traffic that goes over the wire while still keeping the
FT pair consistent at all times. To be precise, Kemari uses events
that modify externally visible state as synchronization points. This
means that all outgoing I/O needs to be trapped and sent to the
fallback host before the primary is resumed, so that it can be
replayed in the face of hardware failure.
The basic assumption here is that outgoing I/O operations are
idempotent, which is usually true for disk I/O and reliable network
protocols such as TCP (Kemari may trigger hidden bugs in applications
that use UDP or other unreliable protocols, so those may need minor
changes to ensure they work properly after failover).
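
The synchronization model and the idempotency assumption can be sketched with a small simulation. Everything in it is hypothetical and exists only for illustration: the `Disk`, `Fallback`, and `Primary` classes, their methods, and the assumption that both hosts can reach the same disk. The point is that an outgoing write is sent to the fallback and acknowledged before the primary performs it, so replaying it after a failure yields the same result whether or not the primary got to execute it.

```python
class Disk:
    """Storage reachable from both hosts (an assumption of this sketch)."""

    def __init__(self):
        self.sectors = {}

    def write(self, sector, data):
        # Idempotent: repeating the same write leaves the same contents.
        self.sectors[sector] = data


class Fallback:
    def __init__(self, disk):
        self.disk = disk
        self.memory = {}
        self.pending_io = None  # last outgoing I/O received from the primary

    def receive(self, dirty_memory, io_event):
        self.memory.update(dirty_memory)
        self.pending_io = io_event
        return True  # acknowledgement back to the primary

    def take_over(self):
        # On failover, replay the outgoing I/O that may not have completed.
        if self.pending_io is not None:
            self.disk.write(*self.pending_io)
            self.pending_io = None


class Primary:
    def __init__(self, disk, fallback):
        self.disk = disk
        self.fallback = fallback
        self.memory = {}
        self.dirty = {}

    def guest_store(self, addr, value):
        # Not externally visible, so this is not a synchronization point.
        self.memory[addr] = value
        self.dirty[addr] = value

    def guest_disk_write(self, sector, data, crash_before_io=False):
        # Outgoing I/O: synchronize first, resume only after the ack.
        if not self.fallback.receive(self.dirty, (sector, data)):
            raise RuntimeError("fallback did not acknowledge")
        self.dirty = {}
        if crash_before_io:
            return  # simulated hardware failure right after the sync
        self.disk.write(sector, data)


def run(crash_before_io):
    disk = Disk()
    fallback = Fallback(disk)
    primary = Primary(disk, fallback)
    primary.guest_store(0x10, 42)
    primary.guest_disk_write(5, b"hello", crash_before_io)
    fallback.take_over()
    return disk.sectors, fallback.memory
```

Calling `run(True)` and `run(False)` produces the same disk contents and the same fallback memory. With a non-idempotent operation (for example an append) the replay in `take_over()` could apply the effect twice, which is why the assumption matters.
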
The synchronization process can be broken down as follows:
- Event tapping: On KVM all I/O generates a VMEXIT that is
synchronously handled by the Linux kernel monitor, i.e. KVM (it is
worth noting that this applies to virtio devices too, because they
use MMIO and PIO just like a regular PCI device).
- VCPU/Guest freezing: This is automatic in the UP case. On SMP
environments we may need to send an IPI to stop the other VCPUs.
- Notification to qemu: Taking a page from live migration's
playbook, the synchronization process is user-space driven, which
means that qemu needs to be woken up at each synchronization
point. That is already the case for qemu-emulated devices, but we
also have in-kernel emulators. To compound the problem, even for
user-space emulated devices, accesses to coalesced MMIO areas
cannot be detected. As a consequence we need a mechanism to
communicate KVM-handled events to qemu.
The channel for KVM-qemu communication can easily be built upon
the existing infrastructure. We just need to add a new page to
the kvm_run shared memory area that can be mmapped from user space
and set the exit reason appropriately.
Regarding in-kernel device emulators, we only need to care about
writes. Specifically, making kvm_io_bus_write() fail when Kemari
is activated and invoking the emulator again after re-entrance
from user space should suffice (this is somewhat similar to what
we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).
To avoid missing synchronization points one should be careful with
coalesced MMIO-like optimizations. In the particular case of
coalesced MMIO, the I/O operation that caused the exit to user
space should act as a write barrier when it was due to an access
to a non-coalesced MMIO area. This means that before proceeding to
handle the exit in kvm_run() we have to make sure that all the
coalesced MMIO has reached the fallback host.
- Virtual machine synchronization: All the dirty pages since the
last synchronization point and the state of the virtual devices are
sent to the fallback node from the user-space qemu process. For this
the existing savevm infrastructure and KVM's dirty page tracking