[Qemu-devel] Re: [RFC] KVM Fault Tolerance: Kemari for KVM

From:	Yoshiaki Tamura
Subject:	[Qemu-devel] Re: [RFC] KVM Fault Tolerance: Kemari for KVM
Date:	Tue, 17 Nov 2009 20:04:20 +0900
User-agent:	Thunderbird 2.0.0.23 (Windows/20090812)

Avi Kivity wrote:

On 11/16/2009 04:18 PM, Fernando Luis Vázquez Cao wrote:

Avi Kivity wrote:

On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:


Kemari runs paired virtual machines in an active-passive configuration
and achieves whole-system replication by continuously copying the
state of the system (dirty pages and the state of the virtual devices)
from the active node to the passive node. An interesting implication
of this is that during normal operation only the active node is
actually executing code.

Can you characterize the performance impact for various workloads? I assume you are running continuously in log-dirty mode. Doesn't this make memory intensive workloads suffer?


Yes, we're running continuously in log-dirty mode.

We still do not have numbers to show for KVM, but
the snippets below from several runs of lmbench
using Xen+Kemari will give you an idea of what you
can expect in terms of overhead. All the tests were
run using a fully virtualized Debian guest with
hardware nested paging enabled.

                     fork exec   sh    P/F  C/S   [us]
------------------------------------------------------
Base                  114  349 1197 1.2845  8.2
Kemari(10GbE) + FC    141  403 1280 1.2835 11.6
Kemari(10GbE) + DRBD  161  415 1388 1.3145 11.6
Kemari(1GbE) + FC     151  410 1335 1.3370 11.5
Kemari(1GbE) + DRBD   162  413 1318 1.3239 11.6
* P/F=page fault, C/S=context switch

The benchmarks above are memory intensive and, as you
can see, the overhead varies widely from 7% to 40%.
We also measured CPU bound operations, but, as expected,
Kemari incurred almost no overhead.
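
For reference, the relative overhead implied by the table above can be recomputed with a short script. This is only a sketch: the numbers are copied from the lmbench table, and the names `overhead_pct` and `overhead_table` are made up for illustration.

```python
# lmbench results copied from the table above (all values in microseconds).
COLUMNS = ["fork", "exec", "sh", "P/F", "C/S"]
BASE = [114, 349, 1197, 1.2845, 8.2]
RUNS = {
    "Kemari(10GbE) + FC":   [141, 403, 1280, 1.2835, 11.6],
    "Kemari(10GbE) + DRBD": [161, 415, 1388, 1.3145, 11.6],
    "Kemari(1GbE) + FC":    [151, 410, 1335, 1.3370, 11.5],
    "Kemari(1GbE) + DRBD":  [162, 413, 1318, 1.3239, 11.6],
}


def overhead_pct(base, value):
    """Relative slowdown of `value` versus `base`, in percent."""
    return (value - base) / base * 100.0


def overhead_table():
    """Map each configuration to its per-column overhead in percent."""
    return {
        name: [overhead_pct(b, v) for b, v in zip(BASE, values)]
        for name, values in RUNS.items()
    }


if __name__ == "__main__":
    for name, row in overhead_table().items():
        cells = "  ".join(f"{c}={p:5.1f}%" for c, p in zip(COLUMNS, row))
        print(f"{name:22s} {cells}")
```

The page fault column barely moves, while fork and context switch carry most of the overhead.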


Is lmbench fork that memory intensive?

Do you have numbers for benchmarks that use significant anonymous RSS? Say, a parallel kernel build.

Note that scaling vcpus will increase a guest's memory-dirtying power but snapshot rate will not scale in the same way.


I don't think lmbench is memory intensive, but it is sensitive to memory latency.
We'll measure kernel build time with minimum config, and post it later.

  - Notification to qemu: Taking a page from live migration's
    playbook, the synchronization process is user-space driven, which
    means that qemu needs to be woken up at each synchronization
    point. That is already the case for qemu-emulated devices, but we
    also have in-kernel emulators. To compound the problem, even for
    user-space emulated devices accesses to coalesced MMIO areas can
    not be detected. As a consequence we need a mechanism to
    communicate KVM-handled events to qemu.


Do you mean the ioapic, pic, and lapic?


Well, I was more worried about the in-kernel backends currently in the
works. To save the state of those devices we could leverage qemu's vmstate
infrastructure and even reuse struct VMStateDescription's pre_save()
callback, but we would like to pass the device state through the kvm_run
area to avoid an ioctl call right after returning to user space.

Hm, let's defer all that until we have something working so we can estimate the impact of userspace virtio in those circumstances.


OK.  We'll start implementing everything in userspace first.

Why is access to those chips considered a synchronization point?

The main problem with those is that to get the chip state we
use an ioctl when we could have copied it to qemu's memory
before going back to user space. Not all accesses to those chips
need to be treated as synchronization points.

Ok. Note that piggybacking on an exit will work for the lapic, but not for the global irqchips (ioapic, pic) since they can still be modified by another vcpu.

I wonder if you can pipeline dirty memory synchronization. That is, write-protect those pages that are dirty, start copying them to the other side, and continue execution, copying memory if the guest faults it again.

Asynchronous transmission of dirty pages would be really helpful to
eliminate the performance hiccups that tend to occur at synchronization
points. What we can do is to copy dirty pages asynchronously until we reach
a synchronization point, where we need to stop the guest and send the
remaining dirty pages and the state of devices to the other side.

However, we can not delay the transmission of a dirty page across a
synchronization point, because if the primary node crashed before the
page reached the fallback node the I/O operation that caused the
synchronization point cannot be replayed reliably.
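
The scheme described above (background copying between synchronization points, then a full flush with the guest stopped) can be sketched as a toy model. This is not Kemari code: the class `SyncPrimary` and its methods are hypothetical, pages are plain integers, and device state transfer is left out.

```python
class SyncPrimary:
    """Toy model of the primary node; pages are just integers."""

    def __init__(self):
        self.dirty = set()       # pages dirtied since they were last sent
        self.sent_pages = []     # pages shipped to the fallback node, in order
        self.released_io = []    # I/O operations released to the outside

    def guest_write(self, page):
        """The guest dirties a page (re-dirtying a sent page queues it again)."""
        self.dirty.add(page)

    def background_send(self, max_pages):
        """Ship up to max_pages dirty pages while the guest keeps running."""
        for page in sorted(self.dirty)[:max_pages]:
            self.dirty.discard(page)
            self.sent_pages.append(page)

    def sync_point(self, io_op):
        """Guest is stopped here: flush what is left, then release the I/O."""
        remaining = sorted(self.dirty)
        self.sent_pages.extend(remaining)
        self.dirty.clear()
        # Device state would also be sent here (omitted in this toy model).
        self.released_io.append(io_op)
        return len(remaining)    # pages copied while the guest was stopped


if __name__ == "__main__":
    primary = SyncPrimary()
    for page in range(10):
        primary.guest_write(page)
    primary.background_send(6)   # pages 0..5 go out asynchronously
    primary.guest_write(2)       # page 2 is dirtied again after being sent
    print(primary.sync_point("disk write"), "pages sent with the guest stopped")
```

The I/O is released only after every dirty page has been sent, which is the constraint stated above: no dirty page is carried across a synchronization point.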

What I mean is:

- choose synchronization point A
- start copying memory for synchronization point A
  - output is delayed
- choose synchronization point B
- copy memory for A and B
   if guest touches memory not yet copied for A, COW it
- once A copying is complete, release A output
- continue copying memory for B
- choose synchronization point B

By keeping two synchronization points active, you don't have any pauses. The cost is maintaining copy-on-write so we can copy dirty pages for A while keeping execution.
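
The pipelined proposal above can be sketched as a simplified toy model. Everything here is an assumption for illustration (the class `PipelinedPrimary`, the `_LIVE` marker, memory as a dict): one synchronization point is in flight while the next one accumulates dirty pages, and a guest write to a page not yet copied for the in-flight point saves the old contents first. If a new point is chosen before the previous copy has finished, this model simply drains the previous one, which stands in for a pause.

```python
_LIVE = object()  # marker: page not yet copied and not yet overwritten


class PipelinedPrimary:
    """Toy model of pipelined synchronization with copy-on-write."""

    def __init__(self):
        self.memory = {}          # live guest memory: page -> contents
        self.dirty = set()        # pages dirtied since the last sync point
        self.in_flight = {}       # pages still owed for the in-flight point
        self.held_output = None   # output delayed until that copy completes
        self.fallback = {}        # memory image on the passive node
        self.released = []        # outputs made externally visible

    def guest_write(self, page, value):
        # Copy-on-write: keep the old contents if the in-flight sync point
        # has not copied this page yet.
        if self.in_flight.get(page) is _LIVE:
            self.in_flight[page] = self.memory[page]
        self.memory[page] = value
        self.dirty.add(page)

    def sync_point(self, output):
        """Choose a new sync point; returns pages the guest had to wait for."""
        stalled = len(self.in_flight)
        self.copy_some(stalled)   # previous point unfinished: guest must wait
        self.in_flight = {page: _LIVE for page in self.dirty}
        self.dirty = set()
        self.held_output = output
        return stalled

    def copy_some(self, max_pages):
        """Background copier: ship pages for the in-flight sync point."""
        for page in sorted(self.in_flight)[:max_pages]:
            snapshot = self.in_flight.pop(page)
            self.fallback[page] = (
                self.memory[page] if snapshot is _LIVE else snapshot
            )
        if not self.in_flight and self.held_output is not None:
            self.released.append(self.held_output)  # release delayed output
            self.held_output = None
```

In this model the fallback node always receives the page contents as they were at the synchronization point, even if the guest has since overwritten them, and the output for that point is released only once its last page has been copied.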

The overall idea seems good, but if I'm understanding correctly, we need a buffer for copying memory locally, and when it gets full, or when we COW the memory for B, we still have to pause the guest to prevent it from overwriting. Correct?

To make things simple, we would like to start with the synchronous transmission first, and tackle asynchronous transmission later.

How many pages do you copy per synchronization point for reasonably difficult workloads?


That is very workload-dependent, but if you take a look at the examples
below you can get a feeling of how Kemari behaves.

IOzone            Kemari sync interval[ms]  dirtied pages
---------------------------------------------------------
buffered + fsync                       400           3000
O_SYNC                                  10             80

In summary, if the guest executes few I/O operations, the interval
between Kemari synchronization points will increase and the number of
dirtied pages will grow accordingly.


In the example above, the externally observed latency grows to 400 ms, yes?

Not exactly. The sync interval refers to the interval of synchronization points captured when the workload is running. In the example above, when the observed sync interval is 400ms, it takes about 150ms to sync VMs with 3000 dirtied pages. Kemari resumes I/O operations immediately once the synchronization is finished, and thus, the externally observed latency is 150ms in this case.
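
As a rough back-of-the-envelope check of the figures above, the per-page cost and effective transfer rate can be derived as follows. The 4 KiB page size is an assumption (it is not stated in this thread), the helper `sync_cost` is hypothetical, and the time spent sending device state is folded into the total.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB pages, not stated in the thread


def sync_cost(dirtied_pages, sync_time_ms, page_size=PAGE_SIZE):
    """Return (microseconds per page, effective MB/s) for one sync."""
    per_page_us = sync_time_ms * 1000.0 / dirtied_pages
    megabytes = dirtied_pages * page_size / 1e6
    throughput_mb_s = megabytes / (sync_time_ms / 1000.0)
    return per_page_us, throughput_mb_s


if __name__ == "__main__":
    per_page, rate = sync_cost(dirtied_pages=3000, sync_time_ms=150)
    print(f"{per_page:.1f} us per page, about {rate:.1f} MB/s")
```

Under that assumption, 3000 pages in 150ms works out to 50 microseconds per page, or roughly 82 MB/s.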


Thanks,

Yoshi
