From: Walid Nouri
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Wed, 17 Sep 2014 22:53:32 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

Thank you for your time and the detailed answer!
I needed some time to work through your answer ;-)

What MC needs is a block device agnostic, controlled and asynchronous
approach for replicating the contents of block devices and their state changes
to the secondary VM while the primary VM is running. Asynchronous block
transfer is important to allow maximum performance for the primary VM, while
keeping the secondary VM updated with state changes.

The block device replication should be possible in two stages or modes.

The first stage is the live copy of all block devices of the primary to the
secondary. This is necessary if the secondary doesn't have an existing
image which is in sync with the primary at the time MC is started. This is
not very convenient, but as far as I know there is currently no mechanism for
persistent dirty bitmaps in QEMU.

I think you are trying to address the non-shared storage case where the
secondary needs to acquire the initial state of the primary.

That's correct!

drive-mirror copies the contents of a source disk image to a
destination.  If the guest is running while copying takes place then new
writes will also be mirrored.

drive-mirror should be sufficient for the initial phase where primary
and secondary get in sync.

Fam Zheng sent a patch series earlier this year to add dirty bitmaps for
block devices to QEMU.  It only supported in-memory bitmaps but
persistent bitmaps are fairly straightforward to implement.  I'm
interested in these patches for the incremental backup use case.
https://lists.gnu.org/archive/html/qemu-devel/2014-03/msg05250.html

I guess the reason you mention persistent bitmaps is to save time when
adding a host that previously participated and has an older version of
the disk image?

Yes, it is desirable not to have to mirror the whole image every time before MC protection can become active. This would save time in case of lost communication, a shutdown or maintenance on the secondary.

The persistent dirty bitmap must have a mechanism to identify that a pair of images belong to each other and which of the two is the primary with the currently valid data. I think that's a self-sufficient "little" project... but the next logical step :)


The second stage (mode) is the replication of block device state changes
(modified blocks) to keep the image on the secondary in sync with the
primary. The mirrored blocks must be buffered in RAM (block buffer) until
the complete checkpoint (RAM, vCPU, device state) can be committed.

For keeping the complete system state consistent on the secondary system
there must be a possibility for MC to commit/discard block device state
changes. In normal operation the mirrored block device state changes (block
buffer) are committed to disk when the complete checkpoint is committed. In
case of a crash of the primary system while transferring a checkpoint, the
data in the block buffer corresponding to the failed checkpoint must be
discarded.
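
Roughly, I imagine the block buffer and its commit/discard like this. This is
only a sketch; MCBufferedWrite, MCBlockBuffer, mc_buffer_commit() and
mc_buffer_discard() are made-up names for illustration, only the QSIMPLEQ
macros, bdrv_co_writev() and bdrv_co_flush() are existing QEMU interfaces:

    #include "qemu/queue.h"   /* QSIMPLEQ_* list macros            */
    #include "qemu/iov.h"     /* QEMUIOVector                      */
    #include "block/block.h"  /* bdrv_co_writev(), bdrv_co_flush() */

    /* One buffered write of the current epoch (hypothetical structure). */
    typedef struct MCBufferedWrite {
        int64_t sector_num;                    /* start sector on the image */
        int nb_sectors;                        /* length of the write       */
        uint8_t *data;                         /* copy of the written data  */
        QSIMPLEQ_ENTRY(MCBufferedWrite) next;  /* kept in arrival order     */
    } MCBufferedWrite;

    typedef struct MCBlockBuffer {
        QSIMPLEQ_HEAD(, MCBufferedWrite) writes;  /* writes of this epoch   */
        BlockDriverState *target;                 /* image on the secondary */
    } MCBlockBuffer;

    /* Commit: write the buffered data of the epoch in arrival order and
     * flush the host disk cache so the image stays crash consistent. */
    static int coroutine_fn mc_buffer_commit(MCBlockBuffer *buf)
    {
        MCBufferedWrite *w;

        QSIMPLEQ_FOREACH(w, &buf->writes, next) {
            struct iovec iov = { .iov_base = w->data,
                                 .iov_len  = w->nb_sectors * 512 };
            QEMUIOVector qiov;
            int ret;

            qemu_iovec_init_external(&qiov, &iov, 1);
            ret = bdrv_co_writev(buf->target, w->sector_num,
                                 w->nb_sectors, &qiov);
            if (ret < 0) {
                return ret;
            }
        }
        return bdrv_co_flush(buf->target);
    }

    /* Discard: the checkpoint failed, so the buffered writes are dropped. */
    static void mc_buffer_discard(MCBlockBuffer *buf)
    {
        MCBufferedWrite *w;

        while ((w = QSIMPLEQ_FIRST(&buf->writes)) != NULL) {
            QSIMPLEQ_REMOVE_HEAD(&buf->writes, next);
            g_free(w->data);
            g_free(w);
        }
    }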

Thoughts:

Writing data safely to disk can take milliseconds.  Not sure how that
figures into your commit step, but I guess commit needs to be fast.

We have no time to waste ;) but the disk semantics expected by the primary should be preserved. The time to acknowledge a checkpoint from the secondary will be delayed by the time needed to write all pending I/O requests of a checkpoint to disk. I think for normal operation (pure replication) the secondary can use the same semantics for disk writes as the primary. Wouldn't that be safe enough?

I/O requests happen in parallel with CPU execution, so could an I/O
request be pending across a checkpoint commit?  Live migration does not
migrate inflight requests, although it has special case code for
migration requests that have failed at the host level and need to be
retried.  Another way of putting this is that live migration uses
bdrv_drain_all() to quiesce disks before migrating device state - I
don't think you have that luxury since bdrv_drain_all() can take a long
time and is not suitable for microcheckpointing.

Block devices have the following semantics:
1. There is no ordering between parallel in-flight I/O requests.
2. The guest sees the disk state for completed writes but it may not see
    disk state of in-flight writes (due to #1).
3. Completed writes are only guaranteed to be persistent across power
    failure if a disk cache flush was submitted and completed after the
    writes completed.

I'm not sure if I got your point.

The proposed MC block device protocol sends all block device state updates to the secondary directly after writing them to the primary block devices. This preserves the disk semantics for the primary, and the secondary stays updated with the disk state changes of the current epoch.

At the end of an epoch the primary is paused to create a system state snapshot. At this moment there could be some pending write I/O requests on the primary which overlap with the generation of the system state snapshot. Do you mean a situation like that?

If this is your point, then I think you are right, this is possible... and that raises your interesting question: how to deal with pending requests at the end of an epoch, or how to be sure that all disk state changes of an epoch have been replicated?

Currently the MC protocol only covers part of the system state (RAM, vCPU, devices) and excludes the block device state changes.

To correctly use the drive-mirror functionality, the MC protocol must also be extended to check that all disk state changes of the primary corresponding to the current epoch have been delivered to the secondary.

When all state data has been sent completely, the checkpoint transaction can be committed.

When the checkpoint transaction is complete, the secondary commits its disk state buffer and the rest (RAM, vCPU, devices) of the checkpoint and ACKs the complete checkpoint to the primary.
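
In pseudo code the intended ordering on the secondary would be something like
this (MCState and all mc_* helpers are invented names, only the ordering is
the point: nothing is applied unless the whole checkpoint has arrived):

    /* Sketch of the secondary's side of one checkpoint transaction. */
    static int mc_secondary_apply_checkpoint(MCState *s)
    {
        /* Receive RAM, vCPU and device state of the epoch; the disk state
         * changes have already been streamed into the block buffer. */
        if (mc_receive_checkpoint(s) < 0) {
            /* Primary died mid-epoch: drop the partial disk state and
             * fail over from the last committed checkpoint. */
            mc_buffer_discard(s->block_buffer);
            return -1;
        }

        mc_buffer_commit(s->block_buffer);   /* disk state of the epoch */
        mc_apply_ram_vcpu_devices(s);        /* rest of the checkpoint  */
        mc_send_ack(s);                      /* ACK to the primary      */
        return 0;
    }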

IMHO the easiest way for MC to track that all block device changes have been replicated would be to ask drive-mirror whether the paused primary has unprocessed write requests.

As long as there are dirty blocks or in-flight requests, the checkpoint transaction of the current epoch is not complete.

Maybe you can give me a hint what you think is the best way (API call(s)) to ask drive-mirror whether there are pending write operations?
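
The kind of check I have in mind would look roughly like this. Both accessors
(and the MirrorJob type) are invented names; as far as I can see the mirror
job keeps equivalent counters internally in block/mirror.c but does not export
them, so this is exactly the interface question:

    /* Hypothetical: the disk state of the epoch is fully replicated only
     * when the mirror has no dirty sectors left and no requests in flight. */
    static bool mc_epoch_disk_state_replicated(MirrorJob *job)
    {
        return mirror_job_dirty_count(job) == 0 &&
               mirror_job_in_flight(job) == 0;
    }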

I think this can be achieved by drive-mirror and a filter block driver.
Another approach could be to exploit the block migration functionality of
live migration with a filter block driver.

block-migration.c should be avoided because it may be dropped from QEMU.
It is unloved code and has been replaced by drive-mirror.

Good to know!!!
I will avoid using block-migration.c.


drive-mirror (and live migration) does not rely on shared storage and
allows live block device copying and incremental syncing.

A block buffer can be implemented with a QEMU filter block driver. It should
sit at the same position as the Quorum driver in the block driver hierarchy.
When using the block filter approach, MC will be transparent and block device
agnostic.

The block buffer filter must have an interface which allows MC to control the
commits or discards of block device state changes. I have no idea where to
put such an interface to stay conformant with the QEMU coding style.
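
For the filter itself I have a layout like Quorum's in mind, roughly as below
(written against the block layer as of QEMU 2.1, the hook names change between
versions; the mc_buffer_* calls are the hypothetical epoch buffer from the
sketch above; on the secondary the write hook only fills the buffer, the data
reaches bs->file when MC commits the checkpoint):

    #include "block/block_int.h"   /* BlockDriver, BlockDriverState */

    /* Per-instance state of the (hypothetical) "mc-buffer" filter. */
    typedef struct MCBufferState {
        MCBlockBuffer buffer;      /* epoch buffer from the sketch above */
    } MCBufferState;

    /* Writes arriving from the primary (via the NBD export) are only copied
     * into the buffer; they reach bs->file when MC commits the checkpoint. */
    static int coroutine_fn mc_filter_co_writev(BlockDriverState *bs,
                                                int64_t sector_num,
                                                int nb_sectors,
                                                QEMUIOVector *qiov)
    {
        MCBufferState *s = bs->opaque;
        return mc_buffer_append(&s->buffer, sector_num, nb_sectors, qiov);
    }

    /* Reads pass straight through to the image below the filter. */
    static int coroutine_fn mc_filter_co_readv(BlockDriverState *bs,
                                               int64_t sector_num,
                                               int nb_sectors,
                                               QEMUIOVector *qiov)
    {
        return bdrv_co_readv(bs->file, sector_num, nb_sectors, qiov);
    }

    static BlockDriver bdrv_mc_buffer = {
        .format_name    = "mc-buffer",
        .instance_size  = sizeof(MCBufferState),
        .bdrv_co_readv  = mc_filter_co_readv,
        .bdrv_co_writev = mc_filter_co_writev,
        /* .bdrv_open/.bdrv_close plus the commit/discard entry points for
         * the MC protocol (QMP? blockdev.c?) are still the open question. */
    };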


I'm sure there are alternative and better approaches and I'm open to
any ideas.

You can use drive-mirror and the run-time NBD server in QEMU without
modification:

   Primary (drive-mirror)   ---writes--->   Secondary (NBD export in QEMU)

Your block filter idea can work and must have the logic so that a commit
operation sent via the microcheckpointing protocol causes the block
filter to write buffered data to disk and flush the host disk cache.

That's exactly what the block filter has to do. Where would be the right place to put the API call to the block filter flush logic - "blockdev.c"?

To ensure that the disk image on the secondary is always in a crash
consistent state (i.e. the state you get from power failure), the
secondary needs to know when disk cache flush requests were sent and the
write ordering.  That way, even if there is a power failure while the
secondary is committing, the disk will be in a crash consistent state.
After the secondary (or primary) is booted again file systems or
databases will be able to fsck and resume.

(In other words, in a catastrophic failure you won't be any worse off
than with a power failure on an unprotected single machine.)

In case of a failover the secondary must drain all disks before becoming the new primary, even if there is a delay caused by flushing disk buffers. Otherwise the state of the block devices might not be consistent with the rest of the system state when the (new) primary starts processing.
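
In code the failover step I mean would be roughly this (bdrv_drain_all() and
bdrv_flush_all() exist in QEMU, the MCState and mc_* parts are again
hypothetical):

    /* Promote the secondary to the new primary after the primary has died.
     * Nothing of the incomplete epoch may become visible. */
    static void mc_failover_promote(MCState *s)
    {
        mc_buffer_discard(s->block_buffer);  /* drop the unfinished epoch   */
        bdrv_drain_all();                    /* wait for in-flight requests */
        bdrv_flush_all();                    /* flush the host disk caches  */
        /* ...then load the last committed RAM/vCPU/device state and let the
         * guest resume execution as the new primary... */
    }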


Walid


