From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Thu, 18 Sep 2014 14:56:04 +0100
User-agent: Mutt/1.5.23 (2014-03-12)

On Wed, Sep 17, 2014 at 10:53:32PM +0200, Walid Nouri wrote:
> >Writing data safely to disk can take milliseconds.  Not sure how that
> >figures into your commit step, but I guess commit needs to be fast.
> >
> We have no time to waste ;) but the disk semantics at the primary should be
> kept as the primary expects. The acknowledgement of a checkpoint by the
> secondary will be delayed by the time needed to write all pending I/O
> requests of that checkpoint to disk.
> I think for normal operation (just replication) the secondary can use the
> same semantics for the disk writes as the primary. Wouldn't that be safe
> enough?

There is the issue of request ordering (using write cache flushes).  The
secondary probably needs to perform requests in the same order and
interleave cache flushes in the same way as the primary.  Otherwise a
power failure on the secondary could leave the disk in an invalid state
that is impossible on the primary.  So I'm just pointing out that cache
flush operations matter, not just read/write.
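
To make that concrete, here is a minimal sketch (plain POSIX, not QEMU
code; the mc_op record and replay_op() helper are made up for
illustration) of the secondary replaying the primary's request stream:
writes are applied in the order they arrive and every flush marker from
the primary becomes a real fdatasync() at the same point in the stream.

#include <stdint.h>
#include <unistd.h>

enum mc_op_type { MC_OP_WRITE, MC_OP_FLUSH };

struct mc_op {
    enum mc_op_type type;
    uint64_t offset;            /* valid for MC_OP_WRITE */
    const void *data;
    size_t len;
};

static int replay_op(int fd, const struct mc_op *op)
{
    switch (op->type) {
    case MC_OP_WRITE:
        /* Apply the write exactly where the primary applied it. */
        if (pwrite(fd, op->data, op->len, op->offset) != (ssize_t)op->len) {
            return -1;
        }
        return 0;
    case MC_OP_FLUSH:
        /*
         * The primary issued a cache flush here; reproduce it so a power
         * failure on the secondary cannot expose a write ordering that
         * the primary's disk could never have had.
         */
        return fdatasync(fd);
    }
    return -1;
}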

The second, and bigger, point is that if disk commit holds back
checkpoint commit, it could be a significant performance problem due to
the slow nature of disks.

There are fancier solutions using either a journal or snapshots that
provide data integrity without posing a performance bottleneck during
the commit phase.

The trick is to apply write requests as they come off the wire on the
secondary but use a journal or snapshot mechanism to enforce commit
semantics.  That way the commit doesn't have to wait for writing out all
the data to disk.
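
Purely to illustrate that journal idea (the record layout and helper
names below are hypothetical, not an existing QEMU mechanism): writes
for the current epoch are appended to a journal as they come off the
wire, and committing an epoch only has to persist a small commit
record, so the checkpoint ACK never waits for the data to be applied to
the much slower, random-access disk image.  Replaying committed entries
into the image happens in the background (not shown).

#include <stdint.h>
#include <unistd.h>

struct journal_entry {
    uint64_t epoch;
    uint64_t image_offset;
    uint32_t len;
    /* followed by len bytes of data in the journal file */
};

/* Append one incoming write to the journal as it arrives. */
static int journal_append(int journal_fd, uint64_t epoch,
                          uint64_t image_offset, const void *data,
                          uint32_t len)
{
    struct journal_entry hdr = {
        .epoch = epoch, .image_offset = image_offset, .len = len,
    };
    if (write(journal_fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr)) {
        return -1;
    }
    if (write(journal_fd, data, len) != (ssize_t)len) {
        return -1;
    }
    return 0;
}

/*
 * Commit an epoch: make the journal entries durable and record the
 * epoch as committed.  This is the only step the checkpoint ACK waits
 * for; the data does not have to reach the disk image yet.
 */
static int journal_commit_epoch(int journal_fd, uint64_t epoch)
{
    uint64_t commit_record[2] = { epoch, UINT64_MAX /* commit marker */ };
    if (write(journal_fd, commit_record, sizeof(commit_record))
            != (ssize_t)sizeof(commit_record)) {
        return -1;
    }
    return fdatasync(journal_fd);
}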

> >I/O requests happen in parallel with CPU execution, so could an I/O
> >request be pending across a checkpoint commit?  Live migration does not
> >migrate in-flight requests, although it has special-case code for
> >migrating requests that have failed at the host level and need to be
> >retried.  Another way of putting this is that live migration uses
> >bdrv_drain_all() to quiesce disks before migrating device state - I
> >don't think you have that luxury since bdrv_drain_all() can take a long
> >time and is not suitable for microcheckpointing.
> >
> >Block devices have the following semantics:
> >1. There is no ordering between parallel in-flight I/O requests.
> >2. The guest sees the disk state for completed writes but it may not see
> >    disk state of in-flight writes (due to #1).
> >3. Completed writes are only guaranteed to be persistent across power
> >    failure if a disk cache flush was submitted and completed after the
> >    writes completed.
> >
> I'm not sure if I got your point.
> 
> The proposed MC block device protocol sends all block device state updates
> to the secondary directly after writing them to the primary block devices.
> This keeps the disk semantics for the primary, and the secondary stays
> updated with the disk state changes of the current epoch.
> 
> At the end of an epoch the primary gets paused to create a system state
> snapshot. At this moment there could be some pending write I/O requests on
> the primary which overlap with the generation of the system state snapshot.
> Do you mean a situation like that?

Yes, that's what I meant in the first paragraph.  The primary has not
completed the I/O request yet but QEMU's live migration is currently not
equipped to migrate in-flight requests so we're in trouble!

> If this is your point then I think you are right, this is possible...and
> that raises your interesting question: How to deal with pending requests at
> the end of an epoch, or how to be sure that all disk state changes of an
> epoch have been replicated?
> 
> Currently the MC protocol only cares about a part of the system state
> (RAM,vCPU,devices) and excludes the block device state changes.
> 
> To correctly use drive-mirror functionality the MC protocol must also be
> extended to check that all disk state changes of the primary corresponding
> to the current epoch have been delivered to the secondary.
> 
> When all state data is completely sent the checkpoint transaction can be
> committed.
> 
> When the checkpoint transaction is complete the secondary commits its disk
> state buffer and the rest (RAM, vCPU, devices) of the checkpoint and ACKs
> the complete checkpoint to the primary.
> 
> IMHO the easiest way for MC to track that all block device changes have been
> replicated would be to ask drive-mirror if the paused primary has
> unprocessed write requests.
> 
> As long as there are dirty blocks or in-flight requests, the checkpoint
> transaction of the current epoch is not complete.
> 
> Maybe you can give me a hint what you think is the best way (API call(s)) to
> ask drive-mirror whether there are pending write operations?

The details depend on the code and I don't remember everything well
enough.  Anyway, my mental model is:

1. The dirty bit is set *after* the primary has completed the write.
   See bdrv_aligned_pwritev().  Therefore you cannot use the dirty
   bitmap to query in-flight requests; instead you have to look at
   bs->tracked_requests (a minimal check is sketched below).

2. The mirror block job periodically scans the dirty bitmap (when there
   is no rate limit set it does this with no artificial delays) and
   writes out the dirty blocks.
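
As an illustration of point 1 (hedged, since the exact fields and
helpers differ between QEMU versions), a check along these lines could
tell the MC code whether a BlockDriverState still has requests in
flight:

#include "block/block_int.h"   /* BlockDriverState, BdrvTrackedRequest */

static bool mc_bdrv_has_inflight_requests(BlockDriverState *bs)
{
    /* tracked_requests holds requests that have started but not yet
     * completed; the dirty bitmap only covers completed writes. */
    return !QLIST_EMPTY(&bs->tracked_requests);
}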

Given that cache flush requests probably need to be tracked too, maybe
you need an MC-specific block driver on the primary to monitor and control
I/O requests.

But I haven't thought this through and it's non-trivial so we need to
break this down more.

> >>>I'm sure there are alternative and better approaches and I'm open for
> >>>any ideas
> >
> >You can use drive-mirror and the run-time NBD server in QEMU without
> >modification:
> >
> >   Primary (drive-mirror)   ---writes--->   Secondary (NBD export in QEMU)
> >
> >Your block filter idea can work and must have the logic so that a commit
> >operation sent via the microcheckpointing protocol causes the block
> >filter to write buffered data to disk and flush the host disk cache.
> 
> That's exactly what the block filter has to do. Where would be the right
> place to put the API call to the block filter flush logic: "blockdev.c"?

block.c has the APIs that BlockDriverState nodes support.  For example,
bdrv_invalidate_cache_all() lives in block.c.

The other approach is for MC to offer a listener interface so interested
components can register callbacks that are invoked pre/post commit.
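
As a rough sketch of that listener idea, QEMU's existing Notifier lists
(include/qemu/notify.h) could be reused; the mc_* names below are
hypothetical:

#include "qemu/notify.h"

static NotifierList mc_pre_commit_notifiers =
    NOTIFIER_LIST_INITIALIZER(mc_pre_commit_notifiers);
static NotifierList mc_post_commit_notifiers =
    NOTIFIER_LIST_INITIALIZER(mc_post_commit_notifiers);

/* Hypothetical MC API: interested components register for commit hooks. */
void mc_add_pre_commit_notifier(Notifier *n)
{
    notifier_list_add(&mc_pre_commit_notifiers, n);
}

void mc_add_post_commit_notifier(Notifier *n)
{
    notifier_list_add(&mc_post_commit_notifiers, n);
}

/* Called from the MC commit path around the checkpoint commit. */
static void mc_run_pre_commit(void *opaque)
{
    notifier_list_notify(&mc_pre_commit_notifiers, opaque);
}

static void mc_run_post_commit(void *opaque)
{
    notifier_list_notify(&mc_post_commit_notifiers, opaque);
}

A block filter would then register a pre-commit notifier whose callback
writes out its buffered data and flushes the host disk cache.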

Both techniques are commonly used within QEMU, so I wouldn't worry about
that yet.  Best to decide when you are implementing the code.
