
From: Roni Luxenberg
Subject: Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
Date: Thu, 31 May 2012 04:44:51 -0400 (EDT)

----- Original Message -----
> On 30/05/2012 14:34, Geert Jansen wrote:
> > 
> > On 05/29/2012 02:52 PM, Paolo Bonzini wrote:
> > 
> >>> Does the drive-mirror coroutine send the writes to the target in the
> >>> same order as they are sent to the source? I assume so.
> >>
> >> No, it doesn't.  It's asynchronous; for continuous replication, the
> >> target knows that it has a consistent view whenever it sees a flush on
> >> the NBD stream.  Flushing the target is needed anyway before writing the
> >> dirty bitmap, so the target might as well exploit them to get
> >> information about the state of the source.
> >>
> >> The target _must_ flush to disk when it receives a flush command, no
> >> matter how close together they are.  It _may_ choose to snapshot the
> >> disk, for example establishing one new snapshot every 5 seconds.
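
As a rough sketch of that target-side contract (in Python; the ReplicationTarget class, take_snapshot() helper and the 5-second interval are illustrative assumptions, not the actual replication server):

    import os
    import time

    class ReplicationTarget:
        """Hypothetical target: fsync on every flush, snapshot only sometimes."""
        def __init__(self, path, snapshot_interval=5.0):
            self.fd = os.open(path, os.O_RDWR)
            self.snapshot_interval = snapshot_interval
            self.last_snapshot = 0.0

        def handle_write(self, offset, data):
            os.pwrite(self.fd, data, offset)      # may sit in the page cache for now

        def handle_flush(self):
            os.fsync(self.fd)                     # MUST: make received data stable
            now = time.monotonic()
            if now - self.last_snapshot >= self.snapshot_interval:
                self.take_snapshot()              # MAY: keep a consistent point in time
                self.last_snapshot = now

        def take_snapshot(self):
            pass  # e.g. create an external snapshot of the target image
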
> > 
> > Interesting. So it works quite differently than I had assumed. Some
> > follow-up questions, I hope you don't mind...
> > 
> >  * I assume a flush roughly corresponds to an fsync() in the guest OS?
> 
> Yes (or a metadata flush from the guest OS filesystem, since our guest
> models do not support attaching the FUA bit to single writes).
> 

A continuous replication application expects to get all I/Os in the same order
as issued by the guest, irrespective of the (rate of the) flushes done within
the guest. This is required to support more advanced features like cross-VM
consistency, where the application protects a group of VMs (possibly running
on different hosts) that form a single logical application/service.
Does the design strictly maintain this property?

> >  * Writes will not be re-ordered over a flush boundary, right?
> 
> More or less.  This, for example, is a valid ordering:
> 
>     write sector 0
>                              write 0 returns
>     flush
>     write sector 1
>                              write 1 returns
>                              flush returns
> 
> However, writes that have already returned will not be re-ordered over a
> flush boundary.
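
A small model of that rule, assuming the only guarantee is that a flush covers writes which had already returned when the flush was issued (the sets and function names below are illustrative):

    completed = set()      # writes whose completion the guest has observed
    durable = set()        # writes guaranteed to survive a power loss

    def write_returns(sector):
        completed.add(sector)

    def flush_issued():
        return set(completed)         # this flush must cover only these writes

    def flush_returns(covered):
        durable.update(covered)

    write_returns(0)                  # write 0 returns
    covered = flush_issued()          # flush issued: covers sector 0 only
    write_returns(1)                  # write 1 returns before the flush does
    flush_returns(covered)            # flush returns: only sector 0 is guaranteed
    assert durable == {0}
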
> 
> >> A synchronous implementation is not forbidden by the spec (by design),
> >> but at the moment it's a bit more complex to implement because, as you
> >> mention, it requires buffering the I/O data on the host.
> > 
> > So if I understand correctly, you'd only be keeping a list of
> > (len, offset) tuples without any data, and drive-mirror then reads the
> > data from the disk image? If that is the case, how do you handle a flush?
> > Does a flush need to wait for drive-mirror to drain the entire outgoing
> > queue to the target before it can complete? If not, how do you prevent
> > writes that happen after a flush from overwriting the data that will be
> > sent to the target in case it hasn't reached the flush point yet?
> 
> The key is that:
> 
> 1) you only flush the target when you have a consistent image of the
> source on the destination, and the replication server only creates a
> snapshot when it receives a flush.  Thus, the server does not create a
> consistent snapshot unless the client was able to keep pace with the
> guest.
> 
> 2) target flushes do not have to coincide with a source flush.  Writes
> after the last source flush _can_ be inconsistent between the source and
> the destination!  What matters is that all writes up to the last source
> flush are consistent.
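
In pseudo-Python, one iteration of the loop implied by (1) and (2) might look like this (Bitmap, source, target and their methods are assumptions, not the drive-mirror code itself):

    def mirror_iteration(bitmap, source, target):
        # Copy every sector that changed since it was last mirrored.
        for sector in bitmap.dirty_sectors():
            data = source.read(sector)     # read current source contents
            target.write(sector, data)     # forward to the replication target
            bitmap.clear(sector)
        # Only when the bitmap has drained does the destination hold a
        # consistent image of the source, so only then is it worth flushing.
        if bitmap.is_empty():
            target.flush()                 # replication server may snapshot here
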
> 
> Say the guest starts with (4 characters = 1 sector) "AAAA BBBB CCCC" on
> disk
> 
> and then the following happens
> 
>     guest           disk           dirty count        mirroring
>  -------------------------------------------------------------------
>                                        0
>     write 1 = XXXX                     1
>     FLUSH
>                     write 1 = XXXX
>                     dirty bitmap: sector 1 dirty
>     write 2 = YYYY                     2
>                                        1           copy sector 1 = XXXX
>                                        0           copy sector 2 = YYYY
>                                                    FLUSH
>                     dirty bitmap: all clean
>     write 0 = ZZZZ
>                     write 0 = ZZZZ
> 
> and then a power loss happens on the source.
> 
> The guest now has the dirty bitmap saying "all clean" even though the
> source now is "ZZZZ XXXX CCCC" and the destination "AAAA XXXX YYYY".
> However, this is not a problem because both are consistent with the last
> flush.
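
The same trace replayed with one string per sector (the lists below just re-state the example; write 2 = YYYY never reaches the source's disk because it was not flushed before the power loss):

    disk = ["AAAA", "BBBB", "CCCC"]   # what survives on the source
    dest = ["AAAA", "BBBB", "CCCC"]   # replication target

    disk[1] = "XXXX"                  # write 1 reaches the disk after the guest FLUSH
    dest[1] = "XXXX"                  # mirror copies sector 1
    dest[2] = "YYYY"                  # mirror copies sector 2, then flushes the target
    disk[0] = "ZZZZ"                  # write 0 reaches the disk just before power loss

    assert disk == ["ZZZZ", "XXXX", "CCCC"]
    assert dest == ["AAAA", "XXXX", "YYYY"]
    # Both agree on everything up to the last guest flush (sector 1 = XXXX).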

Under this design, and assuming an asynchronous implementation, is a block that
is written a few times in a row by the guest guaranteed to be received by the
continuous replication agent exactly the same number of times, without any of
the writes being overwritten?

Thanks,
Roni


