From: Hongyang Yang
Subject: Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
Date: Fri, 9 Jan 2015 17:31:45 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0

Hi Paolo,

  It seems there are no more comments for now.
  We are going to implement COLO disk replication as you suggested. I have
added your comments to the design doc. Thank you!

On 12/27/2014 11:23 PM, Paolo Bonzini wrote:


On 26/12/2014 04:31, Yang Hongyang wrote:
Please feel free to comment.
We would like as many comments and as much feedback as possible. Thanks in advance.

Hi Yang,

I think it's possible to build COLO block replication from many basic
blocks that are already in QEMU.  The only new piece would be the disk
buffer on the secondary.

          virtio-blk       ||
              ^            ||                            .----------
              |            ||                            | Secondary
         1 Quorum          ||                            '----------
          /      \         ||
         /        \        ||
    Primary      2 NBD  ------->  2 NBD
      disk       client    ||     server                  virtio-blk
                           ||        ^                         ^
--------.                 ||        |                         |
Primary |                 ||  Secondary disk <--------- COLO buffer 3
--------'                 ||                   backing


1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM.  The read pattern patches for quorum
(http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
be used/extended to make the primary always read from the local disk
instead of going through NBD.

2) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

3) The disk on the secondary is represented by a custom block device
("COLO buffer").  The disk buffer's backing image is the secondary disk,
and the disk buffer uses bdrv_add_before_write_notifier to implement
copy-on-write, similar to block/backup.c.

4) Checkpointing can use new bdrv_prepare_checkpoint and
bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
similar to your patches (you did not explain why you do checkpointing in
two steps).  Failover instead is done with bdrv_commit or can even be

If we use NBD to send block requests, we don't need to checkpoint in two
steps, because NBD ensures that all block requests are sent to the secondary.
In our patches, pre_checkpoint waits until all requests have been received on
the secondary: the primary sends an END flag to the secondary once all
requests have been sent at a checkpoint, and the secondary waits until the
flag has been received on all disks before it calls do_checkpoint.

We have deleted the bdrv_pre_checkpoint interface from the design doc.
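
For illustration only, here is a minimal sketch of the END-flag wait
described above. It is not code from the patches: EndBarrier, NUM_DISKS,
end_barrier_reset and end_barrier_mark are hypothetical names, and the
sketch only models "wait for END on every disk before do_checkpoint".

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_DISKS 2   /* hypothetical: number of replicated disks */

/* Hypothetical per-checkpoint state on the secondary (not a QEMU type). */
typedef struct {
    bool end_seen[NUM_DISKS];  /* END flag already received on this disk? */
    int  pending;              /* disks still waiting for their END flag */
} EndBarrier;

/* Arm the barrier at the start of a checkpoint. */
static void end_barrier_reset(EndBarrier *b)
{
    for (int i = 0; i < NUM_DISKS; i++) {
        b->end_seen[i] = false;
    }
    b->pending = NUM_DISKS;
}

/*
 * Record the END flag for one disk.  Returns true once END has arrived on
 * every disk, i.e. when it would be safe to run the do_checkpoint step.
 */
static bool end_barrier_mark(EndBarrier *b, int disk)
{
    assert(disk >= 0 && disk < NUM_DISKS);
    if (!b->end_seen[disk]) {   /* a duplicate END is ignored */
        b->end_seen[disk] = true;
        b->pending--;
    }
    return b->pending == 0;
}
```

With NBD as the transport this extra wait is what becomes unnecessary,
which is why a single checkpoint step is enough.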

--
Thanks,
Yang.


From ef8a236d6fdcc88559cd9ce926173ef6eff74f77 Mon Sep 17 00:00:00 2001
From: Yang Hongyang <address@hidden>
Date: Thu, 25 Dec 2014 13:33:00 +0800
Subject: [POC v2] Block: Block replication design for COLO

This is the initial design of block replication.
The blkcolo block driver enables disk replication for continuous
checkpoints. It is designed for COLO, where the Secondary VM is running.
It can also be applied to FT/HA scenarios where the Secondary VM is not
running.

Signed-off-by: Wen Congyang <address@hidden>
Signed-off-by: Lai Jiangshan <address@hidden>
Signed-off-by: Paolo Bonzini <address@hidden>
Signed-off-by: Yang Hongyang <address@hidden>
---
 docs/blkcolo.txt | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 134 insertions(+)
 create mode 100644 docs/blkcolo.txt

diff --git a/docs/blkcolo.txt b/docs/blkcolo.txt
new file mode 100644
index 0000000..3021928
--- /dev/null
+++ b/docs/blkcolo.txt
@@ -0,0 +1,134 @@
+Disk replication using blkcolo
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+The blkcolo block driver enables disk replication for continuous checkpoints.
+It is designed for COLO, where the Secondary VM is running. It can also be
+applied to FT/HA scenarios where the Secondary VM is not running.
+
+This document gives an overview of blkcolo's design.
+
+== Background ==
+High availability solutions such as micro checkpointing and COLO take
+consecutive checkpoints. The VM state of the Primary VM and the Secondary VM
+is identical right after a VM checkpoint, but diverges as the VMs execute
+until the next checkpoint. To support checkpointing of disk contents, the
+modified disk contents in the Secondary VM must be buffered, and are only
+dropped at the next checkpoint. To reduce the network traffic at checkpoint
+time, the write operations on the Primary disk are asynchronously forwarded
+to the Secondary node.
+
+== Disk Buffer ==
+The following diagram shows the Disk buffer:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be buffered in the Disk buffer and
+       will overwrite the existing sector content in the buffer.
+
+== Implementation ==
+
+We are going to implement COLO block replication from many basic
+blocks that are already in QEMU.  The only new piece would be the disk
+buffer on the secondary.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary      2 NBD  ------->  2 NBD
+     disk       client    ||     server                  virtio-blk
+                          ||        ^                         ^
+--------.                 ||        |                         |
+Primary |                 ||  Secondary disk <--------- COLO buffer 3
+--------'                 ||                   backing
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM.  The read pattern patches for quorum
+(http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
+be used/extended to make the primary always read from the local disk
+instead of going through NBD.
+
+2) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+3) The disk on the secondary is represented by a custom block device
+("COLO buffer").  The disk buffer's backing image is the secondary disk,
+and the disk buffer uses bdrv_add_before_write_notifier to implement
+copy-on-write, similar to block/backup.c.
+
+4) Checkpointing can use the bdrv_do_checkpoint interface in BlockDriver to
+discard the COLO buffer. Failover is instead done with bdrv_commit, or it
+can be done without stopping the secondary (live commit, block/commit.c).
+
+
+The missing parts are:
+
+1) NBD server on the backing image of the COLO buffer.  This means the
+backing image needs its own BlockBackend.  Apart from this, no new
+infrastructure is needed to receive writes on the secondary.
+
+2) Read pattern support for quorum needs to be extended for the needs of
+the COLO primary.  It may be simpler or faster to write a simple
+"replication" driver that writes to N children but always reads from the
+first.  But in any case initial tests can be done with the quorum
+driver, even without read pattern support.
+
+3) The disk buffer itself.
+
+== Checkpoint & failover ==
+blkcolo buffers the write requests in the Secondary QEMU. The buffer should
+be dropped at a checkpoint, or flushed to the Secondary disk on failover.
+We add four block driver interfaces to do this:
+a. bdrv_start_replication()
+   Start replication; called in the migration/checkpoint thread.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state has been transferred to
+   the Secondary QEMU. The Disk buffer is dropped in this interface.
+c. bdrv_get_sent_data_size()
+   This is used on the Primary node.
+   It should be called by the migration/checkpoint thread in order
+   to decide whether to start a new checkpoint. If the amount of data
+   sent is too large, we should start a new checkpoint.
+d. bdrv_stop_replication()
+   It is called on failover. We flush the Disk buffer into the
+   Secondary Disk and stop disk replication.
+
+== Usage ==
+Primary:
+  1. NBD Client should not be the first child of quorum.
+  2. There should be only one NBD Client.
+
+Secondary:
+  -drive if=xxx,driver=colo,export=xxx,\
+         backing.file.filename=1.raw,\
+         backing.driver=raw
--
1.9.1
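
As a reading aid for the Disk Buffer rules 1) to 4) in docs/blkcolo.txt
above, the following toy model may help. It is a sketch under stated
assumptions, not the blkcolo implementation: ToyReplica and every function
name are hypothetical, sectors are 4 bytes, and the read path (buffer first,
then the Secondary disk) is assumed from the "backing image" arrangement
rather than stated in the doc.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SECTOR_SIZE 4   /* tiny on purpose, illustration only */
#define NUM_SECTORS 8

/* Hypothetical toy model of the secondary side (not a QEMU structure). */
typedef struct {
    unsigned char disk[NUM_SECTORS][SECTOR_SIZE]; /* Secondary disk */
    unsigned char buf[NUM_SECTORS][SECTOR_SIZE];  /* Disk buffer */
    bool buffered[NUM_SECTORS];                   /* sector present in buffer */
} ToyReplica;

/* Rules 2 and 3: a forwarded Primary write. */
static void primary_write(ToyReplica *r, int s, const unsigned char *data)
{
    if (!r->buffered[s]) {
        /* Save the original content once; never overwrite the buffer. */
        memcpy(r->buf[s], r->disk[s], SECTOR_SIZE);
        r->buffered[s] = true;
    }
    memcpy(r->disk[s], data, SECTOR_SIZE);  /* speculative write-through */
}

/* Rule 4: a Secondary write only touches the buffer and overwrites it. */
static void secondary_write(ToyReplica *r, int s, const unsigned char *data)
{
    memcpy(r->buf[s], data, SECTOR_SIZE);
    r->buffered[s] = true;
}

/* Assumed read path of the Secondary VM: buffer first, then the disk. */
static const unsigned char *secondary_read(const ToyReplica *r, int s)
{
    return r->buffered[s] ? r->buf[s] : r->disk[s];
}

/* Checkpoint: drop the buffer, so the Secondary sees the Primary's data. */
static void checkpoint(ToyReplica *r)
{
    memset(r->buffered, 0, sizeof(r->buffered));
}

/* Failover: flush the buffer into the Secondary disk, then drop it. */
static void failover(ToyReplica *r)
{
    for (int s = 0; s < NUM_SECTORS; s++) {
        if (r->buffered[s]) {
            memcpy(r->disk[s], r->buf[s], SECTOR_SIZE);
        }
    }
    checkpoint(r);
}
```

The asymmetry between the two write paths is the point: Primary writes
preserve what the Secondary VM could already see, while Secondary writes
stay private to the buffer until a checkpoint discards them or a failover
commits them.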




