Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description


From: Wen Congyang
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Date: Tue, 21 Apr 2015 09:25:59 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang <address@hidden>
>> Signed-off-by: Paolo Bonzini <address@hidden>
>> Signed-off-by: Yang Hongyang <address@hidden>
>> Signed-off-by: zhanghailiang <address@hidden>
>> Signed-off-by: Gonglei <address@hidden>
>> ---
>>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 153 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 0000000..4426ffc
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,153 @@
>> +Block replication
>> +----------------------------------------
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +Block replication is used for continuous checkpoints. It is designed
>> +for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
>> +It can also be applied to FT/HA (Fault Tolerance/High Availability)
>> +scenarios, where the Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +    1) Primary write requests will be copied and forwarded to Secondary
>> +       QEMU.
>> +    2) Before Primary write requests are written to Secondary disk, the
>> +       original sector content will be read from Secondary disk and
>> +       buffered in the Disk buffer, but it will not overwrite the existing
>> +       sector content (which could be from either "Secondary Write
>> +       Requests" or a previous COW of "Primary Write Requests") in the
>> +       Disk buffer.
>> +    3) Primary write requests will be written to Secondary disk.
>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>> +       will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server                                         virtio-blk
>> +                          ||        ^                                                ^
>> +--------.                 ||        |                                                |
>> +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
>> +--------'                 ||        |          backing        ^       backing
>> +                          ||        |                         |
>> +                          ||        |                         |
>> +                          ||        '-------------------------'
>> +                          ||           drive-backup sync=none
> 
> Nice to see that you've been able to construct the replication flow from
> existing block layer features!
> 
>> +1) The disk on the primary is represented by a block device with two
>> +children, providing replication between a primary disk and the host that
>> +runs the secondary VM. The read pattern for quorum can be extended to
>> +make the primary always read from the local disk instead of going through
>> +NBD.
>> +
>> +2) The secondary disk receives writes from the primary VM through QEMU's
>> +embedded NBD server (speculative write-through).
>> +
>> +3) The disk on the secondary is represented by a custom block device
>> +(called active-disk). It should be an empty disk, and the format should
>> +be qcow2.
>> +
>> +4) The hidden-disk is created automatically. It buffers the original content
>> +that is modified by the primary VM. It should also be an empty disk, and
>> +its driver must support bdrv_make_empty().
>> +
>> +== New block driver interface ==
>> +We add three block driver interfaces to control block replication:
>> +a. bdrv_start_replication()
>> +   Start block replication, called in migration/checkpoint thread.
>> +   We must call bdrv_start_replication() in secondary QEMU before
>> +   calling bdrv_start_replication() in primary QEMU.
>> +b. bdrv_do_checkpoint()
>> +   This interface is called after all VM state is transferred to
>> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
>> +   The caller must hold the I/O mutex lock if it is in migration/checkpoint
>> +   thread.
>> +c. bdrv_stop_replication()
>> +   It is called on failover. We will flush the Disk buffer into
>> +   Secondary Disk and stop block replication. The vm should be stopped
>> +   before calling it. The caller must hold the I/O mutex lock if it is
>> +   in migration/checkpoint thread.
> 
> I understand the general flow but this description does not demonstrate
> that failover works or what happens when internal operations fail (e.g.
> during checkpoint commit or during failover).  Since fault tolerance is
> the goal, it is necessary to list the failure scenarios explicitly and
> show that the design handles them.  Without that level of planning, some
> cases will probably be missed in the code and the system won't actually
> be fault tolerant.

OK, I will add a description of failover.
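
To make the cases easier to enumerate, the buffering rules from the Workflow
section and the three interfaces can be modelled in a few lines of Python.
This is only an illustrative sketch, not the QEMU code: the class and method
names are made up, sectors are dictionary entries, and the Disk buffer stands
for the active-disk and hidden-disk together.

```python
class ReplicationModel:
    """Toy model of the Secondary side; every name here is hypothetical."""

    def __init__(self, secondary_disk):
        self.secondary_disk = dict(secondary_disk)  # sector -> data
        self.disk_buffer = {}                       # sector -> data

    def primary_write(self, sector, data):
        # Step 2: save the original content first, but never overwrite
        # content that is already in the Disk buffer.
        if sector not in self.disk_buffer:
            self.disk_buffer[sector] = self.secondary_disk.get(sector, b"")
        # Step 3: speculative write-through to the Secondary disk.
        self.secondary_disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes always overwrite the buffered content.
        self.disk_buffer[sector] = data

    def secondary_read(self, sector):
        # The Secondary VM sees the buffer first, then the Secondary disk.
        if sector in self.disk_buffer:
            return self.disk_buffer[sector]
        return self.secondary_disk.get(sector, b"")

    def do_checkpoint(self):
        # Models bdrv_do_checkpoint(): the Disk buffer is dropped.
        self.disk_buffer.clear()

    def stop_replication(self):
        # Models bdrv_stop_replication() on failover: the Disk buffer is
        # flushed into the Secondary disk.
        self.secondary_disk.update(self.disk_buffer)
        self.disk_buffer.clear()
```

In this model a checkpoint makes the Secondary VM see exactly what the Primary
wrote, while failover keeps the Secondary VM's own view by flushing the buffer
into the Secondary disk. What happens when one of these steps fails halfway is
exactly what the model does not cover.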

> 
> One general question about the design: the Secondary host needs 3x
> storage space since it has the Secondary Disk, hidden-disk, and
> active-disk.  Each image requires a certain amount of space depending on
> writes or COW operations.  Is 3x the upper bound or is there a way to
> reduce the bound?

The active disk and the hidden disk are temporary files. They are emptied in
bdrv_do_checkpoint(). Their format is qcow2 for now, so they do not need much
space as long as we do checkpoints periodically.
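
As a rough illustration of that bound, here is a back-of-the-envelope estimate
in Python. It assumes that the hidden disk only holds original content for
data the primary overwrote since the last checkpoint, and that the active disk
only holds data the secondary wrote since the last checkpoint. The function
name and the write-rate parameters are hypothetical, and qcow2 metadata and
cluster granularity are ignored.

```python
def worst_case_cache_bytes(disk_bytes, primary_write_rate,
                           secondary_write_rate, checkpoint_interval):
    """Upper bound on (hidden disk, active disk) size between checkpoints.

    Rates are in bytes per second, the interval is in seconds. Neither
    cache image can grow beyond the size of the disk itself.
    """
    hidden = min(disk_bytes, primary_write_rate * checkpoint_interval)
    active = min(disk_bytes, secondary_write_rate * checkpoint_interval)
    return hidden, active


MiB = 1024 * 1024
GiB = 1024 * MiB

# Example: 10 GiB disk, checkpoint every 2 seconds.
hidden, active = worst_case_cache_bytes(10 * GiB, 50 * MiB, 20 * MiB, 2)
print(hidden // MiB, active // MiB)
```

So 3x remains the upper bound when no checkpoint happens for a long time, but
with periodic checkpoints the extra space is limited by how much is written
per checkpoint interval.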

> 
> The bound is important since large amounts of data become a bottleneck
> for writeout/commit operations.  They could cause downtime if the guest
> is blocked until the entire Disk Buffer has been written to the
> Secondary Disk during failover, for example.

OK, I will test it. In my test, vm_stop() takes about 2-3 seconds when
I run filebench in the guest. Is there any way to speed it up?

> 
>> +== Usage ==
>> +Primary:
>> +  -drive if=xxx,driver=quorum,read-pattern=fifo,\
>> +         children.0.file.filename=1.raw,\
>> +         children.0.driver=raw,\
>> +         children.1.file.driver=nbd+colo,\
>> +         children.1.file.host=xxx,\
>> +         children.1.file.port=xxx,\
>> +         children.1.file.export=xxx,\
>> +         children.1.driver=raw,\
>> +         children.1.ignore-errors=on
>> +  Note:
>> +  1. NBD Client should not be the first child of quorum.
>> +  2. There should be only one NBD Client.
>> +  3. host is the secondary physical machine's hostname or IP
>> +  4. Each disk must have its own export name.
>> +  5. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
>> +
>> +Secondary:
>> +  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
>> +  -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\
>> +         backing_reference.drive_id=nbd_target1,\
>> +         backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
>> +         backing_reference.hidden-disk.driver=qcow2,\
>> +         backing_reference.hidden-disk.allow-write-backing-file=on
>> +  Then run qmp command:
>> +    nbd_server_start host:port
>> +  Note:
>> +  1. The export name for the same disk must be the same in primary
>> +     and secondary QEMU command line
>> +  2. The qmp command nbd-server-start must be run before running the
>> +     qmp command migrate on primary QEMU
>> +  3. Don't use nbd-server-start's other options
>> +  4. Active disk, hidden disk and nbd target's length should be the
>> +     same.
>> +  5. It is better to put active disk and hidden disk in ramdisk.
>> +  6. It is all a single argument to -drive, and you should ignore
>> +     the leading whitespace.
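
The notes above say that each multi-line -drive example is really a single
argument. A small helper along these lines (hypothetical, not part of QEMU)
shows what joining the lines means. The option string is the Primary example
from this patch, with its xxx placeholders left as they are.

```python
PRIMARY_EXAMPLE = r"""
  if=xxx,driver=quorum,read-pattern=fifo,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw,\
         children.1.file.driver=nbd+colo,\
         children.1.file.host=xxx,\
         children.1.file.port=xxx,\
         children.1.file.export=xxx,\
         children.1.driver=raw,\
         children.1.ignore-errors=on
"""


def join_drive_option(text):
    """Join a backslash-continued option into one -drive argument."""
    parts = []
    for line in text.splitlines():
        line = line.strip()              # ignore the leading whitespace
        if line.endswith("\\"):
            line = line[:-1].rstrip()    # drop the line continuation
        if line:
            parts.append(line)
    return "".join(parts)


print("-drive", join_drive_option(PRIMARY_EXAMPLE))
```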
> 
> Please do not introduce "<name>+colo" block drivers.  This approach is
> invasive and makes block replication specific to only a few block
> drivers, e.g. NBD or qcow2.

NBD is used to connect to the secondary QEMU, so it must be used. But the primary
QEMU uses quorum, so the primary disk can be in any format.
The secondary disk is the NBD target, and it can also be in any format. The cache
disks (active disk and hidden disk) are empty disks that are created before
running COLO. Their format is qcow2 for now. In theory, it can be any format that
supports backing files, but the driver would have to be updated to support COLO mode.

> 
> A cleaner approach is a QMP command or -drive options that work for any
> BlockDriverState.

OK, I will add a new drive option to avoid using "<name>+colo".

Thanks
Wen Congyang

> 
> Stefan
> 