Re: [Qemu-devel] [RFC PATCH] replication agent module


From: Ori Mamluk
Subject: Re: [Qemu-devel] [RFC PATCH] replication agent module
Date: Tue, 07 Feb 2012 16:45:08 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20111222 Thunderbird/9.0.1

On 07/02/2012 15:34, Kevin Wolf wrote:
On 07.02.2012 11:29, Ori Mamluk wrote:
Repagent is a new module that allows an external replication system to
replicate a volume of a Qemu VM.

This RFC patch adds the repagent client module to Qemu.

Documentation of the module role and API is in the patch at
replication/qemu-repagent.txt

The main motivation behind the module is to allow replication of VMs in
a virtualization environment like RhevM.
To achieve this we need basic replication support in Qemu.

This is the first submission of this module, which was written as a
Proof Of Concept, and used successfully for replicating and recovering a
Qemu VM.
I'll mostly ignore the code for now and just comment on the design.
That's fine. The code was mainly for my understanding of the system.
One thing to consider for the next version of the RFC would be to split
this into a series of smaller patches. This one has become quite large,
which makes it hard to review (and yes, please use git send-email).

Points and open issues:

*             The module interfaces with the Qemu storage stack at the
generic block.c layer. Is this the right place to intercept/inject IOs?
There are two ways to intercept I/O requests. The first one is what you
chose, just add some code to bdrv_co_do_writev, and I think it's
reasonable to do this.

The other one would be to add a special block driver for a replication:
protocol that writes to two different places (the real block driver for
the image, and the network connection). Generally this feels even a bit
more elegant, but it brings new problems with it: For example, when you
create an external snapshot, you need to pay attention not to lose the
replication because the protocol is somewhere in the middle of a backing
file chain.
Yes. With this solution we'll have to somehow ensure that the replication driver sits closer to the guest than any driver that alters the IO.
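For reference, a minimal sketch of the first approach - a hook called from bdrv_co_do_writev after the driver write completes. All names here (RepagentWriteMsg, repagent_send, repagent_volume_id) are hypothetical, not the patch's actual code:

    /* Sketch: mirror a completed guest write to the Rephub.
     * Fire-and-forget: the guest IO has already completed and is not
     * delayed waiting for any acknowledgement from the hub. */
    typedef struct RepagentWriteMsg {
        uint32_t volume_id;
        int64_t  sector_num;
        int      nb_sectors;
        int      status;            /* result of the driver write */
    } RepagentWriteMsg;

    static void repagent_notify_write(BlockDriverState *bs, int64_t sector_num,
                                      int nb_sectors, QEMUIOVector *qiov, int ret)
    {
        RepagentWriteMsg msg = {
            .volume_id  = repagent_volume_id(bs),  /* hypothetical lookup */
            .sector_num = sector_num,
            .nb_sectors = nb_sectors,
            .status     = ret,
        };

        /* Per the design doc: if the status is bad, send no payload. */
        repagent_send(&msg, ret < 0 ? NULL : qiov);
    }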


*             The patch performs IO reads invoked from a new thread (a
TCP listener thread). See repagent_read_vol in repagent.c. It is not
protected by any lock – is this OK?
No, definitely not. Block layer code expects that it holds
qemu_global_mutex.

I'm not sure if a thread is the right solution. You should probably use
something that resembles other asynchronous code in qemu, i.e. either
callback or coroutine based.
I call bdrv_aio_readv - which in my understanding creates a coroutine, so my current solution is coroutine-based. Did I get something wrong?
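If the listener thread stays, the minimal fix would be to take the global lock around the block layer call; a sketch (repagent_read_done and the request plumbing are hypothetical):

    /* Completion callback: runs in the iothread under qemu_global_mutex,
     * so it is safe to touch block layer state; forward the buffer (or
     * the error) to the hub from here. */
    static void repagent_read_done(void *opaque, int ret)
    {
    }

    /* Called from the TCP listener thread. */
    static void repagent_read_vol(BlockDriverState *bs, int64_t sector_num,
                                  int nb_sectors, QEMUIOVector *qiov)
    {
        /* Block layer code expects qemu_global_mutex to be held. */
        qemu_mutex_lock_iothread();
        bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                       repagent_read_done, NULL);
        qemu_mutex_unlock_iothread();
    }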


*             VM ID – the replication system implies an environment with
several VMs connected to a central replication system (Rephub). This
requires some sort of identification for a VM. The current patch does
not include a VM ID – I did not find any adequate ID to use.
The replication hub already opened a connection to the VM, so it somehow
managed to know which VM this process represents, right?
The current design has the server at the Rephub side, so the VM connects to the Rephub, and not the other way around. The VM could be instructed to "enable protection" by a monitor command, and then it connects to the 'known' Rephub.
The unique ID would be something like the PID of the VM or the file
descriptor of the communication channel to it.
The PID might be useful - we'll later need to correlate it to the way RhevM identifies the machine, but not right now...
diff --git a/Makefile b/Makefile
index 4f6eaa4..a1b3701 100644
--- a/Makefile
+++ b/Makefile
@@ -149,9 +149,9 @@ qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o cmd.o qemu-ga.o: $(GENERATED_HEADERS)
 tools-obj-y = qemu-tool.o $(oslib-obj-y) $(trace-obj-y) \
 		qemu-timer-common.o cutils.o
 
-qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
-qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
-qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
+qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
+qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
+qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
$(replication-obj-y) should be included in $(block-obj-y) instead
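I.e. something like this where the block objects are defined (a sketch; the exact placement depends on where replication-obj-y ends up being set):

    block-obj-y += $(replication-obj-y)

and the three link rules above stay unchanged.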


@@ -2733,6 +2739,7 @@ echo "curl support      $curl"
 echo "check support     $check_utests"
 echo "mingw32 support   $mingw32"
 echo "Audio drivers     $audio_drv_list"
+echo "Replication          $replication"
 echo "Extra audio cards $audio_card_list"
 echo "Block whitelist   $block_drv_whitelist"
 echo "Mixer emulation   $mixemu"
Why do you add it in the middle rather than at the end?
No reason, I'll change it.

diff --git a/replication/qemu-repagent.txt b/replication/qemu-repagent.txt
new file mode 100755
index 0000000..e3b0c1e
--- /dev/null
+++ b/replication/qemu-repagent.txt
@@ -0,0 +1,104 @@

+             repagent - replication agent - a Qemu module for enabling continuous async replication of VM volumes
+
+Introduction
+             This document describes a feature in Qemu - a replication agent (AKA Repagent).
+             The Repagent is a new module that exposes an API to an external replication system (AKA Rephub).
+             This API allows a Rephub to communicate with a Qemu VM and continuously replicate its volumes.
+             The implementation of a Rephub is outside the scope of this document. There may be several different Rephub
+             implementations using the same repagent in Qemu.
+
+Main feature of Repagent
+             Repagent does the following:
+             * Report volumes - report a list of all volumes in a VM to the Rephub.
Does the query-block QMP command give you what you need?
I'll look into it.
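For reference, query-block already reports the attached block devices, e.g. (output abbreviated; device name and path made up):

    -> { "execute": "query-block" }
    <- { "return": [
           { "device": "virtio0", "locked": false, "removable": false,
             "inserted": { "file": "/images/pvm-v1.qcow2", "ro": false,
                           "drv": "qcow2", "encrypted": false } } ] }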
+             * Report writes to a volume - send all writes made to a protected volume to the Rephub.
+                             The reporting of an IO is asynchronous - i.e. the IO is not delayed by the Repagent to get any acknowledgement from the Rephub.
+                             It is only copied to the Rephub.
+             * Read a protected volume - allows the Rephub to read a protected volume, to enable the hub to synchronize the content of a protected volume.
We were discussing using NBD as the protocol for any data that is
transferred from/to the replication hub, so that we can use the existing
NBD client and server code that qemu has. It seems you came to the
conclusion to use a different protocol? What are the reasons?
Initially I thought there would have to be more functionality in the agent.
Now it seems that you're right, and Stefan also pointed out something similar. Let me think about how I can get the same functionality with an NBD (or iSCSI) server and client.

The other message types could possibly be implemented as QMP commands. I
guess we might need to attach multiple QMP monitors for this to work
(one for libvirt, one for the rephub). I'm not sure if there is a
fundamental problem with this or if it just needs to be done.
+
+Description of the Repagent module
+
+Build and run options
+             New configure option: --enable-replication
+             New command line option:
+             -repagent [hub IP/name]
You'll probably want a monitor command to enable this at runtime.
Yep.
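Something along these lines in hmp-commands.hx, perhaps (command and handler names are placeholders):

    {
        .name       = "repagent_connect",
        .args_type  = "hub:s",
        .params     = "hub",
        .help       = "connect to a replication hub and start protection",
        .mhandler.cmd = hmp_repagent_connect,
    },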
+                             Enable replication support for disks
+                             hub is the IP or name of the machine running the replication hub.
+

+Module APIs
+             The Repagent module interfaces with two main components:
+             1. The Rephub - an external API based on socket messages
+             2. The generic block layer - block.c
+
+             Rephub message API
+                             The external replication API is a message-based API.
+                             We won't go into the structure of the messages here - just the semantics.
+
+                             Messages list
+                                             (The updated list and comments are in Rephub_cmds.h)

+
+                                             Messages from the Repagent to the Rephub:
+                                             * Protected write
+                                                             The Repagent sends each write to a protected volume to the hub with the IO status.
+                                                             If the status is bad, the write content is not sent.
+                                             * Report VM volumes
+                                                             The agent reports all the volumes of the VM to the hub.
+                                             * Read Volume Response
+                                                             A response to a Read Volume Request.
+                                                             Sends the data read from a protected volume to the hub.
+                                             * Agent shutdown
+                                                             Notifies the hub that the agent is about to shut down.
+                                                             This allows a graceful shutdown. Any disconnection of an agent without
+                                                             sending this command will result in a full sync of the VM volumes.
What does "full sync" mean, what data is synced with which other place?
Is it bad when this happens just because the network is down for a
moment, but the VM actually keeps running?
Full sync means reading the entire volume.
It is bad when it happens because of a short network outage, but I think it's a good intermediate step. We can first build a system which assumes that the connection between the agent and the Rephub is solid, and at a later stage add a bitmap mechanism in the agent that overcomes outages without a full sync.
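A rough sketch of that bitmap mechanism (all names hypothetical; chunk size arbitrary):

    /* Sketch: remember which regions were written while the hub link is
     * down, so only those regions need re-reading after a reconnect
     * instead of a full sync. */
    #define CHUNK_SECTORS 128                /* 64 KB chunks, arbitrary */

    static unsigned long *dirty_bitmap;      /* one bit per chunk */

    static void repagent_mark_dirty(int64_t sector_num, int nb_sectors)
    {
        int64_t chunk = sector_num / CHUNK_SECTORS;
        int64_t end   = (sector_num + nb_sectors - 1) / CHUNK_SECTORS;

        for (; chunk <= end; chunk++) {
            dirty_bitmap[chunk / BITS_PER_LONG] |=
                1UL << (chunk % BITS_PER_LONG);
        }
    }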
+
+                                             Messages from the Rephub to the Repagent:
+                                             * Start protect
+                                                             The hub instructs the agent to start protecting a volume. When a volume is protected,
+                                                             all its writes are sent to the hub.
+                                                             With this command the hub also assigns a volume ID to the given volume name.
+                                             * Read volume request
+                                                             The hub issues a read IO to a protected volume.
+                                                             This command is used during sync - when the hub needs to read unsynchronized
+                                                             sections of a protected volume.
+                                                             This command is a request; the read data is returned by the read volume response message (see above).
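Purely for illustration - the real definitions live in Rephub_cmds.h, which isn't quoted here - a common header for these messages could look like:

    /* Hypothetical wire header; not the actual Rephub_cmds.h contents. */
    typedef enum RephubMsgType {
        MSG_PROTECTED_WRITE,        /* agent -> hub */
        MSG_REPORT_VM_VOLUMES,      /* agent -> hub */
        MSG_READ_VOLUME_RES,        /* agent -> hub */
        MSG_AGENT_SHUTDOWN,         /* agent -> hub */
        MSG_START_PROTECT,          /* hub -> agent */
        MSG_READ_VOLUME_REQ,        /* hub -> agent */
    } RephubMsgType;

    typedef struct RephubMsgHeader {
        uint32_t type;              /* RephubMsgType */
        uint32_t volume_id;         /* assigned by Start protect */
        uint64_t offset;            /* byte offset into the volume */
        uint32_t len;               /* payload length in bytes */
        int32_t  status;            /* IO status, for Protected write */
    } RephubMsgHeader;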

+             block.c API
+                             The API to the generic block storage layer contains 3 functionalities:
+                             1. Handle writes to protected volumes
+                                             In bdrv_co_do_writev, each write is reported to the Repagent module.
+                             2. Handle each new volume that registers
+                                             In bdrv_open - each new bottom-level block driver that registers is reported.
Could probably be a QMP event.
OK
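On the wire such an event might look like this (event name and fields invented for illustration; no such QMP event exists yet):

    { "event": "REPAGENT_VOLUME_ADDED",
      "data": { "device": "virtio0", "size": 10737418240 },
      "timestamp": { "seconds": 1328625908, "microseconds": 0 } }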
+                             3. Read from a volume
+                                             Repagent calls bdrv_aio_readv to handle read requests coming from the hub.
+

+General description of a Rephub - a replication system the repagent connects to
+             This section describes at a high level a sample Rephub - a replication system that uses the repagent API
+             to replicate disks.
+             It describes a simple Rephub that continuously maintains a mirror of the volumes of a VM.
+
+             Say we have a VM we want to protect - call it PVM; say it has 2 volumes - V1, V2.
+             Our Rephub is called SingleRephub - a Rephub protecting a single VM.
+
+             Preparations
+             1. The user chooses a host to run SingleRephub - a different host than PVM, call it Host2.
+             2. The user creates two volumes on Host2 - the same sizes as V1 and V2, call them V1R (V1 recovery) and V2R.
+             3. The user runs the SingleRephub process on Host2, and gives V1R and V2R as command line arguments.
+                             From now on SingleRephub waits for the protected VM's repagent to connect.
+             4. The user runs the protected VM PVM - and uses the switch -repagent <Host2 IP>.
+
+             Runtime
+             1. The repagent module connects to SingleRephub on startup.
+             2. repagent reports V1 and V2 to SingleRephub.
+             3. SingleRephub starts to perform an initial synchronization of the protected volumes -
+                             it reads each protected volume (V1 and V2) - using read volume requests - and copies the data into the
+                             recovery volumes V1R and V2R.
Are you really going to do this on every start of the VM? Comparing the
whole content of an image will take quite some time.
It is done when you first start protecting a volume, not each time a VM boots. A VM can reboot without needing a full sync.

+             4. SingleRephub enters 'protection' mode - each write to the protected volume is sent by the repagent to the Rephub,
+                             and the Rephub performs the write on the matching recovery volume.
+
+             * Note that during stage 3 writes to the protected volumes are not ignored - they're kept in a bitmap,
+                             and will be read again when stage 3 ends, in an iterative converging process.
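A sketch of that converging pass, reusing the chunk bitmap idea from above (helper names hypothetical):

    /* Sketch: after the initial copy, re-read chunks that were dirtied
     * while it ran; writes that land during a pass set their bit again,
     * so each pass shrinks the dirty set until it converges. */
    static void resync_converge(void)
    {
        while (!bitmap_is_empty(dirty_bitmap)) {          /* hypothetical */
            int64_t chunk = bitmap_pop_first(dirty_bitmap);

            read_and_replicate_chunk(chunk);              /* hypothetical */
        }
    }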

+

+             This flow continuously maintains an updated recovery volume.
+             If the protected system is damaged, the user can create a new VM on Host2 with the replicated volumes attached to it.
+             The new VM is a replica of the protected system.
Have you meanwhile had the time to take a look at Kemari and check how
big the overlap is?
No. What's Kemari? I'll look it up.

Kevin



