
Re: [Qemu-devel] [RFC PATCH] replication agent module


From: Orit Wasserman
Subject: Re: [Qemu-devel] [RFC PATCH] replication agent module
Date: Wed, 08 Feb 2012 14:29:13 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0

On 02/07/2012 04:45 PM, Ori Mamluk wrote:
> On 07/02/2012 15:34, Kevin Wolf wrote:
>> On 07.02.2012 11:29, Ori Mamluk wrote:
>>> Repagent is a new module that allows an external replication system to
>>> replicate a volume of a Qemu VM.
>>>
>>> This RFC patch adds the repagent client module to Qemu.
>>>
>>> Documentation of the module role and API is in the patch at
>>> replication/qemu-repagent.txt
>>>
>>> The main motivation behind the module is to allow replication of VMs in
>>> a virtualization environment like RhevM.
>>> To achieve this we need basic replication support in Qemu.
>>>
>>> This is the first submission of this module, which was written as a
>>> Proof Of Concept, and used successfully for replicating and recovering a
>>> Qemu VM.
>> I'll mostly ignore the code for now and just comment on the design.
> That's fine. The code was mainly for my understanding of the system.
>> One thing to consider for the next version of the RFC would be to split
>> this into a series of smaller patches. This one has become quite large,
>> which makes it hard to review (and yes, please use git send-email).
>>
>>> Points and open issues:
>>>
>>> * The module interfaces with the Qemu storage stack at the generic
>>>   block.c layer. Is this the right place to intercept/inject IOs?
>> There are two ways to intercept I/O requests. The first one is what you
>> chose, just add some code to bdrv_co_do_writev, and I think it's
>> reasonable to do this.
>>
>> The other one would be to add a special block driver for a replication:
>> protocol that writes to two different places (the real block driver for
>> the image, and the network connection). Generally this feels even a bit
>> more elegant, but it brings new problems with it: For example, when you
>> create an external snapshot, you need to pay attention not to lose the
>> replication because the protocol is somewhere in the middle of a backing
>> file chain.
> Yes. With this solution we'll have to somehow make sure that the replication 
> driver is closer to the guest than any driver which alters the IO.
> 
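A minimal sketch of the first approach, for illustration only (the
repagent_report_write() helper and the repagent_protected flag are
hypothetical names, not the actual patch code):

    static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
    {
        int ret;

        /* ... existing request checks and tracking ... */

        ret = bs->drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);

        /* Report the completed write and its status to the Rephub.
         * The report is asynchronous; the guest IO is not delayed. */
        if (bs->repagent_protected) {          /* hypothetical flag */
            repagent_report_write(bs, sector_num, nb_sectors, qiov, ret);
        }

        return ret;
    }
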
>>
>>> * The patch performs IO reads invoked by a new thread (a TCP listener
>>>   thread). See repaget_read_vol in repagent.c. It is not protected by
>>>   any lock – is this OK?
>> No, definitely not. Block layer code expects that it holds
>> qemu_global_mutex.
>>
>> I'm not sure if a thread is the right solution. You should probably use
>> something that resembles other asynchronous code in qemu, i.e. either
>> callback or coroutine based.
> I call bdrv_aio_readv - which in my understanding creates a coroutine, so
> my current solution is coroutine-based. Did I get something wrong?
> 
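For comparison, a callback-based version of such a read would look roughly
like this (the repagent names are illustrative; the completion callback
runs from the main loop, where qemu_global_mutex is held):

    static void repagent_read_cb(void *opaque, int ret)
    {
        RepagentReadReq *req = opaque;   /* illustrative request struct */

        /* Runs from the main loop, where qemu_global_mutex is held,
         * so block layer state is safe to touch; send the read volume
         * response back to the hub from here. */
        repagent_send_read_response(req, ret);
    }

    /* Submitting the read - no extra thread involved: */
    bdrv_aio_readv(bs, sector_num, &req->qiov, nb_sectors,
                   repagent_read_cb, req);
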
>>
>>> * VM ID – the replication system implies an environment with several
>>>   VMs connected to a central replication system (Rephub).
>>>   This requires some sort of identification for a VM. The current patch
>>>   does not include a VM ID – I did not find any adequate ID to use.
>> The replication hub already opened a connection to the VM, so it somehow
>> managed to know which VM this process represents, right?
> The current design has the server at the Rephub side, so the VM connects to 
> the Rephub, and not the other way around.
> The VM could be instructed to "enable protection" by a monitor command, and 
> then it connects to the 'known' Rephub.
>> The unique ID would be something like the PID of the VM or the file
>> descriptor of the communication channel to it.
> The PID might be useful - we'll later need to correlate it to the way Rhevm 
> identifies the machine, but not right now...
>>> diff --git a/Makefile b/Makefile
>>> index 4f6eaa4..a1b3701 100644
>>> --- a/Makefile
>>> +++ b/Makefile
>>> @@ -149,9 +149,9 @@ qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o cmd.o qemu-ga.o: $(GENERATED_HEADERS)
>>> tools-obj-y = qemu-tool.o $(oslib-obj-y) $(trace-obj-y) \
>>>                 qemu-timer-common.o cutils.o
>>> -qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
>>> -qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
>>> -qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
>>> +qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
>>> +qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
>>> +qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
>> $(replication-obj-y) should be included in $(block-obj-y) instead
>>
>>
>>> @@ -2733,6 +2739,7 @@ echo "curl support      $curl"
>>> echo "check support     $check_utests"
>>> echo "mingw32 support   $mingw32"
>>> echo "Audio drivers     $audio_drv_list"
>>> +echo "Replication          $replication"
>>> echo "Extra audio cards $audio_card_list"
>>> echo "Block whitelist   $block_drv_whitelist"
>>> echo "Mixer emulation   $mixemu"
>> Why do you add it in the middle rather than at the end?
> No reason, I'll change it.
>>
>>> diff --git a/replication/qemu-repagent.txt b/replication/qemu-repagent.txt
>>> new file mode 100755
>>> index 0000000..e3b0c1e
>>> --- /dev/null
>>> +++ b/replication/qemu-repagent.txt
>>> @@ -0,0 +1,104 @@
>>> +    repagent - replication agent - a Qemu module for enabling
>>> +    continuous async replication of VM volumes
>>> +
>>> +Introduction
>>> +    This document describes a feature in Qemu - a replication agent
>>> +    (AKA Repagent).
>>> +    The Repagent is a new module that exposes an API to an external
>>> +    replication system (AKA Rephub).
>>> +    This API allows a Rephub to communicate with a Qemu VM and
>>> +    continuously replicate its volumes.
>>> +    The implementation of a Rephub is outside the scope of this
>>> +    document. There may be several different Rephub implementations
>>> +    using the same repagent in Qemu.
>>> +
>>> +Main features of Repagent
>>> +    Repagent does the following:
>>> +    * Report volumes - report a list of all volumes in a VM to the
>>> +      Rephub.
>> Does the query-block QMP command give you what you need?
> I'll look into it.
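For reference, the query-block exchange looks roughly like this (fields
abbreviated and version-dependent; device name and path are made up):

    -> { "execute": "query-block" }
    <- { "return": [
           { "device": "virtio0",
             "locked": false,
             "removable": false,
             "inserted": { "file": "/images/pvm-v1.qcow2",
                           "drv": "qcow2", "ro": false } } ] }
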
>>> +    * Report writes to a volume - send all writes made to a
>>> +      protected volume to the Rephub.
>>> +      The reporting of an IO is asynchronous - i.e. the IO is not
>>> +      delayed by the Repagent to get any acknowledgement from the
>>> +      Rephub. It is only copied to the Rephub.
>>> +    * Read a protected volume - allows the Rephub to read a
>>> +      protected volume, to enable the Rephub to synchronize the
>>> +      content of a protected volume.
>> We were discussing using NBD as the protocol for any data that is
>> transferred from/to the replication hub, so that we can use the existing
>> NBD client and server code that qemu has. It seems you came to the
>> conclusion to use a different protocol? What are the reasons?
> Initially I thought there would have to be more functionality in the agent.
> Now it seems that you're right, and Stefan also pointed out something similar.
> Let me think about how I can get the same functionality with NBD (or iSCSI)
> server and client.
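As a rough sketch of that direction (paths and host names are made up),
qemu's existing NBD code can already export a volume and read it back,
which could replace the custom read-volume messages during sync:

    # On the VM host: export the volume read-only over NBD
    qemu-nbd --port 10809 --read-only /images/pvm-v1.qcow2

    # On the hub: perform the initial sync using qemu's NBD client
    qemu-img convert nbd:vmhost:10809 /recovery/v1r.raw
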
>>
>> The other message types could possibly be implemented as QMP commands. I
>> guess we might need to attach multiple QMP monitors for this to work
>> (one for libvirt, one for the rephub). I'm not sure if there is a
>> fundamental problem with this or if it just needs to be done.
>>> +
>>> +Description of the Repagent module
>>> +
>>> +Build and run options
>>> +    New configure option: --enable-replication
>>> +    New command line option:
>>> +    -repagent [hub IP/name]
>> You'll probably want a monitor command to enable this at runtime.
> Yep.
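No such command exists yet; a hypothetical QMP form of it might look like:

    -> { "execute": "repagent-start",
         "arguments": { "hub": "host2.example.com" } }
    <- { "return": {} }
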
>>> +        Enable replication support for disks.
>>> +        hub is the IP or name of the machine running the replication hub.
>>> +
>>> +Module APIs
>>> +    The Repagent module interfaces two main components:
>>> +    1. The Rephub - an external API based on socket messages
>>> +    2. The generic block layer - block.c
>>> +
>>> +    Rephub message API
>>> +        The external replication API is a message-based API.
>>> +        We won't go into the structure of the messages here - just
>>> +        the semantics.
>>> +
>>> +        Messages list
>>> +            (The updated list and comments are in Rephub_cmds.h)
>>> +
>>> +            Messages from the Repagent to the Rephub:
>>> +            * Protected write
>>> +                The Repagent sends each write to a protected volume
>>> +                to the hub with the IO status.
>>> +                In case the status is bad the write content is not sent.
>>> +            * Report VM volumes
>>> +                The agent reports all the volumes of the VM to the hub.
>>> +            * Read Volume Response
>>> +                A response to a Read Volume Request.
>>> +                Sends the data read from a protected volume to the hub.
>>> +            * Agent shutdown
>>> +                Notifies the hub that the agent is about to shut down.
>>> +                This allows a graceful shutdown. Any disconnection of
>>> +                an agent without sending this command will result in
>>> +                a full sync of the VM volumes.
>> What does "full sync" mean, what data is synced with which other place?
>> Is it bad when this happens just because the network is down for a
>> moment, but the VM actually keeps running?
> Full sync means reading the entire volume.
> It is bad when it happens because of a short network outage, but I think
> that it's a good 'intermediate' step to do so.
> We can first build a system which assumes that the connection between the
> agent and the Rephub is solid, and at a later stage add a bitmap mechanism
> in the agent that will optimize it - to overcome outages without a full sync.
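A sketch of such a bitmap (all names illustrative): writes mark a per-chunk
dirty bit while the hub is unreachable, and on reconnect only dirty chunks
are re-read and resent instead of the whole volume:

    #define REP_CHUNK_SECTORS 2048     /* 1 MB chunks of 512-byte sectors */
    #define REP_BITS_PER_LONG (8 * sizeof(unsigned long))

    /* Mark every chunk touched by a write as dirty. */
    static void rep_mark_dirty(unsigned long *bitmap,
                               int64_t sector_num, int nb_sectors)
    {
        int64_t chunk = sector_num / REP_CHUNK_SECTORS;
        int64_t end = (sector_num + nb_sectors - 1) / REP_CHUNK_SECTORS;

        for (; chunk <= end; chunk++) {
            bitmap[chunk / REP_BITS_PER_LONG] |=
                1UL << (chunk % REP_BITS_PER_LONG);
        }
    }
    /* On reconnect, only chunks with a set bit are re-read and resent. */
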
>>> +
>>> +            Messages from the Rephub to the Repagent:
>>> +            * Start protect
>>> +                The hub instructs the agent to start protecting a
>>> +                volume. When a volume is protected all its writes are
>>> +                sent to the hub.
>>> +                With this command the hub also assigns a volume ID to
>>> +                the given volume name.
>>> +            * Read volume request
>>> +                The hub issues a read IO to a protected volume.
>>> +                This command is used during sync - when the hub needs
>>> +                to read unsynchronized sections of a protected volume.
>>> +                This command is a request; the read data is returned
>>> +                by the read volume response message (see above).
>>> +
>>> +    block.c API
>>> +        The API to the generic block storage layer contains 3
>>> +        functionalities:
>>> +        1. Handle writes to protected volumes
>>> +            In bdrv_co_do_writev, each write is reported to the
>>> +            Repagent module.
>>> +        2. Handle each new volume that registers
>>> +            In bdrv_open - each new bottom-level block driver that
>>> +            registers is reported.
>> Could probably be a QMP event.
> OK
>>> +        3. Read from a volume
>>> +            Repagent calls bdrv_aio_readv to handle read requests
>>> +            coming from the hub.
>>> +
>>>
>>> +General description of a Rephub - a replication system the repagent
>>> +connects to
>>> +    This section describes at a high level a sample Rephub - a
>>> +    replication system that uses the repagent API to replicate disks.
>>> +    It describes a simple Rephub that continuously maintains a mirror
>>> +    of the volumes of a VM.
>>> +
>>> +    Say we have a VM we want to protect - call it PVM; say it has 2
>>> +    volumes - V1, V2.
>>> +    Our Rephub is called SingleRephub - a Rephub protecting a single VM.
>>> +
>>> +    Preparations
>>> +    1. The user chooses a host to run SingleRephub - a different host
>>> +       than PVM, call it Host2.
>>> +    2. The user creates two volumes on Host2 - the same sizes as V1
>>> +       and V2 - call them V1R (V1 recovery) and V2R.
>>> +    3. The user runs the SingleRephub process on Host2, and gives V1R
>>> +       and V2R as command line arguments.
>>> +       From now on SingleRephub waits for the protected VM repagent
>>> +       to connect.
>>> +    4. The user runs the protected VM PVM - and uses the switch
>>> +       -repagent <Host2 IP>.
>>> +
>>> +    Runtime
>>> +    1. The repagent module connects to SingleRephub on startup.
>>> +    2. repagent reports V1 and V2 to SingleRephub.
>>> +    3. SingleRephub starts to perform an initial synchronization of
>>> +       the protected volumes - it reads each protected volume (V1 and
>>> +       V2) - using read volume requests - and copies the data into
>>> +       the recovery volumes V1R and V2R.
>> Are you really going to do this on every start of the VM? Comparing the
>> whole content of an image will take quite some time.
> It is done when you first start protecting a volume, not each time a VM
> boots. A VM can reboot without needing a full sync.
>>
>>> +    4. SingleRephub enters 'protection' mode - each write to the
>>> +       protected volume is sent by the repagent to the Rephub, and
>>> +       the Rephub performs the write on the matching recovery volume.
>>> +
>>> +    * Note that during stage 3 writes to the protected volumes are
>>> +      not ignored - they're kept in a bitmap, and will be read again
>>> +      when stage 3 ends, in an iterative converging process.
>>> +
>>> +    This flow continuously maintains an updated recovery volume.
>>> +    If the protected system is damaged, the user can create a new VM
>>> +    on Host2 with the replicated volumes attached to it.
>>> +    The new VM is a replica of the protected system.
>> Have you meanwhile had the time to take a look at Kemari and check how
>> big the overlap is?
> No. What's Kemari? I'll look it up.

Kemari is a fault tolerance solution for KVM:
http://wiki.qemu.org/Features/FaultTolerance

It syncs the guest memory to a remote instance using the live migration
mechanism in QEMU.
As for the disk image, it assumes shared storage.

The similarity is that the synchronization is done when there is an IO event
(not only block IO but also network).
Kemari needs to trap the IO event and delay it until the sync is complete.

The code is based on an older QEMU version without coroutines.
I'm not sure how much it can help you.

Orit
>>
>> Kevin