Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
From: Anthony Liguori
Date: Tue, 01 Mar 2011 10:51:42 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101027 Lightning/1.0b1 Thunderbird/3.0.10
On 03/01/2011 04:39 AM, Avi Kivity wrote:
On 02/28/2011 08:12 PM, Anthony Liguori wrote:
On Feb 28, 2011 11:47 AM, "Avi Kivity" <address@hidden> wrote:
>
> On 02/28/2011 07:33 PM, Anthony Liguori wrote:
>>
>>
>> >
>> > You're just ignoring what I've written.
>>
>> No, you're just impervious to my subtle attempt to refocus the
>> discussion on solving a practical problem.
>>
>> There's a lot of good, reasonably straightforward changes we can
>> make that have a high return on investment.
>>
>
> Is making qemu the authoritative source of configuration
> information a straightforward change? Is the return on it high? Is
> the investment low?
I think this is where we fundamentally disagree. My position is that
QEMU is already the authoritative source. Having a state file
doesn't change anything.
Do a hot unplug of a network device with upstream libvirt with
acpiphp unloaded, consult libvirt and then consult the monitor to see
who has the right view of the guests config.
libvirt is right and the monitor is wrong.
On real hardware, calling _EJ0 doesn't affect the configuration one
little bit (if I understand it correctly). It just turns off power to
the slot. If you power-cycle, the card will be there.
It's up to the hardware vendor. Since it's ACPI, it can result in any
number of operations. Usually, there's some logic to flip on an LED or
something.
There's nothing that prevents a vendor from ejecting the card. My point
is that there aren't cleanly separated lines in the real world.
To me, that's the definition of authoritative.
> "No" to all three (ignoring for the moment whether it is good or
> not, which we were debating).
>
>
>> The only suggestion I'm making beyond Marcelo's original patch is
>> that we use a structured format and that we make it possible to use
>> the same file to solve this problem in multiple places.
>>
>
> No, you're suggesting a lot more than that.
That's exactly what I'm suggesting from a technical perspective.
Unless I'm hallucinating, you're suggesting quite a bit more. A
revolution in how qemu is to be managed.
Let me take another route to see if I can't persuade you.
First, let's clarify your proposal. You want to introduce a new block format
that references two block devices. It may also store a dirty bitmap to keep
track of which blocks are out of sync. Hopefully, it goes without saying
that the dirty bitmap is strictly optional (it's a performance optimization),
so let's ignore it.
Your format, as a text file, looks like:
[raid1]
primary=diska.img
secondary=diskb.img
active=primary
To use it, here's the sequence:
0) qemu uses disk A for a block device
1) create a raid1 block device pointing to disk A and disk B.
2) management tool asks qemu to use the new raid1 block device.
3) qemu acks (2)
4) at some point, the mirror completes, writes are going to both disks
5) qemu sends out an event indicating that the disks are in sync
6) management tool then sends a command to fail over to disk B
7) qemu acks (6)
We're making the management tool the "authoritative" source of how to launch
QEMU. That means that the management tool ultimately determines which command
line to relaunch QEMU with.
Here are the races:
A) If QEMU crashes between (2) and (3), it may have issued a write to the new
   raid1 block device before the management tool sees (3). If this happens,
   when the management tool restarts QEMU with disk A, we're left with a
   dangling raid1 block device. Not a critical failure, but not ideal.
B) If QEMU crashes between (6) and (7), QEMU may have started writing to disk
   B before the management tool sees (7). This means that the management tool
   will create the guest with the raid1 block device, which is no longer the
   correct disk. This could fail in subtly bad ways. Depending on how read
   is implemented (if you try to do striping, for instance), bad data could be
   returned. You could try to implement a policy of always reading from B if
   the block has been copied, but this gets hairy really quickly. It's
   definitely not RAID1 anymore.
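As a rough illustration of race (B), the toy model below replays steps (6)
and (7) with an optional crash before the ack. Every name in it (FakeQemu,
ManagementTool, fail_over, views_after_fail_over) is made up for this sketch;
none of it is a QEMU or libvirt interface, and it only models who believes
what after the crash.

```python
# Hypothetical model of steps (6) and (7); not QEMU or libvirt code.

class Crash(Exception):
    """Raised when the simulated QEMU dies before acknowledging a command."""


class FakeQemu:
    def __init__(self):
        self.active = "raid1"      # state after step (5): mirror is in sync
        self.written_to = set()    # disks that received guest writes afterwards

    def fail_over(self, crash_before_ack=False):
        # Step (6): switch to disk B; the guest may write to it immediately.
        self.active = "diskB"
        self.written_to.add("diskB")
        if crash_before_ack:
            raise Crash()          # step (7) never reaches the management tool
        return "ack"


class ManagementTool:
    def __init__(self):
        self.config = "raid1"      # what it would relaunch QEMU with

    def request_fail_over(self, qemu, crash_before_ack=False):
        try:
            if qemu.fail_over(crash_before_ack) == "ack":
                self.config = "diskB"   # updated only once (7) is seen
        except Crash:
            pass                        # the tool keeps its old view


def views_after_fail_over(crash_before_ack):
    """Return (management tool's view, what QEMU was actually using)."""
    qemu, tool = FakeQemu(), ManagementTool()
    tool.request_fail_over(qemu, crash_before_ack)
    return tool.config, qemu.active
```

With crash_before_ack=True the two views disagree: the tool would relaunch
with the raid1 device even though disk B has already taken writes on its own.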
You may observe that the problem is not the RAID1 mechanism, but changing from
using a normal device to using the RAID1 mechanism. It would then be wise to
say, let's always use this image format. Since that eliminates the race, we
don't really need the copy bitmap anymore.
Now we're left with a simple format that just refers to two filenames.
However, block devices are more than just a filename. They need a format,
cache settings, etc. So let's put this all in the RAID1 block format. We also
need a way to indicate which block device is selected.
Let's make it a text file for purposes of discussion. It will look something
like:
[primary]
filename=diska.img
cache=none
format=raw
[secondary]
filename=diskb.img
cache=writethrough
format=qcow2
[global]
active=primary
Since we might want to mirror multiple drives at once, we should probably
support having multiple drives configured, which means we need to not just
have a single active entry, but an entry associated with a particular device.
[drive "diskA"]
filename=diska.img
cache=none
format=raw
[drive "diskB"]
filename=diskb.img
cache=writethrough
format=qcow2
[device "vda"]
drive=diskB
And this is exactly what I'm proposing. It's really the natural generalization
of what you're proposing.
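To make the lookup concrete, here is a minimal sketch of how a reader of such
a file could resolve which drive a device currently points at. It uses
Python's standard configparser purely as a stand-in parser; the helper name
active_drive is hypothetical, and nothing here is meant to describe how QEMU
itself would read the file.

```python
import configparser

# The multi-drive example from above, verbatim.
EXAMPLE = """
[drive "diskA"]
filename=diska.img
cache=none
format=raw

[drive "diskB"]
filename=diskb.img
cache=writethrough
format=qcow2

[device "vda"]
drive=diskB
"""


def active_drive(text, device):
    """Return the properties of the drive that `device` refers to.

    Hypothetical helper: section names such as 'drive "diskB"' are taken
    literally, quotes included, because that is how configparser sees them.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    drive_id = parser[f'device "{device}"']["drive"]
    return dict(parser[f'drive "{drive_id}"'])


if __name__ == "__main__":
    print(active_drive(EXAMPLE, "vda"))
```

Failing over a device then amounts to rewriting the single drive= line in
its [device] section, which is the per-device "active" entry described above.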
So basically, the only differences are:
1) always use the new RAID1 format
2) drop the progress bitmap
3) support multiple devices per file
4) let drive properties be specified beyond filename
All reasonable things to do.
Regards,
Anthony Liguori