[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support

From: Kevin Wolf
Subject: Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support
Date: Fri, 08 Jun 2012 23:08:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

Am 08.06.2012 20:33, schrieb Jeff Cody:
> On 06/08/2012 01:57 PM, Kevin Wolf wrote:
>> Am 08.06.2012 19:46, schrieb Jeff Cody:
>>> On 06/08/2012 12:11 PM, Kevin Wolf wrote:
>>>> Am 08.06.2012 16:32, schrieb Jeff Cody:
>>>>> On 06/08/2012 09:53 AM, Stefan Hajnoczi wrote:
>>>>>> On Fri, Jun 8, 2012 at 2:19 PM, Jeff Cody <address@hidden> wrote:
>>>>>>> On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote:
>>>>>>>> Note that block-commit cannot work on the top-most image since the
>>>>>>>> guest is still writing to that image and we might never be able to
>>>>>>>> copy all the data into the base image (the guest could write new data
>>>>>>>> as quickly as we copy it to the base).  The command should check for
>>>>>>>> this and reject the top-most image.
>>>>>>> By this you mean that you would like to disallow committing the
>>>>>>> top-level image to the base?  Perhaps there is a way to attempt to
>>>>>>> converge, and adaptively give more time to the co-routine if we are able
>>>>>>> to detect divergence.  This may require violating the 'speed' parameter,
>>>>>>> however, and make the commit less 'live'.
>>>>>> Yes, I think we should disallow merging down the topmost image.  The
>>>>>> convergence problem is the same issue that live migration has.  It's a
>>>>>> hard problem and not something that is essential for block-commit.  It
>>>>>> would add complexity into the block-commit implementation - the only
>>>>>> benefit would be that you can merge down the last COW snapshot.  We
>>>>>> already have an implementation that does this: the "commit" command
>>>>>> which stops the guest :).
>>>>> OK.  And, it is better to get this implemented now, and if desired, I
>>>>> suppose we can always revisit the convergence at a later date (or not at
>>>>> all, if there is no pressing case for it).
>>>> Not allowing live commit for the topmost image may be a good way to
>>>> break the problem down into more manageable pieces, but eventually the
>>>> feature isn't complete until you can do it.
>>>> Basically what you need there is the equivalent of an active mirror. I
>>>> wouldn't be surprised if actually code could be reused for both.
>>> I agree, doing it like mirroring for new writes on the top layer makes
>>> sense, as long as you are willing to violate the (optional) speed
>>> parameter.  I wouldn't think violating the speed parameter in that case
>>> would be an issue, as long as it is a documented affect of performing a
>>> live commit on the active (top) layer.
>> The question is what the speed parameter really means for a live commit
>> block job (or for an active mirror). I think it would make sense to
>> limit only the speed of I/O that doesn't come from the guest but is from
>> the background copying.
>> If you did apply it to guest I/O as well, then I think the logical
>> consequence would be that guest I/O is throttled as well because
>> requests only complete when they have reached both source and target.
>> (I know I'm happily mixing terminology of all kinds of live block jobs
>> here, hope you understand anyway what I mean ;-))
> I understand what you mean :) My thought is that the block commit
> 'speed' parameter would essentially only apply to 'old data', that is
> being copied in the commit coroutine, and it would be ignored (violated)
> for 'new data' in the mirroring-like case.
> I assumed we wouldn't want to throttle the guest I/O, but that may be a
> bad assumption on my part.

I don't disagree, I just wanted to point out that throttling the guest
would be another option. Applying the limit only to old data probably
makes most sense indeed.

>>>>> I am happy to do it that way.  I'll shift my focus to the atomic image
>>>>> reopen in r/w mode.  I'll go ahead and post my diagrams and other info
>>>>> for block-commit on the wiki, because I don't think it conflicts with we
>>>>> discussed above (although I will modify my diagrams to not show commit
>>>>> from the top-level image).  Of course, feel free to change it as
>>>>> necessary.
>>>> I may have mentioned it before, but just in case: I think Supriya's
>>>> bdrv_reopen() patches are a good starting point. I don't know why they
>>>> were never completed, but I think we all agreed on the general design,
>>>> so it should be possible to pick that up.
>>>> Though if you have already started with your own work on it, Jeff, I
>>>> expect that it won't be much different because it's basically the same
>>>> transactional approach that you know and that we already used for group
>>>> snapshots.
>>> I will definitely use parts of Supriya's as it makes sense - what I
>>> started work on is similar to bdrv_append() and the current transaction
>>> approach, so there will be plenty in common to reuse, even with some
>>> differences.
>> I'm not sure how far a bdrv_append-like approach will work in this case.
>> The problem is that you must not open any image r/w twice. The policy is
>> that you can open an image either once r/w or multiple times read-only.
> I think there are 3 possibilities that would be OK: open once r/w, one
> or more times r/o, or once r/w + one or more times r/o.  Do you agree
> with this?  Is the latter case against block-layer policy?

Yes, it is. The problem is that image formats have data cached in
memory, and as soon as you have a writer, the other users of the same
image file - no matter whether r/w or r/o - get out of sync. They might
mix old (cached) matadata with other new metadata, and I think you can
imagine what mess would result.

Strictly speaking even raw caches things like the image size, so if the
r/w instance happens to resize the image, the additional r/o readers
would have a problem even with raw.

> In this special case for block commit, we need to do the reopen() to go
> from a r/o image to a r/w image.  If an image is opened r/o, the drv
> layer can do another open with a new descriptor, this time r/w, without
> closing the r/o instance.  If the new r/w open was successful, then the
> r/o instance can be closed, and some bdrv_append()-like operations done.
> Otherwise, like the current transactional code, we can just 'walk away'
> by only closing the new image (if it got that far), and the existing BDS
> with the r/o image is safely unchanged (and never closed, so no
> unrecoverable error paths).

Yes. It's just that you need to be very careful. You would ask the qcow2
driver to reopen the image, and it would take care to invalidate any
cached data and then pass the request on to the next layer. When you
have drilled down through the stack to raw-posix, it would actually open
the second file descriptor and wait for a commit/abort that goes through
the whole stack again.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]