Re: [Qemu-devel] [PATCH 2/8] block: add live block commit functionality


From: Eric Blake
Subject: Re: [Qemu-devel] [PATCH 2/8] block: add live block commit functionality
Date: Fri, 14 Sep 2012 12:23:50 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120828 Thunderbird/15.0

On 09/14/2012 10:07 AM, Jeff Cody wrote:
>> Question: is it valid to have a qcow2 file whose size is smaller than
>> its backing image?
> 
> I don't think so... however:
> 
>>  Suppose I have base[1M] <- mid[2M] <- top[3M] <-
>> active[3M], and request to commit top into base.  This bdrv_truncate()
>> means I will now have:
>>
>> base[3M] <- mid[2M] <- top[3M] <- active[3M].
>>
>> If I then abort the commit operation at this point, then we have the
>> situation of 'mid' reporting a smaller size than 'base' - which may make
>> 'mid' invalid.  And even if it is valid, what happens if I now request
>> to commit 'mid' into 'base', but 'base' already had data written past
>> the 2M mark before I aborted the first operation?
> 
> Once the commit starts, I don't know if you can safely abort it, and
> still count on 'mid' being valid.  Ignoring potential size differences,
> how would you ever know that what was written from 'top' into 'base' is
> compatible with what is present in 'mid'?

We chatted about this some more on IRC, and I'll attempt to summarize
the results of that conversation (correct me if I'm wrong)...

When committing across multiple images, there are four allocation cases
to consider:

1. unallocated in both mid and top => nothing to do; base is already
correct

2. allocated in mid but not top => copy from mid to base; as long as mid
is in the chain, both mid and top see the version in mid; as soon as mid
is removed from the chain, top sees the version in base

3. allocated in mid and in top => ultimately, we want to copy from top
to base.  We can also do an intermediate copy from mid to base, although
that is less efficient, provided that the copy from top to base happens
last.  As long as the sector remains allocated in both, mid always sees
its own version, and top always sees its own version.

4. allocated in top but not mid => we want to copy from top to base, but
the moment we do that, if mid is still in the chain, then we have
invalidated the contents of mid.  However, as long as the sector remains
allocated in top, top sees its own version; and even if the sector is
later marked unallocated in top, top would then see through to base and
still see correct contents, even though the intermediate file mid is
inconsistent.
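
The four cases above can be sketched as a small classifier.  This is only
an illustration under a toy model (each image is a Python dict mapping an
allocated sector number to its contents); the names classify() and
source_for_commit() are hypothetical and are not part of QEMU.

```python
# Toy model (an assumption, not QEMU code): each image is a dict that maps
# an allocated sector number to its contents.  A missing key means the
# sector is unallocated, so reads fall through to the backing image.

def classify(sector, mid, top):
    """Return the allocation case (1-4) for one sector."""
    in_mid = sector in mid
    in_top = sector in top
    if not in_mid and not in_top:
        return 1  # nothing to do; base is already correct
    if in_mid and not in_top:
        return 2  # copy mid -> base
    if in_mid and in_top:
        return 3  # copy top -> base (this copy must land last)
    return 4      # copy top -> base; mid becomes inconsistent


def source_for_commit(sector, mid, top):
    """Return the name of the image whose data must end up in base."""
    case = classify(sector, mid, top)
    return {1: None, 2: "mid", 3: "top", 4: "top"}[case]
```

Only case 4 changes what 'mid' reads for a sector that 'mid' itself never
wrote.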

Use of block-commit has the potential to invalidate all images that are
dropped from the chain (namely, any time allocation scenario 4 is
present anywhere in the image); it is up to users to avoid using commit
if they have any other image chain sharing the part of the chain
discarded by this operation (someday, libvirt might track all storage
chains, and be able to prevent an attempt at a commit if it would strand
someone else's chain; but for now, we just document the issue).
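
To make the invalidation concrete, here is a minimal sketch of scenario 4
in the same kind of toy model (dicts standing in for images; read() is a
hypothetical helper, not a QEMU function).  It shows the contents seen
through 'mid' changing as soon as the sector from 'top' lands in 'base'.

```python
def read(sector, chain):
    """Read a sector through a chain listed from topmost image to base."""
    for image in chain:
        if sector in image:
            return image[sector]
    return None  # unallocated everywhere (stand-in for reading zeros)


base = {0: "base-v1"}
mid = {}               # sector 0 is not allocated in mid
top = {0: "top-v2"}    # ...but it is allocated in top (scenario 4)

before = read(0, [mid, base])    # what another chain ending at mid sees
base[0] = top[0]                 # the commit copies top's sector into base
after = read(0, [mid, base])     # mid now sees data it never contained
top_view = read(0, [top, mid, base])
```

Here 'before' and 'after' differ, which is exactly why any other chain
sharing 'mid' is stranded, while the view through 'top' is unchanged.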

Next, there is a question of whether invalidating the image up front is
acceptable, or whether we must go through gyrations to avoid
invalidation until after the image has been dropped from the chain.
That is, does the invalidation happen the moment the commit starts (and
can't be undone by an early abort), or can it be delayed until the point
that the image is actually dropped from the chain?  As long as the
current running qemu is the only entity using the portion of the chain
being dropped, then the timing does not matter, other than affecting
what optimizations we might be able to perform.

There is also a question of what happens if a commit is started, then
aborted, then restarted.  It is always safe to restart the same commit
from scratch, just not optimal, as the later run will spend time copying
content that the first run already placed in base.  The only
way to avoid copying sectors on a second run is to mark them unallocated
on the first run, but then we have the issue of consistency: if a sector
is allocated in both mid and top (scenario 3), and the first run copies
top into base and then marks top unallocated, then a future read of top
would pick up the contents from mid, which is wrong.  Therefore, we
cannot mark sectors unallocated unless we traverse them in a safe order.
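
A minimal sketch of that hazard, again assuming a toy model where images
are dicts and read() is a hypothetical helper rather than anything in
QEMU:

```python
def read(sector, chain):
    """Read a sector through a chain listed from topmost image to base."""
    for image in chain:
        if sector in image:
            return image[sector]
    return None  # unallocated everywhere


base = {}
mid = {7: "mid-data"}
top = {7: "top-data"}      # scenario 3: allocated in both mid and top
chain = [top, mid, base]

expected = read(7, chain)  # top must keep seeing its own version

# Unsafe shortcut: copy top -> base, then mark the sector unallocated in
# top while mid is still in the chain.
base[7] = top[7]
del top[7]

observed = read(7, chain)  # the read now stops at mid, not base
```

After the shortcut, 'observed' holds mid's stale data instead of the
contents that 'top' had, even though 'base' is correct.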

I was able to come up with an algorithm that allows for faster restarts
of a commit operation, in order to avoid copying any sector into base
more than once (at least, insofar as top is not also an active image,
but we already deferred committing an active image for a later date).
It requires that every image being trimmed from the chain be r/w
(although only one image has to be r/w at a time), and that the copies
be done in a depth-first manner.  That is, the algorithm first visits
all allocated sectors in 'mid'; if they are not also allocated in top,
then the sector is copied into base and marked unallocated in mid.  When
mid is completed, it is removed from the chain, before proceeding to
top.  Eventually, all sectors will be copied into base, exactly once,
and the algorithm is restartable because it marks sectors unallocated
once base has the correct contents.  But it is more complex to implement.
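
As a rough sketch of the depth-first idea (not the patch's code): images
are modeled as dicts, commit_depth_first() is a hypothetical name, and
the max_copies parameter is an invented knob that simulates an abort
part-way through.

```python
def commit_depth_first(base, images, max_copies=None):
    """Commit `images` (ordered from just above base to topmost) into base.

    Returns the number of sectors copied by this run.  `images` is
    modified in place so that a restarted run resumes where this one left
    off.
    """
    copies = 0
    while images:
        lowest, above = images[0], images[1:]
        for sector in sorted(lowest):
            if any(sector in img for img in above):
                continue  # a higher image overrides this sector; skip it
            if max_copies is not None and copies >= max_copies:
                return copies  # simulated abort; the chain stays readable
            base[sector] = lowest[sector]
            del lowest[sector]  # mark unallocated once base is correct
            copies += 1
        images.pop(0)  # lowest is fully handled: drop it from the chain
    return copies


base = {0: "b0"}
mid = {1: "m1", 2: "m2"}
top = {2: "t2", 3: "t3"}
chain = [mid, top]

first = commit_depth_first(base, chain, max_copies=1)  # aborted run
second = commit_depth_first(base, chain)               # restarted run
```

In this example the aborted run copies one sector and the restarted run
copies the remaining two, so no sector is written into 'base' twice.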

In conclusion, since this stage of the implementation never marks
sectors unallocated, the top of the chain is never invalidated, even if
intermediate files remain in the chain after they have been
invalidated.  I'm okay with this patch going in as a first
approximation, saving the complications of a depth-first approach
coupled with marking sectors unallocated as an optimization we can add
later.  We could perhaps even add a flag to the JSON command to choose
whether to use the optimization (it requires r/w on all images in the
chain, but allows faster restarts) or to skip it (fewer r/w images, but
slower restarts).  That is, this
patch series invalidates intermediate images at the start of the commit
operation, whereas the proposed optimization would defer invalidating
images until they have been removed from the chain, but it doesn't
affect the correctness of this phase of the patch series.

-- 
Eric Blake   address@hidden    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
