[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [Qemu-devel] [PATCH 17/17] block: Make bdrv_is_allocate

From: Eric Blake
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH 17/17] block: Make bdrv_is_allocated_above() byte-based
Date: Wed, 10 May 2017 10:42:30 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.0

On 04/24/2017 08:48 PM, Eric Blake wrote:
> On 04/24/2017 06:06 PM, John Snow wrote:
>> On 04/11/2017 06:29 PM, Eric Blake wrote:
>>> We are gradually moving away from sector-based interfaces, towards
>>> byte-based.  In the common case, allocation is unlikely to ever use
>>> values that are not naturally sector-aligned, but it is possible
>>> that byte-based values will let us be more precise about allocation
>>> at the end of an unaligned file that can do byte-based access.
>>> Changing the signature of the function to use int64_t *pnum ensures
>>> that the compiler enforces that all callers are updated.  For now,
>>> the io.c layer still assert()s that all callers are sector-aligned,
>>> but that can be relaxed when a later patch implements byte-based
>>> block status.  Therefore, for the most part this patch is just the
>>> addition of scaling at the callers followed by inverse scaling at
>>> bdrv_is_allocated().  But some code, particularly stream_run(),
>>> gets a lot simpler because it no longer has to mess with sectors.
>>> +++ b/block/io.c
>>> @@ -1930,52 +1930,46 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState 
>>> *bs, int64_t offset,
>>>  /*
>>>   * Given an image chain: ... -> [BASE] -> [INTER1] -> [INTER2] -> [TOP]
>>>   *
>>> - * Return true if the given sector is allocated in any image between
>>> + * Return true if the given offset is allocated in any image between
>> perhaps "range" instead of "offset"?
>>>   * BASE and TOP (inclusive).  BASE can be NULL to check if the given
>>> - * sector is allocated in any image of the chain.  Return false otherwise,
>>> + * offset is allocated in any image of the chain.  Return false otherwise,
>>>   * or negative errno on failure.
> Seems reasonable.

Actually, not quite. Suppose we have the following sector allocations:

backing: -- -- XX --
active:  -- -- -- XX

bdrv_is_allocated_above(active, NULL, 0, 2048, &num)

will return 0 with num set to 1024 (offset is not allocated, and the
not-allocated range is at least 1024 bytes).  But calling

bdr_is_allocated_above(backing, NULL, 1024, 1024, &num)

will return 1 with num set to 512 (the entire range is not allocated,
but the 512-byte prefix of the range IS allocated).  Meanwhile, note that:

bdrv_is_allocated_above(active, NULL, 1024, 1024, &num)

will ALSO return 1 with num set to 512, even though it would be nicer if
it could return 1024 (from the active layer, all 1024 bytes of the given
range ARE allocated, just not from the same location).  So callers have
to manually coalesce multiple bdrv_is_allocated_above() calls to get a
full picture of what is allocated.

The same is ALSO true if you have a fragmented image.  For example:

$ qemu-img create -f qcow2 -o cluster_size=1m file3 10m
$ qemu-io -f qcow2 -c 'w 0 5m' -c 'discard 0 2m' -c 'w 1m 1m' \
  -c 'w 0 1m' -c 'w 8m 2m' file3

The image is now fragmented (the clusters at 0 and 1m swapped mappings):

[{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data":
true, "offset": 6291456},
{ "start": 1048576, "length": 1048576, "depth": 0, "zero": false,
"data": true, "offset": 5242880},
{ "start": 2097152, "length": 3145728, "depth": 0, "zero": false,
"data": true, "offset": 7340032},
{ "start": 5242880, "length": 3145728, "depth": 0, "zero": true, "data":
{ "start": 8388608, "length": 2097152, "depth": 0, "zero": false,
"data": true, "offset": 10485760}]

'qemu-io alloc' and 'qemu-io map' are callers which coalesce similar

$ ./qemu-io -f qcow2 file3
qemu-io> alloc 0 10m
7340032/10485760 bytes allocated at offset 0 bytes
qemu-io> map
5 MiB (0x500000) bytes     allocated at offset 0 bytes (0x0)
3 MiB (0x300000) bytes not allocated at offset 5 MiB (0x500000)
2 MiB (0x200000) bytes     allocated at offset 8 MiB (0x800000)

although tracing under gdb showed 5 calls to bdrv_is_allocated() during
'alloc' (at [0, 10m], [1m, 9m], [2m, 8m], [5m, 5m], [8m, 2m]), and 8
calls during 'map' (at [0, 10m], [1m, 9m], [2m, 8m], [5m, 5m], [5m, 5m],
[8m, 2m], [8m, 2m], [10m, 0]).  Part of that is that the qemu-io map
implementation is rather inefficient - it makes a wasted query for 0
bytes, and at every transition between allocated and unallocated it ends
up asking for status on the same offset a second time around, rather
than remembering what it already learned on the previous iteration.

It might be worth followup patches to improve the efficiency of qemu-io
map; but also to change semantics such that bdrv_is_allocated_above()
gives the largest answer possible, rather than making the callers
coalesce identical answers.  Part of that would include teaching
.bdrv_co_get_block_status() new semantics: if file==NULL, then don't
return BDRV_BLOCK_OFFSET_VALID, but merely grab as much
BDRV_BLOCK_DATA/BDRV_BLOCK_ZERO information as possible (in spite of
fragmentation) [in the example above, it would mean returning
BDRV_BLOCK_DATA for 5m, rather than three separate returns of 1m/1m/3m];
as well as teaching bdrv_is_allocated_above() to concatenate all answers
along the backing chain until it reaches an actual change in status.
Looks like I've just added more work to my queue.

That said, there still has to be coalescing somewhere.  For example,
qcow2 images can easily return status for any range covered by a single
L2 cluster, but intentionally quits its response at the end of an L2
cluster because the cost of loading up the next cluster (with the
potential of dropping locks and racing with other threads doing writes)
becomes too hard to control within the qcow2 layer, and callers may need
to realize that the larger a request is on an actively-changing image,
the more likely the overall response can be inaccurate by the time
multiple sub-queries have been coalesced.

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]