From: Eric Blake
Subject: [Qemu-block] blkdebug get_status bug [was: NBD structured reads vs. block size]
Date: Tue, 28 Aug 2018 16:59:25 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
[following up to a different set of emails]
On 08/28/2018 03:41 PM, Eric Blake wrote:
Revisiting this:
On 08/01/2018 09:41 AM, Eric Blake wrote:
Rich Jones pointed me to questionable behavior in qemu's NBD server
implementation today: qemu advertises a minimum block size of 512 to
any client that promises to honor block sizes, but when serving up a
raw file that is not aligned to a sector boundary, attempting to read
that final portion of the file results in a structured read with two
chunks, the first for the data up to the end of the actual file, and
the second reporting a hole for the rest of the sector. If a client is
promising to obey block sizes on its requests, it seems odd that the
server is allowed to send a result that is not also aligned to block
sizes.
Right now, the NBD spec says that when structured replies are in use,
then for a structured read:
    The server MAY split the reply into any number of content chunks;
    each chunk MUST describe at least one byte, although to minimize
    overhead, the server SHOULD use chunks with lengths and offsets as
    an integer multiple of 512 bytes, where possible (the first and
    last chunk of an unaligned read being the most obvious places for
    an exception).
I'm wondering if we should tighten that to require that the server
partition the reply chunks to be aligned to the advertised minimum
block size (at which point, qemu should either advertise 1 instead of
512 as the minimum size when serving up an unaligned file, or else
qemu should just send the final partial sector as a single data chunk
rather than trying to report the last few bytes as a hole).
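The tightened rule is easy to state mechanically. As a minimal sketch (plain C, not qemu code; chunk_is_aligned and its parameters are invented for illustration), a reply chunk would be acceptable only if it sits on min-block boundaries, with the sole tolerated exception being an unaligned tail at the very end of the export:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check (illustration only, not qemu code): a structured
 * read chunk respects the advertised minimum block size if its offset
 * and length are multiples of min_block, except that the length may be
 * cut short at end-of-export (the unaligned tail of a raw file). */
static bool chunk_is_aligned(uint64_t offset, uint64_t len,
                             uint32_t min_block, uint64_t export_size)
{
    if (offset % min_block != 0) {
        return false;
    }
    if (len % min_block == 0) {
        return true;
    }
    /* An unaligned length is only tolerable at the very end. */
    return offset + len == export_size;
}
```

Under this rule, splitting the final partial sector into a data chunk plus a hole chunk would be a violation, since the data chunk's end would fall mid-sector rather than at the end of the export.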
For comparison, on block status, we require:
    The server SHOULD use descriptor
    lengths that are an integer multiple of 512 bytes where possible
    (the first and last descriptor of an unaligned query being the
    most obvious places for an exception), and MUST use descriptor
    lengths that are an integer multiple of any advertised minimum
    block size.
And qemu as a client currently hangs up on any server that violates
that requirement on block status (that is, when qemu as the server
sends a block status reply that is not aligned to the advertised
block size, qemu as the client flags it as an invalid server - which
means qemu as server is currently broken). So I'm thinking we should
copy that requirement onto servers for reads as well.
Vladimir pointed out that the problem is not necessarily just limited to
the implicit hole at the end of a file that was rounded up to sector
size. Another case where sub-region changes occur in qemu is where you
have a backing file with 512-byte hole granularity (qemu-img create -f
qcow2 -o cluster_size=512 backing.qcow2 100M) and an overlay with larger
granularity (qemu-img create -f qcow2 -b backing.qcow2 -F qcow2 -o
cluster_size=4k active.qcow2). On a cluster where the top layer defers
to the underlying layer, it is possible to alternate between holes and
data at sector boundaries but at subsets of the cluster boundary of the
top layer. As long as qemu advertises a minimum block size of 512
rather than the cluster size, then this isn't a problem, but if qemu
were to change to report the qcow2 cluster size as its minimum I/O
(rather than merely its preferred I/O, because it can do
read-modify-write on data smaller than a cluster), this would be another
case where unaligned replies might confuse a client.
So, I tried to create this scenario to see what actually happens, and
found a worse bug that needs to be resolved first. That is,
bdrv_block_status_above() chokes when there is a blkdebug node in
the chain:
$ qemu-img create -f qcow2 -o cluster_size=512 back.qcow2 100M
$ qemu-io -c 'w -P1 0 512' -c 'w -P1 1k 512' -f qcow2 back.qcow2
$ qemu-img create -f qcow2 -F qcow2 -b back.qcow2 \
-o cluster_size=1M top.qcow2
$ qemu-img map --output=json -f qcow2 top.qcow2
[{ "start": 0, "length": 512, "depth": 1, "zero": false, "data": true,
"offset": 27648},
{ "start": 512, "length": 512, "depth": 1, "zero": true, "data": false},
{ "start": 1024, "length": 512, "depth": 1, "zero": false, "data": true,
"offset": 28160},
{ "start": 1536, "length": 104856064, "depth": 1, "zero": true, "data":
false}]
$ qemu-img map --output=json --image-opts \
driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
image.file.filename=top.qcow2,align=4k
[{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data":
false}]
Yikes! Adding blkdebug says there is no data in the file at all!
Actions like 'qemu-img convert' for copying between images would thus
behave differently on a blkdebug image than they would on the real
image, which somewhat defeats the purpose of blkdebug being a filter node.
$ ./qemu-io --image-opts \
driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
image.file.filename=top.qcow2,align=4k
qemu-io> r -P1 0 512
read 512/512 bytes at offset 0
512 bytes, 1 ops; 0.0002 sec (1.782 MiB/sec and 3649.6350 ops/sec)
qemu-io> r -P0 512 512
read 512/512 bytes at offset 512
512 bytes, 1 ops; 0.0002 sec (2.114 MiB/sec and 4329.0043 ops/sec)
qemu-io>
Meanwhile, the data from the backing file is clearly visible when read.
So the bug must lie somewhere in the get_status operation. Looking
closer, I see this in bdrv_co_block_status_above():
    for (p = bs; p != base; p = backing_bs(p)) {
When qcow2 is directly opened, this iterates to back.qcow2 and sees that
there is data in the first cluster, changing the overall status reported
to the caller. But when the qcow2 is hidden by blkdebug, backing_bs()
states that blkdebug has no backing image, and terminates the loop early
with JUST the status of the (empty) top file, rather than properly
merging in the status from the backing file.
I don't know if the bug lies in backing_bs(), or in the blkdebug driver,
or in the combination of the two. Maybe it is as simple as fixing
backing_bs() such that on a filter node bs, it defers to
backing_bs(bs->file->bs), to make filter nodes behave as if they have
the same backing file semantics as what they are wrapping.
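That possible fix can be sketched with a toy model of the graph (a minimal struct, not qemu's real BlockDriverState; the field names and is_filter flag are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the block graph (not qemu's real types): a filter node
 * has no backing child of its own, but wraps another node via 'file'. */
typedef struct Node {
    struct Node *backing;   /* backing child, or NULL */
    struct Node *file;      /* wrapped child for filters, or NULL */
    bool is_filter;
} Node;

/* Current behavior: only consult the node's own backing child, so a
 * filter such as blkdebug appears to have no backing file at all. */
static Node *backing_bs_current(Node *bs)
{
    return bs->backing;
}

/* Sketched fix: a filter defers to the node it wraps, inheriting the
 * wrapped node's backing-file semantics. */
static Node *backing_bs_fixed(Node *bs)
{
    if (bs->is_filter && bs->file) {
        return backing_bs_fixed(bs->file);
    }
    return bs->backing;
}
```

With blkdebug wrapping top.qcow2 (which in turn backs onto back.qcow2), the current behavior returns NULL and terminates the status loop early, while the deferring version reaches the backing file.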
I was _trying_ to figure out if the block layer and/or blkdebug also
needs to perform rounding of get_status results: if
bs->bl.request_alignment is bigger at the current layer than what it is
in the underlying protocol and/or backing layer, who is responsible for
rounding the block status up to the proper alignment boundaries exposed
by the layer being questioned? Or should we instead make sure that NBD
advertises the smallest alignment of anything in the chain, rather than
limiting itself to just the alignment of the top layer in the chain?
But since the graph can be modified on the fly, it's possible that the
smallest alignment anywhere in the chain can change over time.
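The "smallest alignment in the chain" option amounts to a simple chain walk; here is a toy version (illustration only, not qemu code; the Layer struct is invented), with the caveat from above that the result is only a snapshot if the graph can change underneath:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy chain walk (not qemu code): report the smallest
 * request_alignment found anywhere in a backing chain, which would be
 * a safe minimum block size for NBD to advertise, since any reply
 * subdivided at a lower layer's granularity remains aligned to it. */
typedef struct Layer {
    uint32_t request_alignment;
    struct Layer *backing;  /* next layer down, or NULL */
} Layer;

static uint32_t chain_min_alignment(const Layer *top)
{
    uint32_t min = top->request_alignment;
    for (const Layer *l = top->backing; l; l = l->backing) {
        if (l->request_alignment < min) {
            min = l->request_alignment;
        }
    }
    return min;
}
```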
And when there is no backing file in the mix, blkdebug does indeed have
the problem that it reports boundaries at an alignment smaller than what
it was declared to honor:
$ ./qemu-img map --output=json --image-opts \
driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
image.file.filename=back.qcow2,align=4k
[{ "start": 0, "length": 512, "depth": 0, "zero": false, "data": true,
"offset": 27648},
{ "start": 512, "length": 512, "depth": 0, "zero": true, "data": false},
{ "start": 1024, "length": 512, "depth": 0, "zero": false, "data": true,
"offset": 28160},
{ "start": 1536, "length": 104856064, "depth": 0, "zero": true, "data":
false}]
Presumably, when rounding a collection of smaller statuses from an
underlying layer into the aligned status of a current layer with
stricter alignment, the sane way to do it would be:
- treat BDRV_BLOCK_DATA/BDRV_BLOCK_ALLOCATED as sticky set (if any
  subset had data, the overall area should report data),
- treat BDRV_BLOCK_ZERO as sticky clear (if any subset does not report
  zero, the overall area cannot report zero), and
- treat BDRV_BLOCK_OFFSET_VALID as sticky unset (if any subset does not
  have a valid mapping, or if valid mappings are not contiguous, the
  overall area cannot report a mapping).
But that logic sounds
complicated enough that it's probably better to do it just once in the
block layer, rather than having to repeat it in the various block
drivers that actually have to cope with the possibility of a protocol or
backing layer with a smaller granularity than the current format layer.
And maybe we want it controlled by a flag, since some callers want
answers as precise as possible, even when they are subdivided more
finely than the request_alignment of the initial query, while other
callers, like NBD, want a guarantee that the answer is properly
rounded to the request_alignment.
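The sticky-bit merge described above can be sketched directly; this is a standalone model (the flag values, Extent struct, and merge_status helper are invented for illustration, only the flag names mirror qemu's):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Flag bits, mirroring the qemu names for illustration only. */
#define BDRV_BLOCK_DATA         0x01
#define BDRV_BLOCK_ZERO         0x02
#define BDRV_BLOCK_OFFSET_VALID 0x04

typedef struct {
    uint32_t flags;
    int64_t map;    /* host offset, meaningful if OFFSET_VALID is set */
    int64_t len;    /* guest length of this sub-extent */
} Extent;

/* Hypothetical merge (not qemu code): collapse the statuses of
 * consecutive sub-extents into one status for the aligned region.
 * DATA is sticky-set, ZERO is sticky-clear, and OFFSET_VALID survives
 * only if every subset has a valid, contiguous mapping. */
static uint32_t merge_status(const Extent *sub, size_t n, int64_t *map_out)
{
    uint32_t flags = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
    int64_t expect = n ? sub[0].map : 0;

    for (size_t i = 0; i < n; i++) {
        if (sub[i].flags & BDRV_BLOCK_DATA) {
            flags |= BDRV_BLOCK_DATA;           /* sticky set */
        }
        if (!(sub[i].flags & BDRV_BLOCK_ZERO)) {
            flags &= ~BDRV_BLOCK_ZERO;          /* sticky clear */
        }
        if (!(sub[i].flags & BDRV_BLOCK_OFFSET_VALID) ||
            sub[i].map != expect) {
            flags &= ~BDRV_BLOCK_OFFSET_VALID;  /* sticky unset */
        }
        expect += sub[i].len;
    }
    if ((flags & BDRV_BLOCK_OFFSET_VALID) && n) {
        *map_out = sub[0].map;
    }
    return flags;
}
```

A data sub-extent followed by a hole would merge to DATA without ZERO or a mapping, while two contiguously mapped data sub-extents would keep OFFSET_VALID with the first mapping's offset.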
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org