
From: Eric Blake
Subject: [Qemu-devel] blkdebug get_status bug [was: [Qemu-block] NBD structured reads vs. block size]
Date: Tue, 28 Aug 2018 16:59:25 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

[following up to a different set of emails]

On 08/28/2018 03:41 PM, Eric Blake wrote:
Revisiting this:

On 08/01/2018 09:41 AM, Eric Blake wrote:
Rich Jones pointed me to questionable behavior in qemu's NBD server implementation today: qemu advertises a minimum block size of 512 to any client that promises to honor block sizes, but when serving up a raw file that is not aligned to a sector boundary, a read of the final portion of the file results in a structured reply with two chunks: the first for the data up to the actual end of the file, and the second reporting a hole for the rest of the sector. If a client is promising to obey block sizes on its requests, it seems odd that the server is allowed to send a reply that is not also aligned to block sizes.

Right now, the NBD spec says that when structured replies are in use, then for a structured read:

     The server MAY split the reply into any number of content chunks;
     each chunk MUST describe at least one byte, although to minimize
     overhead, the server SHOULD use chunks with lengths and offsets as
     an integer multiple of 512 bytes, where possible (the first and
     last chunk of an unaligned read being the most obvious places for
     an exception).

I'm wondering if we should tighten that to require that the server partition the reply chunks to be aligned to the advertised minimum block size (at which point, qemu should either advertise 1 instead of 512 as the minimum size when serving up an unaligned file, or else qemu should just send the final partial sector as a single data chunk rather than trying to report the last few bytes as a hole).
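A sketch of the second option (sending the final partial sector as a single data chunk so the reply stays aligned): the names here (plan_read_chunks, ReadChunk) are illustrative stand-ins, not qemu's actual server code.

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical helper: decide how to chunk a structured read reply.
 * If the data/hole boundary would fall inside a minimum block, merge
 * the whole request into one data chunk (the tail reads as zeroes),
 * keeping every chunk aligned to the advertised minimum block size. */
typedef struct ReadChunk {
    uint64_t offset, length;
    int is_hole;
} ReadChunk;

static int plan_read_chunks(uint64_t file_size, uint64_t off,
                            uint64_t len, uint32_t min_block,
                            ReadChunk out[2])
{
    uint64_t data_len = off < file_size ? MIN(len, file_size - off) : 0;
    uint64_t hole_len = len - data_len;
    int n = 0;

    if (hole_len && data_len % min_block) {
        /* Unaligned boundary: one data chunk covering the request. */
        out[0] = (ReadChunk){ off, len, 0 };
        return 1;
    }
    if (data_len) {
        out[n++] = (ReadChunk){ off, data_len, 0 };
    }
    if (hole_len) {
        out[n++] = (ReadChunk){ off + data_len, hole_len, 1 };
    }
    return n;
}
```

With a 1000-byte raw file and a minimum block size of 512, a read of the final sector (offset 512, length 512) comes back as one 512-byte data chunk instead of 488 bytes of data plus a 24-byte hole.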

For comparison, on block status, we require:

     The server SHOULD use descriptor
     lengths that are an integer multiple of 512 bytes where possible
     (the first and last descriptor of an unaligned query being the
     most obvious places for an exception), and MUST use descriptor
     lengths that are an integer multiple of any advertised minimum
     block size.

And qemu as a client currently hangs up on any server that violates that requirement on block status (that is, when qemu as the server tries to send a block status that was not aligned to the advertised block size, qemu as the client flags it as an invalid server - which means qemu as server is currently broken).  So I'm thinking we should copy that requirement onto servers for reads as well.
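The alignment rule the client enforces is simple to state; here is a sketch (with illustrative names, not qemu's actual NBD client code) of when a reply descriptor or chunk should be accepted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validator: a descriptor/chunk is acceptable only if it
 * starts on a minimum-block boundary and its length is a multiple of
 * the minimum block size, with the sole exception of a final fragment
 * that ends exactly at the end of the export. */
static bool chunk_aligned(uint64_t offset, uint64_t length,
                          uint64_t export_size, uint32_t min_block)
{
    if (length == 0 || offset % min_block) {
        return false;
    }
    if (length % min_block && offset + length != export_size) {
        return false;
    }
    return true;
}
```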

Vladimir pointed out that the problem is not necessarily just limited to the implicit hole at the end of a file that was rounded up to sector size.  Another case where sub-region changes occur in qemu is where you have a backing file with 512-byte hole granularity (qemu-img create -f qcow2 -o cluster_size=512 backing.qcow2 100M) and an overlay with larger granularity (qemu-img create -f qcow2 -b backing.qcow2 -F qcow2 -o cluster_size=4k active.qcow2). On a cluster where the top layer defers to the underlying layer, it is possible to alternate between holes and data at sector boundaries but at subsets of the cluster boundary of the top layer.  As long as qemu advertises a minimum block size of 512 rather than the cluster size, then this isn't a problem, but if qemu were to change to report the qcow2 cluster size as its minimum I/O (rather than merely its preferred I/O, because it can do read-modify-write on data smaller than a cluster), this would be another case where unaligned replies might confuse a client.

So, I tried to actually create this scenario, to see what happens, and found a worse bug that needs to be resolved first. That is, bdrv_block_status_above() chokes when there is a blkdebug node in the mix:

$ qemu-img create -f qcow2 -o cluster_size=512 back.qcow2 100M
$ qemu-io -c 'w -P1 0 512' -c 'w -P1 1k 512' -f qcow2 back.qcow2
$ qemu-img create -f qcow2 -F qcow2 -b back.qcow2 \
    -o cluster_size=1M top.qcow2
$ qemu-img map --output=json -f qcow2 top.qcow2
[{ "start": 0, "length": 512, "depth": 1, "zero": false, "data": true, "offset": 27648},
{ "start": 512, "length": 512, "depth": 1, "zero": true, "data": false},
{ "start": 1024, "length": 512, "depth": 1, "zero": false, "data": true, "offset": 28160},
{ "start": 1536, "length": 104856064, "depth": 1, "zero": true, "data": false}]
$ qemu-img map --output=json --image-opts \
[{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": false}]

Yikes! Adding blkdebug says there is no data in the file at all! Actions like 'qemu-img convert' for copying between images would thus behave differently on a blkdebug image than they would on the real image, which somewhat defeats the purpose of blkdebug being a filter node.

$ ./qemu-io --image-opts \
qemu-io> r -P1 0 512
read 512/512 bytes at offset 0
512 bytes, 1 ops; 0.0002 sec (1.782 MiB/sec and 3649.6350 ops/sec)
qemu-io> r -P0 512 512
read 512/512 bytes at offset 512
512 bytes, 1 ops; 0.0002 sec (2.114 MiB/sec and 4329.0043 ops/sec)

Meanwhile, the data from the backing file is clearly visible when read. So the bug must lie somewhere in the get_status operation. Looking closer, I see this in bdrv_co_block_status_above():

    for (p = bs; p != base; p = backing_bs(p)) {

When qcow2 is directly opened, this iterates to back.qcow2 and sees that there is data in the first cluster, changing the overall status reported to the caller. But when the qcow2 is hidden by blkdebug, backing_bs() states that blkdebug has no backing image, and terminates the loop early with JUST the status of the (empty) top file, rather than properly merging in the status from the backing file.

I don't know if the bug lies in backing_bs(), or in the blkdebug driver, or in the combination of the two. Maybe it is as simple as fixing backing_bs() such that on a filter node bs, it defers to backing_bs(bs->file->bs), to make filter nodes behave as if they have the same backing file semantics as what they are wrapping.
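A minimal sketch of that idea, using stand-in structs rather than qemu's real BlockDriverState (the is_filter field is an assumption for illustration; qemu tracks filter drivers differently):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for qemu's block graph types; illustrative only. */
typedef struct BlockDriverState BlockDriverState;
typedef struct BdrvChild {
    BlockDriverState *bs;
} BdrvChild;
struct BlockDriverState {
    bool is_filter;     /* e.g. a blkdebug node */
    BdrvChild *file;    /* the child a filter wraps */
    BdrvChild *backing; /* backing child of a format node */
};

/* Proposed behavior: a filter node defers to its filtered child, so
 * it inherits the backing-file semantics of whatever it wraps. */
static BlockDriverState *backing_bs(BlockDriverState *bs)
{
    if (bs->is_filter && bs->file) {
        return backing_bs(bs->file->bs);
    }
    return bs->backing ? bs->backing->bs : NULL;
}
```

With a graph of blkdebug -> top.qcow2 -> back.qcow2, this version reports back.qcow2 as the backing file of the blkdebug node, so the loop in bdrv_co_block_status_above() would no longer terminate early.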

I was _trying_ to figure out if the block layer and/or blkdebug also needs to perform rounding of get_status results: if bs->bl.request_alignment is bigger at the current layer than what it is in the underlying protocol and/or backing layer, who is responsible for rounding the block status up to the proper alignment boundaries exposed by the layer being questioned? Or should we instead make sure that NBD advertises the smallest alignment of anything in the chain, rather than limiting itself to just the alignment of the top layer in the chain? But since the graph can be modified on the fly, it's possible that the smallest alignment anywhere in the chain can change over time.

And when there is no backing file in the mix, blkdebug does indeed have the problem that it reports boundaries at an alignment smaller than what it was declared to honor:

$  ./qemu-img map --output=json --image-opts \
[{ "start": 0, "length": 512, "depth": 0, "zero": false, "data": true, "offset": 27648},
{ "start": 512, "length": 512, "depth": 0, "zero": true, "data": false},
{ "start": 1024, "length": 512, "depth": 0, "zero": false, "data": true, "offset": 28160},
{ "start": 1536, "length": 104856064, "depth": 0, "zero": true, "data": false}]

Presumably, when rounding a collection of smaller statuses from an underlying layer into the aligned status of a current layer with stricter alignment, the sane way to do it would be:

- BDRV_BLOCK_DATA/BDRV_BLOCK_ALLOCATED is sticky set (if any subset had data, the overall area should report data);
- BDRV_BLOCK_ZERO is sticky clear (if any subset does not report zero, the overall area cannot report zero);
- BDRV_BLOCK_OFFSET_VALID is sticky unset (if any subset does not have a valid mapping, or if valid mappings are not contiguous, the overall area cannot report a mapping).

But that logic sounds complicated enough that it's probably better to do it just once in the block layer, rather than repeating it in the various block drivers that have to cope with a protocol or backing layer of smaller granularity than the current format layer. And maybe we want it to be controlled by a flag: some callers want answers as precise as possible, even when subdivided smaller than the request_alignment of the initial query, while other callers like NBD want the answer properly rounded up to the request_alignment.
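The sticky rules above amount to a flag merge over adjacent sub-extents; a sketch, with macro values restated here so the example is self-contained (they mirror qemu's BDRV_BLOCK_* bits only in spirit, and the contiguity check for merged mappings is omitted):

```c
#include <assert.h>

/* Illustrative flag values standing in for qemu's BDRV_BLOCK_* bits. */
#define BDRV_BLOCK_DATA         0x01
#define BDRV_BLOCK_ZERO         0x02
#define BDRV_BLOCK_OFFSET_VALID 0x04
#define BDRV_BLOCK_ALLOCATED    0x08

/* Merge the status of two adjacent sub-extents into one aligned
 * extent: DATA/ALLOCATED are sticky set (OR), while ZERO and
 * OFFSET_VALID are sticky clear (AND). A real implementation would
 * also have to verify that merged mappings are contiguous before
 * keeping OFFSET_VALID. */
static int merge_status(int a, int b)
{
    int ret = 0;
    ret |= (a | b) & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED);
    ret |= (a & b) & (BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID);
    return ret;
}
```

Folding merge_status() over all sub-extents inside one aligned block yields the status for that block; a data sector next to a hole sector collapses to plain data, and only an all-zero run stays reported as zero.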

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
