From: Eric Blake
Subject: [Qemu-block] blkdebug get_status bug [was: NBD structured reads vs. block size]
Date: Tue, 28 Aug 2018 16:59:25 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
[following up to a different set of emails]
On 08/28/2018 03:41 PM, Eric Blake wrote:
Revisiting this:
On 08/01/2018 09:41 AM, Eric Blake wrote:
Rich Jones pointed me to questionable behavior in qemu's NBD server
implementation today: qemu advertises a minimum block size of 512 to
any client that promises to honor block sizes, but when serving up a
raw file that is not aligned to a sector boundary, attempting to read
that final portion of the file results in a structured read with two
chunks, the first for the data up to the end of the actual file, and
the second reporting a hole for the rest of the sector. If a client is
promising to obey block sizes on its requests, it seems odd that the
server is allowed to send a result that is not also aligned to block
sizes.
Right now, the NBD spec says that when structured replies are in use,
then for a structured read:
    The server MAY split the reply into any number of content chunks;
    each chunk MUST describe at least one byte, although to minimize
    overhead, the server SHOULD use chunks with lengths and offsets as
    an integer multiple of 512 bytes, where possible (the first and
    last chunk of an unaligned read being the most obvious places for
    an exception).
I'm wondering if we should tighten that to require that the server
partition the reply chunks to be aligned to the advertised minimum
block size (at which point, qemu should either advertise 1 instead of
512 as the minimum size when serving up an unaligned file, or else
qemu should just send the final partial sector as a single data chunk
rather than trying to report the last few bytes as a hole).
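The tightened rule is easy to state mechanically. As a minimal sketch (plain C, not qemu code; chunk_is_aligned and its parameters are invented for illustration), a reply chunk would be acceptable only if it sits on min-block boundaries, with the sole tolerated exception being an unaligned tail at the very end of the export:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check (illustration only, not qemu code): a structured
 * read chunk respects the advertised minimum block size if its offset
 * and length are multiples of min_block, except that the length may be
 * cut short at end-of-export (the unaligned tail of a raw file). */
static bool chunk_is_aligned(uint64_t offset, uint64_t len,
                             uint32_t min_block, uint64_t export_size)
{
    if (offset % min_block != 0) {
        return false;
    }
    if (len % min_block == 0) {
        return true;
    }
    /* An unaligned length is only tolerable at the very end. */
    return offset + len == export_size;
}
```

Under this rule, splitting the final partial sector into a data chunk plus a hole chunk would be a violation, since the data chunk's end would fall mid-sector rather than at the end of the export.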
For comparison, on block status, we require:
    The server SHOULD use descriptor
    lengths that are an integer multiple of 512 bytes where possible
    (the first and last descriptor of an unaligned query being the
    most obvious places for an exception), and MUST use descriptor
    lengths that are an integer multiple of any advertised minimum
    block size.
And qemu as a client currently hangs up on any server that violates
that requirement on block status (that is, when qemu as the server
sends a block status reply that is not aligned to the advertised
block size, qemu as the client flags it as an invalid server - which
means qemu as server is currently broken). So I'm thinking we should
copy that requirement onto servers for reads as well.
Vladimir pointed out that the problem is not necessarily just limited to
the implicit hole at the end of a file that was rounded up to sector
size. Another case where sub-region changes occur in qemu is where you
have a backing file with 512-byte hole granularity (qemu-img create -f
qcow2 -o cluster_size=512 backing.qcow2 100M) and an overlay with larger
granularity (qemu-img create -f qcow2 -b backing.qcow2 -F qcow2 -o
cluster_size=4k active.qcow2). On a cluster where the top layer defers
to the underlying layer, it is possible to alternate between holes and
data at sector boundaries but at subsets of the cluster boundary of the
top layer. As long as qemu advertises a minimum block size of 512
rather than the cluster size, then this isn't a problem, but if qemu
were to change to report the qcow2 cluster size as its minimum I/O
(rather than merely its preferred I/O, because it can do
read-modify-write on data smaller than a cluster), this would be another
case where unaligned replies might confuse a client.
So, I tried to create this scenario to see what actually happens, and
found a worse bug that needs to be resolved first. That is,
bdrv_block_status_above() chokes when there is a blkdebug node in
the chain:
$ qemu-img create -f qcow2 -o cluster_size=512 back.qcow2 100M
$ qemu-io -c 'w -P1 0 512' -c 'w -P1 1k 512' -f qcow2 back.qcow2
$ qemu-img create -f qcow2 -F qcow2 -b back.qcow2 \
-o cluster_size=1M top.qcow2
$ qemu-img map --output=json -f qcow2 top.qcow2
[{ "start": 0, "length": 512, "depth": 1, "zero": false, "data": true,
"offset": 27648},
{ "start": 512, "length": 512, "depth": 1, "zero": true, "data": false},
{ "start": 1024, "length": 512, "depth": 1, "zero": false, "data": true,
"offset": 28160},
{ "start": 1536, "length": 104856064, "depth": 1, "zero": true, "data":
false}]
$ qemu-img map --output=json --image-opts \
driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
image.file.filename=top.qcow2,align=4k
[{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data":
false}]
Yikes! Adding blkdebug says there is no data in the file at all!
Actions like 'qemu-img convert' for copying between images would thus
behave differently on a blkdebug image than they would on the real
image, which somewhat defeats the purpose of blkdebug being a filter node.
$ ./qemu-io --image-opts \
driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
image.file.filename=top.qcow2,align=4k
qemu-io> r -P1 0 512
read 512/512 bytes at offset 0
512 bytes, 1 ops; 0.0002 sec (1.782 MiB/sec and 3649.6350 ops/sec)
qemu-io> r -P0 512 512
read 512/512 bytes at offset 512
512 bytes, 1 ops; 0.0002 sec (2.114 MiB/sec and 4329.0043 ops/sec)
qemu-io>
Meanwhile, the data from the backing file is clearly visible when read.
So the bug must lie somewhere in the get_status operation. Looking
closer, I see this in bdrv_co_block_status_above():
    for (p = bs; p != base; p = backing_bs(p)) {
When qcow2 is directly opened, this iterates to back.qcow2 and sees that
there is data in the first cluster, changing the overall status reported
to the caller. But when the qcow2 is hidden by blkdebug, backing_bs()
states that blkdebug has no backing image, and terminates the loop early
with JUST the status of the (empty) top file, rather than properly
merging in the status from the backing file.
I don't know if the bug lies in backing_bs(), or in the blkdebug driver,
or in the combination of the two. Maybe it is as simple as fixing
backing_bs() such that on a filter node bs, it defers to
backing_bs(bs->file->bs), to make filter nodes behave as if they have
the same backing file semantics as what they are wrapping.
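That possible fix can be sketched with a toy model of the graph (a minimal struct, not qemu's real BlockDriverState; the field names and is_filter flag are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the block graph (not qemu's real types): a filter node
 * has no backing child of its own, but wraps another node via 'file'. */
typedef struct Node {
    struct Node *backing;   /* backing child, or NULL */
    struct Node *file;      /* wrapped child for filters, or NULL */
    bool is_filter;
} Node;

/* Current behavior: only consult the node's own backing child, so a
 * filter such as blkdebug appears to have no backing file at all. */
static Node *backing_bs_current(Node *bs)
{
    return bs->backing;
}

/* Sketched fix: a filter defers to the node it wraps, inheriting the
 * wrapped node's backing-file semantics. */
static Node *backing_bs_fixed(Node *bs)
{
    if (bs->is_filter && bs->file) {
        return backing_bs_fixed(bs->file);
    }
    return bs->backing;
}
```

With blkdebug wrapping top.qcow2 (which in turn backs onto back.qcow2), the current behavior returns NULL and terminates the status loop early, while the deferring version reaches the backing file.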
I was _trying_ to figure out if the block layer and/or blkdebug also
needs to perform rounding of get_status results: if
bs->bl.request_alignment is bigger at the current layer than what it is
in the underlying protocol and/or backing layer, who is responsible for
rounding the block status up to the proper alignment boundaries exposed
by the layer being questioned? Or should we instead make sure that NBD
advertises the smallest alignment of anything in the chain, rather than
limiting itself to just the alignment of the top layer in the chain?
But since the graph can be modified on the fly, it's possible that the
smallest alignment anywhere in the chain can change over time.
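The "smallest alignment in the chain" option amounts to a simple chain walk; here is a toy version (illustration only, not qemu code; the Layer struct is invented), with the caveat from above that the result is only a snapshot if the graph can change underneath:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy chain walk (not qemu code): report the smallest
 * request_alignment found anywhere in a backing chain, which would be
 * a safe minimum block size for NBD to advertise, since any reply
 * subdivided at a lower layer's granularity remains aligned to it. */
typedef struct Layer {
    uint32_t request_alignment;
    struct Layer *backing;  /* next layer down, or NULL */
} Layer;

static uint32_t chain_min_alignment(const Layer *top)
{
    uint32_t min = top->request_alignment;
    for (const Layer *l = top->backing; l; l = l->backing) {
        if (l->request_alignment < min) {
            min = l->request_alignment;
        }
    }
    return min;
}
```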
And when there is no backing file in the mix, blkdebug does indeed have
the problem that it reports boundaries at an alignment smaller than what
it was declared to honor:
$ ./qemu-img map --output=json --image-opts \
driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
image.file.filename=back.qcow2,align=4k
[{ "start": 0, "length": 512, "depth": 0, "zero": false, "data": true,
"offset": 27648},
{ "start": 512, "length": 512, "depth": 0, "zero": true, "data": false},
{ "start": 1024, "length": 512, "depth": 0, "zero": false, "data": true,
"offset": 28160},
{ "start": 1536, "length": 104856064, "depth": 0, "zero": true, "data":
false}]
Presumably, when rounding a collection of smaller statuses from an
underlying layer into the aligned status of a current layer with
stricter alignment, the sane way to do it would be:
- treat BDRV_BLOCK_DATA/BDRV_BLOCK_ALLOCATED as sticky set (if any
  subset had data, the overall area should report data),
- treat BDRV_BLOCK_ZERO as sticky clear (if any subset does not report
  zero, the overall area cannot report zero), and
- treat BDRV_BLOCK_OFFSET_VALID as sticky unset (if any subset does not
  have a valid mapping, or if valid mappings are not contiguous, the
  overall area cannot report a mapping).
But that logic sounds
complicated enough that it's probably better to do it just once in the
block layer, rather than having to repeat it in the various block
drivers that actually have to cope with the possibility of a protocol or
backing layer with a smaller granularity than the current format layer.
And maybe we want it controlled by a flag, since some callers want
answers as precise as possible, even when they are subdivided more
finely than the request_alignment of the initial query, while other
callers, like NBD, want a guarantee that the answer is properly
rounded to the request_alignment.
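The sticky-bit merge described above can be sketched directly; this is a standalone model (the flag values, Extent struct, and merge_status helper are invented for illustration, only the flag names mirror qemu's):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Flag bits, mirroring the qemu names for illustration only. */
#define BDRV_BLOCK_DATA         0x01
#define BDRV_BLOCK_ZERO         0x02
#define BDRV_BLOCK_OFFSET_VALID 0x04

typedef struct {
    uint32_t flags;
    int64_t map;    /* host offset, meaningful if OFFSET_VALID is set */
    int64_t len;    /* guest length of this sub-extent */
} Extent;

/* Hypothetical merge (not qemu code): collapse the statuses of
 * consecutive sub-extents into one status for the aligned region.
 * DATA is sticky-set, ZERO is sticky-clear, and OFFSET_VALID survives
 * only if every subset has a valid, contiguous mapping. */
static uint32_t merge_status(const Extent *sub, size_t n, int64_t *map_out)
{
    uint32_t flags = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
    int64_t expect = n ? sub[0].map : 0;

    for (size_t i = 0; i < n; i++) {
        if (sub[i].flags & BDRV_BLOCK_DATA) {
            flags |= BDRV_BLOCK_DATA;           /* sticky set */
        }
        if (!(sub[i].flags & BDRV_BLOCK_ZERO)) {
            flags &= ~BDRV_BLOCK_ZERO;          /* sticky clear */
        }
        if (!(sub[i].flags & BDRV_BLOCK_OFFSET_VALID) ||
            sub[i].map != expect) {
            flags &= ~BDRV_BLOCK_OFFSET_VALID;  /* sticky unset */
        }
        expect += sub[i].len;
    }
    if ((flags & BDRV_BLOCK_OFFSET_VALID) && n) {
        *map_out = sub[0].map;
    }
    return flags;
}
```

A data sub-extent followed by a hole would merge to DATA without ZERO or a mapping, while two contiguously mapped data sub-extents would keep OFFSET_VALID with the first mapping's offset.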
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org