qemu-devel

Re: [PATCH] qemu-{img,nbd}: Don't report zeroed cluster as a hole


From: Eric Blake
Subject: Re: [PATCH] qemu-{img,nbd}: Don't report zeroed cluster as a hole
Date: Mon, 7 Jun 2021 16:22:24 -0500
User-agent: NeoMutt/20210205

On Mon, Jun 07, 2021 at 11:22:04PM +0300, Nir Soffer wrote:
> When zeroing a cluster in an image with backing file, qemu-img and
> qemu-nbd reported the area as a hole. This does not affect the guest
> since the area is read as zero, but breaks code trying to reconstruct
> the image chain based on qemu-img map or qemu-nbd block status response.

Trying to reconstruct the image chain based on qemu-nbd block status
should not be attempted on just base:allocation data, but should also
take into account qemu:allocation-depth.  From the perspective of the
core NBD protocol, there is no backing file, so trying to guess what
the backing file contains without using qemu extensions is unlikely to
be correct, as shown in your example.  The fact that you could abuse
it with qemu 5.2 but it broke in 6.0 is not necessarily the sign of a
regression in 6.0, but rather could be evidence that you have been
trying to use an undocumented implementation quirk rather than a
stable interface.

> 
> Here is a simpler reproducer:
> 
>     # Create a qcow2 image with a raw backing file:
>     $ qemu-img create base.raw $((4*64*1024))
>     $ qemu-img create -f qcow2 -b base.raw -F raw top.qcow2
> 
>     # Write to first 3 clusters of base:
>     $ qemu-io -f raw -c "write -P 65 0 64k" base.raw
>     $ qemu-io -f raw -c "write -P 66 64k 64k" base.raw
>     $ qemu-io -f raw -c "write -P 67 128k 64k" base.raw
> 
>     # Write to second cluster of top, hiding second cluster of base:
>     $ qemu-io -f qcow2 -c "write -P 69 64k 64k" top.qcow2
> 
>     # Write zeroes to third cluster of top, hiding third cluster of base:
>     $ qemu-io -f qcow2 -c "write -z 128k 64k" top.qcow2
> 
> This creates:
> 
>     top:  -D0-
>     base: ABC-
> 
> How current qemu-img and qemu-nbd report the state:
> 
>     $ qemu-img map --output json top.qcow2
>     [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
>     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": false},
>     { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]

Note how this one reports "depth":1 when the backing file is consulted...

> 
>     $ qemu-nbd -r -t -f qcow2 top.qcow2 &
>     $ qemu-img map --output json nbd://localhost
>     [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
>     { "start": 131072, "length": 131072, "depth": 0, "zero": true, "data": false, "offset": 131072}]

...but since NBD has no notion of a backing file, there is nothing
that qemu can do to report depth information itself.  If you want to
reconstruct the backing chain, you should be able to further query
qemu:allocation-depth, and piece the two queries together to get what
you need:

$ ./qemu-nbd -r -t -f qcow2 top.qcow2 -A
$ nbdinfo --map=qemu:allocation-depth nbd://localhost
         0      131072    1  local
    131072      131072    2  backing depth 2

However, _that_ output looks odd - it claims that clusters 0 and 1 are
local, and 2 and 3 come from a backing file.  Without reading code, I
would have expected something closer to the qcow2 view, claiming that
clusters 1 and 2 are local, while 0 and 3 come from a backing file (3
could also be reported as unallocated, but only if you use a qcow2 as
the backing file instead of raw, since we have no easy way to
determine which holes map to file system allocations in raw files).

/me goes to debug...  I'll need to reply in a later email when I've
spent more time on that.
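For illustration, piecing the two metacontexts together could look roughly like the sketch below (Python; the extent tuples and action names are invented for this example, not the libnbd API, and the qemu:allocation-depth values in the usage note assume the fixed per-cluster depths expected above):

```python
# Sketch: merge base:allocation extents with qemu:allocation-depth extents
# to decide, per range, how to reconstruct the top image.  Extents are
# modeled as (length, value) tuples; both lists are assumed to cover the
# same total range.  This models the idea only, not a real NBD client.

NBD_STATE_HOLE = 1  # base:allocation flag bits from the NBD protocol
NBD_STATE_ZERO = 2

def merge(base_alloc, alloc_depth):
    """base_alloc: [(length, flags)], alloc_depth: [(length, depth)].
    Yields (offset, length, action) over the intersection of the maps."""
    offset = i = j = 0
    alen, aflags = base_alloc[0]
    dlen, depth = alloc_depth[0]
    while True:
        length = min(alen, dlen)
        if depth == 0:
            action = "unallocated"      # no layer defines this range
        elif depth > 1:
            action = "from-backing"     # content comes from a backing layer
        elif aflags & NBD_STATE_ZERO:
            action = "write-zeroes"     # top layer defines zeroes
        else:
            action = "copy-data"        # top layer defines data
        yield offset, length, action
        offset += length
        alen -= length
        dlen -= length
        if alen == 0:
            i += 1
            if i == len(base_alloc):
                break
            alen, aflags = base_alloc[i]
        if dlen == 0:
            j += 1
            if j < len(alloc_depth):
                dlen, depth = alloc_depth[j]
```

With the base:allocation map above and depths of 2, 1, 1, 2 for the four clusters (i.e. clusters 1 and 2 local), this yields from-backing, copy-data, write-zeroes, from-backing, which is exactly the reconstruction of top.qcow2 that the reporter wants.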

[Oh, and that reminds me, I would love to patch nbdinfo to let --map
query all available contexts, not just base:allocation, without having
to explicitly name alternative --map=FOO... But it missed today's
stable release of libnbd 1.8]

[The same information can be obtained via qemu-img using
x-dirty-bitmap and --image-opts, but is so much more of a hack that
for now I will just refer to iotest 309 instead of spelling it out
here]

> 
>     $ nbdinfo --map nbd://localhost
>              0      131072    0  allocated
>         131072      131072    3  hole,zero

This faithfully reflects what qemu-img saw, which is all that the NBD
protocol lets us send without the use of extensions like
qemu:allocation-depth.
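For reference, nbdinfo's labels here are a direct rendering of the two base:allocation flag bits defined by the NBD protocol; a minimal sketch of the decoding:

```python
# Decode base:allocation extent flags the way nbdinfo prints them.
# The bit values are the ones the NBD protocol defines for this context.
NBD_STATE_HOLE = 1  # bit 0: no space allocated for this range
NBD_STATE_ZERO = 2  # bit 1: range reads as all zeroes

def describe(flags):
    names = []
    if flags & NBD_STATE_HOLE:
        names.append("hole")
    if flags & NBD_STATE_ZERO:
        names.append("zero")
    return ",".join(names) or "allocated"
```

So flags 0 renders as "allocated" and flags 3 as "hole,zero", matching the map above.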

> 
> The third extent is reported as a hole in both cases. In qmeu-nbd the

qemu

> cluster is merged with the fourth cluster, which is actually a hole.
> 
> This is incorrect since if it was a hole, the third cluster would be
> exposed to the guest. Programs using qemu-nbd output to reconstruct the
> image chain on other storage would be confused and copy only the first 2
> clusters. The result of this copy would be an image exposing the third
> cluster from the base image, corrupting the guest data.

This is where I disagree - if the NBD protocol exposed the notion of a
backing file, then reporting a local hole should indeed imply reading
from the backing file.  But since base NBD protocol does NOT support
backing images of any sort, you cannot make any assumptions about what
base:allocation says about holes, other than that the image presented
over NBD was not fully allocated in some manner at that location.  You
instead need to fix your tooling to query qemu:allocation-depth if you
are trying to reconstruct all state known by qemu.

> 
> I found that it can be fixed using BDRV_BLOCK_OFFSET_VALID when
> reporting the status of the extent. When we have a valid offset, we
> report based on BDRV_BLOCK_DATA. Otherwise we report based on
> BDRV_BLOCK_ALLOCATED.

That sounds hairy.  As an initial reaction, I'm not sure I like it.

> 
> With this fix we get:
> 
>     $ build/qemu-img map --output json top.qcow2
>     [{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 0},
>     { "start": 65536, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": 327680},
>     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true},

Earlier, this line was "data":false, which made sense - there was no
offset in either top.qcow2 nor in raw.base at which you could mmap to
read the actual zeroes that were implied by the unallocated zero
cluster.  Your change here is reporting "data":true to state that the
zeroes explicitly come from the "depth":0 layer, although it is a bit
misleading because we did not actually allocate clusters in top.qcow2
for reading the zeroes.  In reality, this really IS an instance of a
qcow2 unallocated cluster, where "data":false fits better for the
definitions in include/block/block.h.

>     { "start": 196608, "length": 65536, "depth": 1, "zero": true, "data": false, "offset": 196608}]
> 
>     $ build/qemu-nbd -r -t -f qcow2 top.qcow2 &
>     $ qemu-img map --output json nbd://localhost
>     [{ "start": 0, "length": 131072, "depth": 0, "zero": false, "data": true, "offset": 0},
>     { "start": 131072, "length": 65536, "depth": 0, "zero": true, "data": true, "offset": 131072},
>     { "start": 196608, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": 196608}]

Meanwhile, this output is indeed arguably more precise, although it
differs from the qcow2 output in that you provide an offset for
cluster 2.

> 
>     $ nbdinfo --map nbd://localhost
>              0      131072    0  allocated
>         131072       65536    2  zero
>         196608       65536    3  hole,zero
> 
> The issue was found by ovirt-imageio functional tests:
> https://github.com/oVirt/ovirt-imageio/blob/master/daemon/test/client_test.py
> 
> I did not update any of the existing tests, and I'm sure many tests are
> missing, and the documentation should change to describe the new
> behavior. Posting as is for early review.
> 
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> Resolves: https://bugzilla.redhat.com/1968693
> ---
>  nbd/server.c | 8 ++++++--
>  qemu-img.c   | 4 +++-
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index b60ebc3ab6..adf37905d5 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2127,8 +2127,12 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,
>              return ret;
>          }
>  
> -        flags = (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE) |
> -                (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> +        flags = (ret & BDRV_BLOCK_ZERO ? NBD_STATE_ZERO : 0);
> +
> +        if (ret & BDRV_BLOCK_OFFSET_VALID)
> +            flags |= (ret & BDRV_BLOCK_DATA ? 0 : NBD_STATE_HOLE);
> +        else
> +            flags |= (ret & BDRV_BLOCK_ALLOCATED ? 0 : NBD_STATE_HOLE);

This will fall apart on compressed or encrypted images, where data is
allocated but offset_valid is false.
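To see concretely what the hunk changes for the reproducer's zeroed cluster, the old and new computations can be modeled in a few lines of Python (the BDRV_BLOCK_* values below are hand-copied approximations of include/block/block.h and should be treated as illustrative):

```python
# Model of the NBD flag computation before and after the proposed hunk.
# Bit values mirror include/block/block.h (illustrative copies, not
# authoritative).
BDRV_BLOCK_DATA         = 0x01
BDRV_BLOCK_ZERO         = 0x02
BDRV_BLOCK_OFFSET_VALID = 0x04
BDRV_BLOCK_ALLOCATED    = 0x10
NBD_STATE_HOLE = 1
NBD_STATE_ZERO = 2

def old_flags(ret):
    # Pre-patch: a hole is any range without data.
    return ((0 if ret & BDRV_BLOCK_DATA else NBD_STATE_HOLE)
            | (NBD_STATE_ZERO if ret & BDRV_BLOCK_ZERO else 0))

def new_flags(ret):
    # Proposed: when there is no valid offset, fall back to ALLOCATED.
    flags = NBD_STATE_ZERO if ret & BDRV_BLOCK_ZERO else 0
    if ret & BDRV_BLOCK_OFFSET_VALID:
        flags |= 0 if ret & BDRV_BLOCK_DATA else NBD_STATE_HOLE
    else:
        flags |= 0 if ret & BDRV_BLOCK_ALLOCATED else NBD_STATE_HOLE
    return flags

# The zeroed cluster from the reproducer: reads as zero, allocated in
# the top layer, but with no valid host offset to read data from.
zeroed = BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED
```

Under this model, old_flags(zeroed) is 3 ("hole,zero") while new_flags(zeroed) is 2 ("zero"), matching the before/after nbdinfo maps shown in this thread.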

>  
>          if (nbd_extent_array_add(ea, num, flags) < 0) {
>              return 0;
> diff --git a/qemu-img.c b/qemu-img.c
> index a5993682aa..6808e12d87 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -3039,7 +3039,9 @@ static int get_block_status(BlockDriverState *bs, int64_t offset,
>      *e = (MapEntry) {
>          .start = offset,
>          .length = bytes,
> -        .data = !!(ret & BDRV_BLOCK_DATA),
> +        .data = !!(has_offset
> +            ? ret & BDRV_BLOCK_DATA
> +            : ret & BDRV_BLOCK_ALLOCATED),

I'm really not sure about this.  You are not only changing what
qemu-nbd advertises as a server, but also what qemu-img interprets as
a client.  Are you sure this will still work when you mix-and-match
old server + new client, or new server + old client?

>          .zero = !!(ret & BDRV_BLOCK_ZERO),
>          .offset = map,
>          .has_offset = has_offset,
> -- 
> 2.26.3
>

In short, I agree that the current situation is awkward, but I'm not
sure that this patch is right.  Rather, I'm wondering if we have a bug
in qemu:allocation-depth; once that is fixed, you should be able to use
it alongside base:allocation when deciding how to reconstruct a qcow2
backing chain using only information learned over NBD.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



