Re: Inconsistent block status reply in qemu-nbd in qemu-img-6.2.0


From: Nir Soffer
Subject: Re: Inconsistent block status reply in qemu-nbd in qemu-img-6.2.0
Date: Mon, 17 Jan 2022 12:12:34 +0200

On Mon, Jan 17, 2022 at 10:46 AM Hanna Reitz <hreitz@redhat.com> wrote:
>
> On 16.01.22 19:09, Nir Soffer wrote:
> > On Sun, Jan 16, 2022 at 1:17 PM Nir Soffer <nsoffer@redhat.com> wrote:
> >> Some of our tests started to fail since qemu-img 6.2.0 released.
> >>
> >> Here is an example failure:
> >>
> >> $ truncate -s1g /data/scratch/empty-1g.raw
> >>
> >> $ qemu-nbd -r -t -e8 -k /tmp/nbd.sock -A -f raw /data/scratch/empty-1g.raw &
> > git bisect points to this commit:
> >
> > $ git bisect good
> > 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2 is the first bad commit
> > commit 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2
> > Author: Hanna Reitz <hreitz@redhat.com>
> > Date:   Thu Aug 12 10:41:44 2021 +0200
> >
> >      block: block-status cache for data regions
> >
> > The commit message says:
> >
> >      (Note that only caching data regions but not zero regions means that
> >      returning false information from the cache is not catastrophic:
> >      Treating zeroes as data is fine.  While we try to invalidate the
> >      cache on zero writes and discards, such incongruences may still
> >      occur when there are other processes writing to the image.)
> >
> > I don't think it is fine to report zero as data. This can cause severe
> > performance issues when users copy unneeded zero extents over the
> > network, instead of doing no work with zero bandwidth.
>
> You’re right, it was meant as a contrast to whether this might lead to
> catastrophic data failure.
>
> But it was also meant as a risk estimation.  There wasn’t supposed to be
> a situation where the cache would report zeroes as data (other than with
> external writers), and so that’s why I thought it’d be fine.  But, well,
> things are supposed to be one way, and then you (me) write buggy code...
>
> Thanks for the reproducer steps, I can reproduce the bug with your
> script (not with nbdinfo fwiw) and the problem seems to be this:
>
> diff --git a/block/io.c b/block/io.c
> index bb0a254def..83e94faeaf 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2498,7 +2498,8 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>               * the cache requires an RCU update, so double check here to avoid
>               * such an update if possible.
>               */
> -            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
> +            if (want_zero &&
> +                ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
>                  QLIST_EMPTY(&bs->children))
>              {
>                  /*
>
> We should update the cache only when we have accurate zero information,
> so only if want_zero was true.

Indeed, if I disable --allocation-depth in qemu-nbd, the issue cannot be
reproduced.

$ cat reproduce.py
import logging
import subprocess
import time

from ovirt_imageio._internal import nbd

logging.basicConfig(level=logging.DEBUG)

# Serve the empty raw image read-only; removing the "##" below enables
# --allocation-depth and makes the assertion at the end fail.
qemu_nbd = subprocess.Popen([
    "./qemu-nbd",
    "--read-only",
    "--persistent",
    "--shared=8",
    ##"--allocation-depth",
    "--socket=/tmp/nbd.sock",
    "--format=raw",
    "/data/scratch/empty-1g.raw"
])

time.sleep(0.2)

# Query the same range twice on one connection; both replies should match.
with nbd.Client(nbd.UnixAddress("/tmp/nbd.sock")) as c1:
    first = c1.extents(0, 128*1024**2)
    second = c1.extents(0, 128*1024**2)

qemu_nbd.kill()
qemu_nbd.wait()

assert first == second, f"Expected {first}, got {second}"


So this is a result of getting allocation depth after getting base allocation.
The allocation-depth reply is computed by bdrv_is_allocated_above(), which
queries block status without want_zero, so the empty image is reported as
data; with the current cache update that bogus result is cached and served to
the next base:allocation query:

static int blockalloc_to_extents(BlockDriverState *bs, uint64_t offset,
                                 uint64_t bytes, NBDExtentArray *ea)
{
    while (bytes) {
        int64_t num;
        int ret = bdrv_is_allocated_above(bs, NULL, false, offset, bytes,
                                          &num);
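
For reference, the same ordering can be observed from any NBD client that
requests both meta contexts. A minimal sketch using the libnbd Python
bindings (assumed installed; the module is also called "nbd" but is unrelated
to the ovirt_imageio client above), against a qemu-nbd started with -A on
/tmp/nbd.sock as in the shell commands above:

import nbd  # libnbd Python bindings (assumption, not ovirt_imageio)

replies = []

def collect(metacontext, offset, entries, err):
    # entries is a flat list alternating extent length and flags.
    replies.append((metacontext, offset, list(entries)))

h = nbd.NBD()
h.add_meta_context("base:allocation")
h.add_meta_context("qemu:allocation-depth")  # served by the function quoted above
h.connect_unix("/tmp/nbd.sock")

# Same range twice on one connection; with the buggy cache the second
# base:allocation reply may report data where the first reported zeroes.
h.block_status(128 * 1024**2, 0, collect)
h.block_status(128 * 1024**2, 0, collect)
h.shutdown()

for r in replies:
    print(r)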

We can mitigate this in oVirt by disabling --allocation-depth for raw images,
since it does not provide any value in this case.
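
As an illustration, a hypothetical helper (not actual ovirt_imageio code)
that requests allocation depth only for qcow2 images could look like this:

def qemu_nbd_cmd(image, fmt, sock):
    # Hypothetical helper: build the qemu-nbd command line, requesting
    # qemu:allocation-depth only for qcow2, where backing chains make the
    # extra context useful; for raw images it adds nothing.
    cmd = [
        "qemu-nbd",
        "--read-only",
        "--persistent",
        "--shared=8",
        f"--socket={sock}",
        f"--format={fmt}",
    ]
    if fmt == "qcow2":
        cmd.append("--allocation-depth")
    cmd.append(image)
    return cmd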

> > Do we have a way to disable this cache, at least for nbd?
>
> No, unfortunately not.  We could implement it, but I think the
> management layer would need to specify it for all protocol nodes where
> it wants the cache to be disabled.
>
> Auto-detecting would be difficult for qemu, because it would basically
> mean detecting whether there are any exports in the block graph
> somewhere above the node in question (not even immediately above it,
> because for example in your example there’s a raw node between the
> export and the protocol node whose block-status information we’re caching).
>
> I assume you’re now very skeptical of the cache, so even if we implement
> a fix for this problem, you’ll probably still want that option to
> disable it, right?

It would be safer if we could consume optimizations like this gradually, so
that in case of issues like this, the admin can disable the cache in the
field. But this comes at a high cost in testing - you need to test both with
and without the cache, and this does not scale.

Nir



