qemu-block

Re: Inconsistent block status reply in qemu-nbd in qemu-img-6.2.0


From: Hanna Reitz
Subject: Re: Inconsistent block status reply in qemu-nbd in qemu-img-6.2.0
Date: Mon, 17 Jan 2022 10:55:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.3.0

On 17.01.22 10:52, Kevin Wolf wrote:
> On 17.01.2022 at 09:46, Hanna Reitz wrote:
>> On 16.01.22 19:09, Nir Soffer wrote:
>>> On Sun, Jan 16, 2022 at 1:17 PM Nir Soffer <nsoffer@redhat.com> wrote:
>>>> Some of our tests started to fail since qemu-img 6.2.0 was released.
>>>>
>>>> Here is an example failure:
>>>>
>>>> $ truncate -s1g /data/scratch/empty-1g.raw
>>>>
>>>> $ qemu-nbd -r -t -e8 -k /tmp/nbd.sock -A -f raw /data/scratch/empty-1g.raw &
>>> git bisect points to this commit:
>>>
>>> $ git bisect good
>>> 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2 is the first bad commit
>>> commit 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2
>>> Author: Hanna Reitz <hreitz@redhat.com>
>>> Date:   Thu Aug 12 10:41:44 2021 +0200
>>>
>>>     block: block-status cache for data regions
>>>
>>> The commit message says:
>>>
>>>     (Note that only caching data regions but not zero regions means that
>>>     returning false information from the cache is not catastrophic: Treating
>>>     zeroes as data is fine.  While we try to invalidate the cache on zero
>>>     writes and discards, such incongruences may still occur when there are
>>>     other processes writing to the image.)
>>>
>>> I don't think it is fine to report zero as data. This can cause severe
>>> performance issues when users copy unneeded zero extents over the network,
>>> instead of doing no work with zero bandwidth.
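
To make the cost concrete, here is a minimal C sketch of the kind of copy
loop affected (the extent list and the two transfer helpers are hypothetical
stand-ins, not QEMU or libnbd API): a client that is told "zero" can punch a
hole locally and transfer nothing, while a client that is told "data" must
move every byte over the network.

#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct extent {
    int64_t offset;
    int64_t length;
    bool zero;   /* block status reported "zero" (e.g. NBD_STATE_ZERO) */
};

/* Hypothetical transfer helpers, for illustration only. */
static void copy_over_network(int64_t off, int64_t len)
{
    printf("read+write %" PRId64 " bytes at %" PRId64 "\n", len, off);
}

static void write_zeroes_locally(int64_t off, int64_t len)
{
    printf("punch %" PRId64 "-byte hole at %" PRId64 " (no transfer)\n",
           len, off);
}

int main(void)
{
    /* The 1 GiB image from the reproducer is entirely zero; with the
     * correct reply (zero = true) this loop uses no bandwidth at all.
     * With the buggy reply (zero = false) it copies the full gigabyte. */
    struct extent map[] = {
        { 0, (int64_t)1 << 30, true },
    };

    for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
        if (map[i].zero) {
            write_zeroes_locally(map[i].offset, map[i].length);
        } else {
            copy_over_network(map[i].offset, map[i].length);
        }
    }
    return 0;
}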
>> You’re right, it was meant as a contrast to whether this might lead to
>> catastrophic data failure.
>>
>> But it was also meant as a risk estimation.  There wasn’t supposed to be a
>> situation where the cache would report zeroes as data (other than with
>> external writers), and so that’s why I thought it’d be fine.  But, well,
>> things are supposed to be one way, and then you (me) write buggy code...
>>
>> Thanks for the reproducer steps; I can reproduce the bug with your script
>> (not with nbdinfo, FWIW), and the problem seems to be this:
>>
>> diff --git a/block/io.c b/block/io.c
>> index bb0a254def..83e94faeaf 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -2498,7 +2498,8 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>               * the cache requires an RCU update, so double check here to avoid
>>               * such an update if possible.
>>               */
>> -            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
>> +            if (want_zero &&
>> +                ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
>>                  QLIST_EMPTY(&bs->children))
>>              {
>>                  /*
>>
>> We should update the cache only when we have accurate zero information, so
>> only if want_zero was true.
> Either that, or store the want_zero value as well and compare it before
> using the cache.
>
> Depends on whether we need the optimisation for want_zero=false cases,
> too. I think this mostly means bdrv_is_allocated(_above)(), which seems
> to be relevant for the commit and stream jobs and 'qemu-img rebase' at
> least, and some less commonly used stuff. There are some other cases of
> is_allocated (like in mirror) where it doesn't make a difference because
> we always process the maximum number of bytes with the first call.

I think we need the cache only for want_zero=true cases, because allocation information is handled by the format layer, which doesn’t use the cache at all.  Also, the reason for the cache was to skip slow SEEK_HOLE/DATA calls on the protocol level, which is a want_zero=true case.
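
As a rough illustration of that split, a simplified, self-contained model of
the cache follows; the names and structure are approximations for this
sketch only, not the actual QEMU code (which lives in block/io.c and
protects the cached range with RCU).

#include <stdbool.h>
#include <stdint.h>

struct bsc {
    bool valid;
    int64_t data_start;   /* cached region [data_start, data_end) is data */
    int64_t data_end;
};

static struct bsc cache;

/* If the cache covers @offset, report data up to the cached end. */
static bool bsc_is_data(int64_t offset, int64_t *pnum)
{
    if (cache.valid && offset >= cache.data_start && offset < cache.data_end) {
        *pnum = cache.data_end - offset;
        return true;
    }
    return false;
}

/*
 * The fix under discussion: only want_zero=true replies may fill the
 * cache.  A want_zero=false reply is allowed to report zeroes as data,
 * so caching it would later leak "data" to want_zero=true callers,
 * which is exactly the inconsistency reported in this thread.
 */
static void bsc_fill(bool want_zero, int64_t offset, int64_t bytes)
{
    if (!want_zero) {
        return;
    }
    cache.valid = true;
    cache.data_start = offset;
    cache.data_end = offset + bytes;
}

int main(void)
{
    int64_t pnum;

    bsc_fill(false, 0, (int64_t)1 << 30);  /* ignored: possibly inaccurate */
    bsc_fill(true, 0, (int64_t)1 << 30);   /* accurate: cached */
    return bsc_is_data(512, &pnum) ? 0 : 1;
}

With the fill gated this way, anything the cache reports as data really is
data, so lookups stay safe for every caller; the open question above is only
whether the want_zero=false paths benefit from the cache at all.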

>>> Do we have a way to disable this cache, at least for nbd?
>> No, unfortunately not.  We could implement it, but I think the management
>> layer would need to specify it for all protocol nodes where it wants the
>> cache to be disabled.
>>
>> Auto-detecting would be difficult for qemu, because it would basically mean
>> detecting whether there are any exports in the block graph somewhere above
>> the node in question (not even immediately above it; in your example, for
>> instance, there's a raw node between the export and the protocol node whose
>> block-status information we're caching).
>>
>> I assume you're now very skeptical of the cache, so even if we implement a
>> fix for this problem, you'll probably still want that option to disable it,
>> right?
> I don't think we can realistically provide an option for every
> individual optimisation where there could be a bug. We just need to fix
> the problem and make sure to return correct data when there is no
> external writer.
>
> Kevin