qemu-block

Re: [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above
Date: Wed, 20 May 2020 00:13:45 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

19.05.2020 23:41, Eric Blake wrote:
On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:
bdrv_co_block_status_above has several problems with handling short
backing files:

1. With want_zero=true, it may return ret with BDRV_BLOCK_ZERO but
without the BDRV_BLOCK_ALLOCATED flag, even though the short backing file
that produces these after-EOF zeros is inside the requested backing
sequence.

That's intentional.  That portion of the guest-visible data reads as zero 
(BDRV_BLOCK_ZERO set) but was NOT read from the top layer, but rather 
synthesized by the block layer because it derived from the backing file but was 
beyond EOF of that backing layer (BDRV_BLOCK_ALLOCATED is clear).

Not in top, yes, but _inside_ the requested base..top part of the backing 
chain. So it should be considered ALLOCATED, as we should not go any further 
down the backing chain.

Assume the following chain:

top    aa--
middle bb
base   xxxx

(so, middle is short)

block_status(top, 2) should return ZERO without ALLOCATED: yes, it's ZERO, and 
yes, it comes from another layer.

block_status_above(top, base, 2) should return ZERO with ALLOCATED: it's ZERO, 
and it's produced inside the requested backing-chain region (in fact, it's 
produced because of the short middle node). We must report ALLOCATED to show 
that we are not going to read from base.
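
To make the distinction concrete, here is a minimal sketch in C of a toy model of the chain above. It is not QEMU code: ToyNode, toy_block_status() and toy_block_status_above() are hypothetical names, each cluster is one character, and the TOY_* flags only stand in for BDRV_BLOCK_ZERO and BDRV_BLOCK_ALLOCATED. It encodes the semantics argued for here, under the assumption that a short node inside the requested range is what "produces" the zeroes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy flags: hypothetical stand-ins for BDRV_BLOCK_ZERO / BDRV_BLOCK_ALLOCATED. */
enum { TOY_ZERO = 1, TOY_ALLOCATED = 2 };

/* One node of a toy backing chain; '-' marks an unallocated cluster. */
typedef struct ToyNode {
    const char *clusters;    /* one character per cluster */
    size_t len;              /* node length in clusters */
    struct ToyNode *backing; /* next node down the chain, or NULL */
} ToyNode;

/* Single-node status, modelled on block_status(node, idx); needs idx < n->len. */
static int toy_block_status(const ToyNode *n, size_t idx)
{
    if (n->clusters[idx] != '-') {
        return TOY_ALLOCATED;
    }
    /* Unallocated here: reads as zero if there is nothing to defer to. */
    if (n->backing == NULL || idx >= n->backing->len) {
        return TOY_ZERO;
    }
    return 0;
}

/* Status for the nodes from top down to base (base itself excluded). */
static int toy_block_status_above(const ToyNode *top, const ToyNode *base,
                                  size_t idx)
{
    int ret = 0;

    for (const ToyNode *p = top; p != NULL && p != base; p = p->backing) {
        if (idx >= p->len) {
            /* p is short: the zeroes are produced at this level. */
            return TOY_ZERO | TOY_ALLOCATED;
        }
        ret = toy_block_status(p, idx);
        if (ret & TOY_ALLOCATED) {
            return ret;
        }
    }
    /* Nothing inside the range decided; report the last node's status. */
    return ret;
}
```

With top = "aa--", middle = "bb" and base = "xxxx", the single-node query for cluster 2 gives TOY_ZERO alone, while the query over the range above base gives TOY_ZERO | TOY_ALLOCATED, because the walk stops at the short middle node and never reaches base.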



2. With want_zero=false, it may return pnum=0 prior to actual EOF,
because of EOF of short backing file.

Do you have a reproducer for this?

No, I don't have one, but it seems possible at least with want_zero=false. I'll 
think of it tomorrow, too tired now.

In my experience, this is not possible.  Generally, if you request status that 
overlaps EOF of the backing, you get a response truncated to the end of the 
backing, and you are then likely to follow up with a subsequent status request 
starting from the underlying EOF which then sees the desired unallocated zeroes:

back     xxxx
top      yy------
request    ^^^^^^
response   ^^
request      ^^^^
response     ^^^^
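
The request/response pattern in this diagram can be sketched as a toy model in C. Everything here is a simplifying assumption rather than QEMU's implementation: ToyImage, toy_query() and toy_walk() are hypothetical names, the top layer is allocated only in a prefix, and the TOY_* flags stand in for the BDRV_BLOCK_* flags. The point is only that a response truncated at the backing EOF still has pnum > 0, so the caller keeps making progress.

```c
#include <assert.h>
#include <stdint.h>

/* Toy flags: hypothetical stand-ins for BDRV_BLOCK_ZERO / BDRV_BLOCK_ALLOCATED. */
enum { TOY_ZERO = 1, TOY_ALLOCATED = 2 };

typedef struct ToyImage {
    int64_t top_len;       /* guest-visible size of the top layer */
    int64_t top_alloc_end; /* [0, top_alloc_end) is allocated in top */
    int64_t backing_len;   /* backing file length, may be shorter than top */
} ToyImage;

static int64_t toy_min(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/* Report the status at offset and how many bytes (*pnum) share it. */
static int toy_query(const ToyImage *img, int64_t offset, int64_t bytes,
                     int64_t *pnum)
{
    int64_t end = toy_min(offset + bytes, img->top_len);

    if (offset >= end) {
        *pnum = 0; /* at or past the real EOF of top */
        return 0;
    }
    if (offset < img->top_alloc_end) {
        *pnum = toy_min(end, img->top_alloc_end) - offset;
        return TOY_ALLOCATED;
    }
    if (offset < img->backing_len) {
        /* Defer to the backing file; the answer is truncated at its EOF. */
        *pnum = toy_min(end, img->backing_len) - offset;
        return 0;
    }
    /* Unallocated in top and beyond the backing EOF: synthesized zeroes. */
    *pnum = end - offset;
    return TOY_ZERO;
}

/* Walk [0, top_len); return the number of queries, or -1 if one stalls. */
static int toy_walk(const ToyImage *img)
{
    int64_t offset = 0;
    int queries = 0;

    while (offset < img->top_len) {
        int64_t pnum;

        toy_query(img, offset, img->top_len - offset, &pnum);
        if (pnum == 0) {
            return -1; /* pnum == 0 before EOF: the case under discussion */
        }
        offset += pnum;
        queries++;
    }
    return queries;
}
```

For the diagram's layout (top of 8 units with 2 allocated, backing of 4 units), a request at offset 2 for 6 units is answered with 2 units, and the follow-up request at offset 4 is answered with 4 units of zeroes.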


Fix these things, making logic about short backing files clearer.

Note that 154 output changed, because now bdrv_block_status_above don't

doesn't

merge unallocated zeros with zeros after EOF (which are actually
"allocated" in POV of read from backing-chain top) and is_zero() just
don't understand that the whole head or tail is zero. We may update
is_zero to call bdrv_block_status_above several times, or add flag to
bdrv_block_status_above that we are not interested in ALLOCATED flag,
so ranges with different ALLOCATED status may be merged, but actually,
it seems that we'd better don't care about this corner case.

This actually sounds like an avoidable regression.  :(

I don't see a real problem in it. But it doesn't seem hard to avoid, so I will 
try to.
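
One of the options mentioned in the commit message, having is_zero call the status function several times, could look roughly like the sketch below. It is a toy model in C and not the actual is_zero() code: ToyExtent, toy_status() and toy_is_zero() are hypothetical names, the extent map is a plain array, and the TOY_* flags stand in for the BDRV_BLOCK_* flags. The loop ignores ALLOCATED and only asks whether every piece of the range reports ZERO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy flags: hypothetical stand-ins for BDRV_BLOCK_ZERO / BDRV_BLOCK_ALLOCATED. */
enum { TOY_ZERO = 1, TOY_ALLOCATED = 2 };

/* A run of bytes that share one status. */
typedef struct ToyExtent {
    int64_t length;
    int flags;
} ToyExtent;

/*
 * Toy status query: return the flags of the extent containing offset and
 * set *pnum to the bytes left in that extent, clipped to the request.
 * Adjacent extents are never merged, even if both are zero.
 */
static int toy_status(const ToyExtent *map, size_t n, int64_t offset,
                      int64_t bytes, int64_t *pnum)
{
    int64_t start = 0;

    for (size_t i = 0; i < n; i++) {
        int64_t end = start + map[i].length;

        if (offset < end) {
            int64_t avail = end - offset;

            *pnum = avail < bytes ? avail : bytes;
            return map[i].flags;
        }
        start = end;
    }
    *pnum = 0; /* offset is past the last extent */
    return 0;
}

/* True if the whole range reads as zero, whatever its ALLOCATED status. */
static bool toy_is_zero(const ToyExtent *map, size_t n, int64_t offset,
                        int64_t bytes)
{
    while (bytes > 0) {
        int64_t pnum;
        int flags = toy_status(map, n, offset, bytes, &pnum);

        if (pnum == 0 || !(flags & TOY_ZERO)) {
            return false;
        }
        offset += pnum;
        bytes -= pnum;
    }
    return true;
}
```

With a map of an unallocated zero extent followed by an "allocated" zero extent, a single status call covers only the first extent, but the loop still concludes that the whole range is zero.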


I argue that if we did not explicitly write data/zero clusters in the tail of 
the top layer, then those clusters are not allocated from the POV of reading 
from the backing-chain top.  Yes, we know what their contents will be, but we 
also know what the contents of unallocated clusters will be when there is no 
backing file at all - basically, after your other patch series to drop 
unallocated_blocks_are_zero:
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg05429.html
then we know that only format drivers that can support backing files even care 
what allocation means, and 'allocated' strictly means that the data comes from 
the top layer rather than from a backing (whether directly from the backing, or 
synthesized as zero by the block layer because it was beyond EOF of the 
backing).

I agree about allocated in top, returned by block_status. But this patch is for 
allocated_above, and the ALLOCATED status is not about top, but about a set of 
nodes from base (not inclusive) to top.



Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
---
  block/io.c                 | 38 +++++++++++++++++++++++++++++---------
  tests/qemu-iotests/154.out |  4 ++--
  2 files changed, 31 insertions(+), 11 deletions(-)


I'm already not a fan of this patch - it adds lines rather than removes, and 
seems to add a regression.

diff --git a/block/io.c b/block/io.c
index 121ce17a49..db990e812b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2461,25 +2461,45 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
          ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
                                     file);
          if (ret < 0) {
-            break;
+            return ret;
          }
-        if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+        if (*pnum == 0) {
+            if (first) {
+                return ret;
+            }
+
              /*
-             * Reading beyond the end of the file continues to read
-             * zeroes, but we can only widen the result to the
-             * unallocated length we learned from an earlier
-             * iteration.
+             * Reads from bs for the selected region will return zeroes,
+             * produced because the current level is short. We should consider
+             * it as allocated.

Why?  If we replaced the backing file to something longer (qemu-img rebase -u), 
we would WANT to read from the backing file.  The only reason we read zero is 
because the block layer synthesized it _while_ deferring to the backing layer, 
not because it was directly allocated in the top layer.

No, if we replace the backing file of the current layer, nothing will change, 
as _this_ layer is short, not its backing. Or which backing file do you mean? 
If you mean the current bs, then replacing it doesn't make sense in this 
context, as block_status_above was asked about the current bs (as part of the 
base..top range), not some other node.


+             *
+             * TODO: Should we report p as file here?

No. Reporting 'file' only makes sense if you can point to an offset within that 
file that would read the guest-visible data in question - but when the data is 
synthesized, there is no such offset.

I don't know. It still adds some information about which level is responsible 
for these ZEROES. Kevin argued that it makes sense.


               */
+            assert(ret & BDRV_BLOCK_EOF);
              *pnum = bytes;
+            return BDRV_BLOCK_ZERO | BDRV_BLOCK_ALLOCATED;
          }
-        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
-            break;
+        if (ret & BDRV_BLOCK_ALLOCATED) {
+            /* We've found the node and the status, we must return. */
+
+            if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
+                /*
+                 * This level is also responsible for reads after EOF inside
+                 * the unallocated region in the previous level.
+                 */
+                *pnum = bytes;
+            }
+
+            return ret;
          }
+
          /* [offset, pnum] unallocated on this layer, which could be only
           * the first part of [offset, bytes].  */
-        bytes = MIN(bytes, *pnum);
+        assert(*pnum <= bytes);
+        bytes = *pnum;
          first = false;
      }
+
      return ret;
  }
diff --git a/tests/qemu-iotests/154.out b/tests/qemu-iotests/154.out
index fa3673317f..a203dfcadd 100644
--- a/tests/qemu-iotests/154.out
+++ b/tests/qemu-iotests/154.out
@@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  2048/2048 bytes allocated at offset 128 MiB
  [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]

The fact that we no longer see zeroes in the tail of the file makes me think 
this patch is wrong.

  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
  wrote 512/512 bytes at offset 134219264
  512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  2048/2048 bytes allocated at offset 128 MiB
  [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
-{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
+{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134219776 backing_file=TEST_DIR/t.IMGFMT.base
  wrote 1024/1024 bytes at offset 134218240
  1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)




--
Best regards,
Vladimir


