Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment

From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment
Date: Thu, 25 Feb 2021 17:55:21 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1

18.02.2021 23:15, Eric Blake wrote:
Previous patches mentioned how the blkdebug filter driver demonstrates
a bug present in our NBD server (for example, commit b0245d64); the
bug is also present with the raw format driver when probing
occurs. Basically, if we specify a particular alignment > 1, but defer
the actual block status to the underlying file, and the underlying
file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
to the underlying file can end up with status split at an alignment
unacceptable to our layer.  Many callers don't care, but NBD does - it
is a violation of the NBD protocol to send status or read results
split on an unaligned boundary (in 737d3f5244, we taught our 4.0
client to be tolerant of such violations because the problem was even
more pronounced with qemu 3.1 as server; but we still need to fix our
server for the sake of other stricter clients).

This patch lays the groundwork - it adjusts bdrv_block_status to
ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
the deferral is either truncated down to an aligned boundary, or
multiple sub-aligned requests are coalesced into a single
representative answer (using an implicit hole beyond EOF as
needed). Iotest 241 exposes the effect (when format probing occurred,
we don't want block status to subdivide the initial sector, and thus
not any other sector either). Similarly, iotest 221 is a good
candidate to amend to specifically track the effects; a change to a
hole at EOF is not visible if an upper layer does not want alignment
smaller than 512. However, this patch alone is not a complete fix - it
is still possible to have an active layer with large alignment
constraints backed by another layer with smaller constraints; so the
next patch will complete the task.

In particular, the next patch will introduce some mutual recursion
(bdrv_co_common_block_status_above will call this new function rather
than directly into bdrv_co_block_status), so some conditions added
here (such as a NULL pointer for map or handling a length-0 request)
are not reachable until that point.

There is one interesting corner case: prior to this patch, ALLOCATED
was never returned without either DATA or ZERO also set. But now, if
we have two subregions where the first reports status 0 (not
allocated), and the second reports ZERO|ALLOCATED but not DATA
(preallocated, read sees zero but underlying file has indeterminate
contents), then we can end up with a result of just
ALLOCATED. However, looking at callers of bdrv_is_allocated does not
find any problem with this new return value. What's more, this
situation doesn't really happen until the next patch adds support for
getting aligned status from backing files (as the use of aligned
status in this patch tends to be limited to just the protocol child of
a format driver, yet protocol drivers tend to report fully allocated,
and only format drivers have unallocated clusters that defer to a
backing file in the first place).

Signed-off-by: Eric Blake <eblake@redhat.com>
  block/io.c                 | 142 +++++++++++++++++++++++++++++++++++--
  tests/qemu-iotests/221     |  13 ++++
  tests/qemu-iotests/221.out |   9 +++
  tests/qemu-iotests/241.out |   3 +-
  4 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index ca2dca30070e..4bca775c96b4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2325,6 +2325,132 @@ int bdrv_flush_all(void)
      return result;

+static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+                                             bool want_zero,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum, int64_t *map,
+                                             BlockDriverState **file);
+/**
+ * Returns an aligned allocation status of the specified disk region.
+ *
+ * Wrapper around bdrv_co_block_status() which requires the initial
+ * @offset and @count to be aligned to @align (must be power of 2),
+ * and guarantees the resulting @pnum will also be aligned; this may
+ * require piecing together multiple sub-aligned queries into an
+ * appropriate coalesced answer, as follows:
+ *
+ * - BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
+ * - BDRV_BLOCK_ZERO is set only if the flag is set for all subregions
+ * - BDRV_BLOCK_OFFSET_VALID is set only if all subregions are contiguous
+ *   from the same file (@map and @file are then from the first subregion)
+ * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one subregion

Hmm about this..

We already have a mess around ALLOCATED:

 [1] for format drivers it means "the read is handled by this layer, not by
the backing file", i.e. data (or zero) is placed exactly at that layer of the
backing chain

 [2] for protocol drivers it's up to the driver, which may always report
ALLOCATED (being more compatible with format drivers), or may sometimes
return UNALLOCATED to show that the data is not allocated in the filesystem.

And this way, bdrv_co_block_status_aligned() is compatible with protocol
drivers but not with format drivers (as you can't combine the "go-to-backing"
information of several subregions: for some scenarios it's safer to consider
the whole region ALLOCATED, and for others it's safer to consider it
UNALLOCATED).

For example, for a stream target it's safer to consider the block UNALLOCATED
and do an extra copy-on-read operation. And for a stream base it's safer to
consider the block ALLOCATED (and again do extra copying rather than miss
something significant).

I think, to avoid increasing the mess, we should first split [1] from [2].
Assume we change it to BDRV_BLOCK_PROTOCOL_ALLOCATED and
BDRV_BLOCK_GO_TO_BACKING.

Then, for BDRV_BLOCK_PROTOCOL_ALLOCATED we can probably just report
BDRV_BLOCK_PROTOCOL_ALLOCATED if at least one of the extents reports
BDRV_BLOCK_PROTOCOL_ALLOCATED. (Probably we don't need
BDRV_BLOCK_PROTOCOL_ALLOCATED at all and can drop this logic.)

But for BDRV_BLOCK_GO_TO_BACKING we'd also have to add a rule like:

 * <nothing> if some extents report BDRV_BLOCK_GO_TO_BACKING but others don't.

Hmm. And I think there is a problem in the NBD protocol itself. With the
allocation-depth context we started to report internal layers of the backing
chain. And if we have layers with different request-alignment, it's not
correct to report allocation-depth "aligned" to the top image's
request-alignment. So, for allocation-depth to work correctly, we should
extend the NBD protocol to allow unaligned chunks in the BLOCK_STATUS reply.

So, finally:

1. making bitmap export extents aligned is a separate simple thing - that's OK

2. making base:allocation aligned is possible thanks to the good
NBD_STATE_HOLE definition. So for it it's correct to combine
BDRV_BLOCK_ALLOCATED as you do, even keeping format layers in mind. We call
block_status_above for the whole chain; ALLOCATED=0 only if all format layers
refer to backing and the bottom protocol driver (if it exists) reports
"unallocated in the FS", which corresponds to:

"If an extent is marked with NBD_STATE_HOLE at that context, this means that the 
given extent is not allocated in the backend storage, and that writing to the extent MAY 
result in the NBD_ENOSPC error"

   And this way, I'd prefer to fix exactly the base:allocation context
handling in the nbd-server, not touching the generic block_status code, which
already has enough mess.

3. fixing qemu:allocation-depth I think is impossible without allowing
unaligned chunks in the NBD spec (really, why should we restrict all possible
metadata contexts so hard?). Or, if we are still going to align
allocation-depth results to the top layer's request-alignment, we should
change the allocation-depth specification so that it's not the "actual
allocation depth" but something >= the actual allocation depth of all
subchunks... And that becomes a much weaker guarantee.

Best regards,
Vladimir
