Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment
Date: Thu, 25 Feb 2021 19:23:41 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1

25.02.2021 19:03, Eric Blake wrote:
On 2/25/21 8:55 AM, Vladimir Sementsov-Ogievskiy wrote:
18.02.2021 23:15, Eric Blake wrote:
Previous patches mentioned how the blkdebug filter driver demonstrates
a bug present in our NBD server (for example, commit b0245d64); the
bug is also present with the raw format driver when probing
occurs. Basically, if we specify a particular alignment > 1, but defer
the actual block status to the underlying file, and the underlying
file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
to the underlying file can end up with status split at an alignment
unacceptable to our layer.  Many callers don't care, but NBD does - it
is a violation of the NBD protocol to send status or read results
split on an unaligned boundary (in 737d3f5244, we taught our 4.0
client to be tolerant of such violations because the problem was even
more pronounced with qemu 3.1 as server; but we still need to fix our
server for the sake of other stricter clients).


+ * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one
subregion

Hmm about this..

We already have mess around ALLOCATED:

  [1] for format drivers it means that "read is handled by this layer,
not by backing", i.e. the data (or zero) is placed exactly at that layer
of the backing chain

If we're reading at a given granularity, then the 4k read is satisfied
at this layer even if portions of the read came from lower layers.  So
the logic works here.

Hmm.. I can't agree. By that logic we can say that everything is satisfied at
this layer: even if the layer contains no data, we read from it and it somehow
fetches the data from lower layers.

It's all about terminology, of course, and we can use the same terms for
different things. For me, ALLOCATED for a format layer works as follows:

The whole layer is split into clusters. Each cluster is either ALLOCATED
(the format layer produces data on read somehow, without touching the backing
child) or UNALLOCATED (the format layer just calls read() on the backing child
at the same offset).

And before your patch, a block_status request never combined clusters with
different ALLOCATED status.

The ALLOCATED status of blocks at some layer of the backing chain is
significant for block jobs, and if we have several sequential chunks with
different ALLOCATED status, we can't simply consider all of them ALLOCATED,
because in some scenarios that would lead to data loss.



  [2] for protocol drivers it's up to the driver, which may always report
ALLOCATED (being more compatible with format drivers) or may sometimes
return UNALLOCATED to show that the data is not allocated in the filesystem.

We've been moving away from this particular overload.  What's more, most
protocol drivers that set it at all set it for every single byte,
because protocol layers don't have a notion of a backing file; which
means that if it is set at all, it will be set for every byte anyway, so
the logic works here.


And this way, bdrv_co_block_status_aligned() is compatible with protocol
drivers but not with format drivers, as you can't combine the
"go-to-backing" information of several extents: for some scenarios it's
safer to consider the whole region ALLOCATED, and for others it's safer
to consider it UNALLOCATED.

For example, for a stream target it's safer to consider the block
UNALLOCATED and do an extra copy-on-read operation. And for a stream base
it's safer to consider the block ALLOCATED (and again do extra copying
rather than miss something significant).


I think, to avoid adding to the mess, we should first split [1] from
[2] somehow.
Assume we change it to BDRV_BLOCK_PROTOCOL_ALLOCATED and
BDRV_BLOCK_GO_TO_BACKING.

Maybe it is indeed worth splitting out two different flags to fully
distinguish between the two overloaded meanings, but that seems like an
independent patch to this series.


Then, for BDRV_BLOCK_PROTOCOL_ALLOCATED, we can probably just report
BDRV_BLOCK_PROTOCOL_ALLOCATED if at least one of the extents reports
BDRV_BLOCK_PROTOCOL_ALLOCATED. (Probably we don't need
BDRV_BLOCK_PROTOCOL_ALLOCATED at all and can drop this logic.)

But for BDRV_BLOCK_GO_TO_BACKING we'll have to also add
BDRV_BLOCK_GO_TO_BACKING_VALID and report

 * BDRV_BLOCK_GO_TO_BACKING | BDRV_BLOCK_GO_TO_BACKING_VALID if all
extents report BDRV_BLOCK_GO_TO_BACKING
 * BDRV_BLOCK_GO_TO_BACKING_VALID if all extents report no
BDRV_BLOCK_GO_TO_BACKING
 * <nothing> if some extents report BDRV_BLOCK_GO_TO_BACKING but others
do not.


Hmm.. And I think there is a problem in the NBD protocol itself. Actually,
with the allocation-depth context we started to report internal layers of
the backing chain. And if we have layers with different request_alignment,
it's not correct to report allocation-depth "aligned" to the top image's
request_alignment. So, for allocation-depth to work correctly, we should
extend the NBD protocol to allow unaligned chunks in the BLOCK_STATUS report.

The NBD protocol says that base:allocation must obey allocation rules.
If we want to declare that "because qemu:allocation-depth is an
extension, we choose to NOT obey allocation rules, and if your client
connects to our extension, it MUST be prepared for what would normally
be non-compliant responses to NBD_CMD_BLOCK_STATUS", then we are free to
do so (it is our extension, after all).  Particularly since the only way
I could actually trigger it was with blkdebug (most format layers
support byte-level access, even when layered on top of a protocol layer
with a 512 or 4k minimum byte access).

Hmm, the NBD spec says in the description of NBD_CMD_BLOCK_STATUS:

The server SHOULD use descriptor lengths that are ..., and MUST use descriptor 
lengths that are an integer multiple of any advertised minimum block size.


So if you think it is better for me to respin the patch to fix ONLY
base:allocation bugs, but not qemu:allocation-depth, and instead merely
document the issue there, I could be persuaded to do so.


So, finally:

1. Making the bitmap export extents aligned is a separate, simple thing -
that's OK.

2. Making base:allocation aligned is possible thanks to the good
NBD_STATE_HOLE definition. So for it, it's correct to combine
BDRV_BLOCK_ALLOCATED as you do, even keeping format layers in mind. We call
block_status_above for the whole chain. ALLOCATED=0 only if all format
layers refer to backing and the bottom protocol driver (if it exists)
reports "unallocated in the FS", which corresponds to

"If an extent is marked with NBD_STATE_HOLE at that context, this means
that the given extent is not allocated in the backend storage, and that
writing to the extent MAY result in the NBD_ENOSPC error"

    And this way, I'd prefer to fix exactly the base:allocation context
handling in the NBD server, without touching the generic block_status, which
already has enough mess.

3. Fixing qemu:allocation-depth is, I think, impossible without allowing
unaligned chunks in the NBD spec (really, why should we restrict all
possible metadata contexts so hard?). Or, if we are still going to align
allocation-depth results to the top layer's request_alignment, we should
change the allocation-depth specification so that it's not the "actual
allocation depth" but something >= the actual allocation depth of all
subchunks... And that becomes useless.




--
Best regards,
Vladimir


