[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH V3 5/6] block/rbd: add write zeroes support
From: |
Peter Lieven |
Subject: |
Re: [PATCH V3 5/6] block/rbd: add write zeroes support |
Date: |
Sun, 27 Jun 2021 21:29:32 +0200 |
> Am 26.06.2021 um 17:57 schrieb Ilya Dryomov <idryomov@gmail.com>:
>
> On Mon, Jun 21, 2021 at 10:49 AM Peter Lieven <pl@kamp.de> wrote:
>>
>>> Am 18.06.21 um 12:34 schrieb Ilya Dryomov:
>>> On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven <pl@kamp.de> wrote:
>>>> Am 16.06.21 um 14:34 schrieb Ilya Dryomov:
>>>>> On Wed, May 19, 2021 at 4:28 PM Peter Lieven <pl@kamp.de> wrote:
>>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>>>> ---
>>>>>> block/rbd.c | 37 ++++++++++++++++++++++++++++++++++++-
>>>>>> 1 file changed, 36 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>> index 0d8612a988..ee13f08a74 100644
>>>>>> --- a/block/rbd.c
>>>>>> +++ b/block/rbd.c
>>>>>> @@ -63,7 +63,8 @@ typedef enum {
>>>>>> RBD_AIO_READ,
>>>>>> RBD_AIO_WRITE,
>>>>>> RBD_AIO_DISCARD,
>>>>>> - RBD_AIO_FLUSH
>>>>>> + RBD_AIO_FLUSH,
>>>>>> + RBD_AIO_WRITE_ZEROES
>>>>>> } RBDAIOCmd;
>>>>>>
>>>>>> typedef struct BDRVRBDState {
>>>>>> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs,
>>>>>> QDict *options, int flags,
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>>>>>> + bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>>>>> I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
>>>>> does not really have a notion of non-efficient explicit zeroing.
>>>>
>>>> This is only true if thick provisioning is supported which is in Octopus
>>>> onwards, right?
>>> Since Pacific, I think.
>>>
>>>> So it would only be correct to set this if thick provisioning is supported
>>>> otherwise we could
>>>>
>>>> fail with ENOTSUP and then qemu emulates the zeroing with plain writes.
>>> I actually had a question about that. Why are you returning ENOTSUP
>>> in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
>>> because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?
>>>
>>> My understanding has always been that BDRV_REQ_MAY_UNMAP is just
>>> a hint. Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
>>> but should be perfectly acceptable. It is certainly better than
>>> returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
>>> zeroing.
>>
>>
>> I think this was introduced to support different provisioning modes. If
>> BDRV_REQ_MAY_UNMAP is not set
>>
>> the caller of bdrv_write_zeroes expects that the driver does thick
>> provisioning. If the driver cannot handle that (efficiently)
>>
>> qemu does a plain zero write.
>>
>>
>> I am still not fully understanding the meaning of the BDRV_REQ_NO_FALLBACK
>> flag. The original commit states that it was introduced for qemu-img to
>> efficiently
>>
>> zero out the target and avoid the slow fallback. When I last worked on
>> qemu-img convert I remember that there was a call to zero out the target if
>> bdrv_has_zero_init
>>
>> is not 1. It seems hat meanwhile a target_is_zero cmdline switch for
>> qemu-img convert was added to let the user assert that a preexisting target
>> is zero.
>>
>> Maybe someone can help here if it would be right to set BDRV_REQ_NO_FALLBACK
>> for rbd in either of the 2 cases (thick provisioning is support or not)?
>
> Since no one spoke up I think we should
>
> a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
> and as a consequence always unmap if librbd is too old
>
> It's not clear what qemu's expectation is but in general Write
> Zeroes is allowed to unmap. The only guarantee is that subsequent
> reads return zeroes, everything else is a hint. This is how it is
> specified in the kernel and in the NVMe spec.
>
> In particular, block/nvme.c implements it as follows:
>
> if (flags & BDRV_REQ_MAY_UNMAP) {
> cdw12 |= (1 << 25);
> }
>
> This sets the Deallocate bit. But if it's not set, the device may
> still deallocate:
>
> """
> If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
> command, and the namespace supports clearing all bytes to 0h in the
> values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
> from a deallocated logical block and its metadata (excluding
> protection information), then for each specified logical block, the
> controller:
> - should deallocate that logical block;
>
> ...
>
> If the Deallocate bit is cleared to '0' in a Write Zeroes command,
> and the namespace supports clearing all bytes to 0h in the values
> read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
> a deallocated logical block and its metadata (excluding protection
> information), then, for each specified logical block, the
> controller:
> - may deallocate that logical block;
> """
>
>
> https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
>
> b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
>
> Again, it's not clear what qemu expects here, but without it we end
> up in a ridiculous situation where specifying the "don't allow slow
> fallback" switch immediately fails all efficient zeroing requests on
> a device where Write Zeroes is always efficient:
>
> $ qemu-io -c 'help write' | grep -- '-[zun]'
> -n, -- with -z, don't allow slow fallback
> -u, -- with -z, allow unmapping
> -z, -- write zeroes using blk_co_pwrite_zeroes
>
> $ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
> write failed: Operation not supported
Agreed,
I will try to send a V4 early this week.
Peter