qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH V3 5/6] block/rbd: add write zeroes support


From: Peter Lieven
Subject: Re: [PATCH V3 5/6] block/rbd: add write zeroes support
Date: Sun, 27 Jun 2021 21:29:32 +0200


> Am 26.06.2021 um 17:57 schrieb Ilya Dryomov <idryomov@gmail.com>:
> 
> On Mon, Jun 21, 2021 at 10:49 AM Peter Lieven <pl@kamp.de> wrote:
>> 
>>> Am 18.06.21 um 12:34 schrieb Ilya Dryomov:
>>> On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven <pl@kamp.de> wrote:
>>>> Am 16.06.21 um 14:34 schrieb Ilya Dryomov:
>>>>> On Wed, May 19, 2021 at 4:28 PM Peter Lieven <pl@kamp.de> wrote:
>>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>>>> ---
>>>>>>  block/rbd.c | 37 ++++++++++++++++++++++++++++++++++++-
>>>>>>  1 file changed, 36 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>> index 0d8612a988..ee13f08a74 100644
>>>>>> --- a/block/rbd.c
>>>>>> +++ b/block/rbd.c
>>>>>> @@ -63,7 +63,8 @@ typedef enum {
>>>>>>      RBD_AIO_READ,
>>>>>>      RBD_AIO_WRITE,
>>>>>>      RBD_AIO_DISCARD,
>>>>>> -    RBD_AIO_FLUSH
>>>>>> +    RBD_AIO_FLUSH,
>>>>>> +    RBD_AIO_WRITE_ZEROES
>>>>>>  } RBDAIOCmd;
>>>>>> 
>>>>>>  typedef struct BDRVRBDState {
>>>>>> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, 
>>>>>> QDict *options, int flags,
>>>>>>          }
>>>>>>      }
>>>>>> 
>>>>>> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>>>>>> +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>>>>> I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
>>>>> does not really have a notion of non-efficient explicit zeroing.
>>>> 
>>>> This is only true if thick provisioning is supported which is in Octopus 
>>>> onwards, right?
>>> Since Pacific, I think.
>>> 
>>>> So it would only be correct to set this if thick provisioning is supported 
>>>> otherwise we could
>>>> 
>>>> fail with ENOTSUP and then qemu emulates the zeroing with plain writes.
>>> I actually had a question about that.  Why are you returning ENOTSUP
>>> in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
>>> because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?
>>> 
>>> My understanding has always been that BDRV_REQ_MAY_UNMAP is just
>>> a hint.  Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
>>> but should be perfectly acceptable.  It is certainly better than
>>> returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
>>> zeroing.
>> 
>> 
>> I think this was introduced to support different provisioning modes. If 
>> BDRV_REQ_MAY_UNMAP is not set
>> 
>> the caller of bdrv_write_zeroes expects that the driver does thick 
>> provisioning. If the driver cannot handle that (efficiently)
>> 
>> qemu does a plain zero write.
>> 
>> 
>> I am still not fully understanding the meaning of the BDRV_REQ_NO_FALLBACK 
>> flag. The original commit states that it was introduced for qemu-img to 
>> efficiently
>> 
>> zero out the target and avoid the slow fallback. When I last worked on 
>> qemu-img convert I remember that there was a call to zero out the target if 
>> bdrv_has_zero_init
>> 
>> is not 1. It seems hat meanwhile a target_is_zero cmdline switch for 
>> qemu-img convert was added to let the user assert that a preexisting target 
>> is zero.
>> 
>> Maybe someone can help here if it would be right to set BDRV_REQ_NO_FALLBACK 
>> for rbd in either of the 2 cases (thick provisioning is support or not)?
> 
> Since no one spoke up I think we should
> 
> a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
>   and as a consequence always unmap if librbd is too old
> 
>   It's not clear what qemu's expectation is but in general Write
>   Zeroes is allowed to unmap.  The only guarantee is that subsequent
>   reads return zeroes, everything else is a hint.  This is how it is
>   specified in the kernel and in the NVMe spec.
> 
>   In particular, block/nvme.c implements it as follows:
> 
>   if (flags & BDRV_REQ_MAY_UNMAP) {
>       cdw12 |= (1 << 25);
>   }
> 
>   This sets the Deallocate bit.  But if it's not set, the device may
>   still deallocate:
> 
>   """
>   If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
>   command, and the namespace supports clearing all bytes to 0h in the
>   values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
>   from a deallocated logical block and its metadata (excluding
>   protection information), then for each specified logical block, the
>   controller:
>   - should deallocate that logical block;
> 
>   ...
> 
>   If the Deallocate bit is cleared to '0' in a Write Zeroes command,
>   and the namespace supports clearing all bytes to 0h in the values
>   read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
>   a deallocated logical block and its metadata (excluding protection
>   information), then, for each specified logical block, the
>   controller:
>   - may deallocate that logical block;
>   """
> 
>   
> https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
> 
> b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
> 
>   Again, it's not clear what qemu expects here, but without it we end
>   up in a ridiculous situation where specifying the "don't allow slow
>   fallback" switch immediately fails all efficient zeroing requests on
>   a device where Write Zeroes is always efficient:
> 
>   $ qemu-io -c 'help write' | grep -- '-[zun]'
>    -n, -- with -z, don't allow slow fallback
>    -u, -- with -z, allow unmapping
>    -z, -- write zeroes using blk_co_pwrite_zeroes
> 
>   $ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
>   write failed: Operation not supported

Agreed,
I will try to send a V4 early this week.

Peter





reply via email to

[Prev in Thread] Current Thread [Next in Thread]