qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC v2] Adding block layer APIs resembling Linux ZoneBlockDevice io


From: Stefan Hajnoczi
Subject: Re: [RFC v2] Adding block layer APIs resembling Linux ZoneBlockDevice ioctls.
Date: Mon, 20 Jun 2022 08:55:05 +0100

On Mon, Jun 20, 2022 at 11:36:11AM +0800, Sam Li wrote:

Hi Sam,
Is this version 2 of "[RFC v1] Add support for zoned device"? Please
keep the email subject line the same (except for "v2", "v3", etc) so
that it's clear which patch series this new version replaces.

> Fix some mistakes before. It can report a range of zones now.

This looks like the description of what changed compared to v1. Please
put the changelog below "---" in the future. When patch emails are
merged by git-am(1) it keeps the text above "---" and discards the text
below "---". The changelog is usually no longer useful once the patches
are merged, so it should be located below the "---" line.

The text above the "---" is the commit description (an explanation of
why this commit is necessary). In this case the commit description
should explain that this patch adds .bdrv_co_zone_report() and
.bdrv_co_zone_mgmt() to BlockDriver so that zoned block devices can be
supported.

> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> ---
>  block/block-backend.c             |  22 ++++
>  block/coroutines.h                |   5 +
>  block/file-posix.c                | 182 ++++++++++++++++++++++++++++++
>  block/io.c                        |  23 ++++
>  include/block/block-common.h      |  43 ++++++-
>  include/block/block-io.h          |  13 +++
>  include/block/block_int-common.h  |  20 ++++
>  qemu-io-cmds.c                    | 118 +++++++++++++++++++
>  tests/qemu-iotests/tests/zoned.sh |  52 +++++++++
>  9 files changed, 477 insertions(+), 1 deletion(-)
>  create mode 100644 tests/qemu-iotests/tests/zoned.sh
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index e0e1aff4b1..20248e4a35 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -104,6 +104,8 @@ typedef struct BlockBackendAIOCB {
>      int ret;
>  } BlockBackendAIOCB;
>  
> +
> +

Please avoid whitespace changes in code that is otherwise untouched by
your patch. Code changes can cause merge conflicts and they make it
harder to use git-annotate(1), so only changes that are necessary should
be included in a patch.

>  static const AIOCBInfo block_backend_aiocb_info = {
>      .get_aio_context = blk_aiocb_get_aio_context,
>      .aiocb_size = sizeof(BlockBackendAIOCB),
> @@ -1810,6 +1812,25 @@ int blk_flush(BlockBackend *blk)
>      return ret;
>  }
>  

Please add a documentation comment for blk_co_zone_report() that
explains how to use the functions and the purpose of the arguments. For
example, does offset have to be the first byte in a zone or can it be
any byte offset? What are the alignment requirements of offset and len?
Why is nr_zones a pointer?

> +int blk_co_zone_report(BlockBackend *blk, int64_t offset, int64_t len,

Functions that run in coroutine context must be labeled with
coroutine_fn:

    int coroutine_fn blk_co_zone_report(...)

This tells humans and tools that the function can only be called from a
coroutine. There is a blog post about coroutines in QEMU here:
https://blog.vmsplice.net/2014/01/coroutines-in-qemu-basics.html

> +                       int64_t *nr_zones,
> +                       struct BlockZoneDescriptor *zones)

QEMU coding style uses typedefs when defining structs, so "struct
BlockZoneDescriptor *zones" should be written as "BlockZoneDescriptor
*zones".

> +{
> +    int ret;

This function is called from the I/O code path, please mark it with:

  IO_CODE();

From include/block/block-io.h:

  * I/O API functions. These functions are thread-safe, and therefore
  * can run in any thread as long as the thread has called
  * aio_context_acquire/release().
  *
  * These functions can only call functions from I/O and Common categories,
  * but can be invoked by GS, "I/O or GS" and I/O APIs.
  *
  * All functions in this category must use the macro
  * IO_CODE();
  * to catch when they are accidentally called by the wrong API.

> +    ret = bdrv_co_zone_report(blk->root->bs, offset, len, nr_zones, zones);

Please add blk_inc_in_flight(blk) and blk_dec_in_flight(blk) around this
function call to ensure that zone report requests finish before I/O is
drained (see bdrv_drained_begin()). This is necessary so that it's
possible to wait for I/O requests, including zone report, to complete.

Similar to blk_co_do_preadv() we need blk_wait_while_drained(blk),
blk_check_byte_request(), and bdrv_inc_in_flight(bs) before calling
bdrv_co_zone_report(). bdrv_dec_in_flight(bs) needs to be called after
bdrv_co_zone_report() returns.

> +    return ret;
> +}
> +
> +int blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
> +        int64_t offset, int64_t len)
> +{
> +    int ret;
> +    ret = bdrv_co_zone_mgmt(blk->root->bs, op, offset, len);
> +
> +    return ret;
> +}

The same applies to this function.

> +
> +
>  void blk_drain(BlockBackend *blk)
>  {
>      BlockDriverState *bs = blk_bs(blk);
> @@ -2634,3 +2655,4 @@ int blk_make_empty(BlockBackend *blk, Error **errp)
>  
>      return bdrv_make_empty(blk->root, errp);
>  }
> +

Unrelated whitespace change.

> diff --git a/block/coroutines.h b/block/coroutines.h
> index 830ecaa733..247326255f 100644
> --- a/block/coroutines.h
> +++ b/block/coroutines.h
> @@ -80,6 +80,11 @@ int coroutine_fn
>  blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
>  
>  int coroutine_fn blk_co_do_flush(BlockBackend *blk);
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +                                    int64_t len, int64_t *nr_zones,
> +                                    struct BlockZoneDescriptor *zones);
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
> +        int64_t offset, int64_t len);
>  
>  
>  /*
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 48cd096624..71fd21f8ba 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -178,6 +178,137 @@ typedef struct BDRVRawReopenState {
>      bool check_cache_dropped;
>  } BDRVRawReopenState;
>  
> +/*
> + * parse_zone - Fill a zone descriptor
> + */
> +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> +        struct blk_zone *blkz) {
> +    zone->start = blkz->start << BDRV_SECTOR_BITS;
> +    zone->length = blkz->len << BDRV_SECTOR_BITS;
> +    zone->cap = blkz->capacity << BDRV_SECTOR_BITS;
> +    zone->wp = blkz->wp << BDRV_SECTOR_BITS;
> +    zone->type = blkz->type;
> +    zone->cond = blkz->type;

Should this be "zone->cond = blkz->cond"?

> +}
> +
> +/**
> + * zone report - Get a zone block device's information in the
> + * form of an array of zone descriptors.
> + *
> + * @param bs: passing zone block device file descriptor
> + * @param zones: Space to hold zone information on reply
> + * @param offset: the location in the zone block device
> + * @return 0 on success, -1 on failure
> + */
> +static int raw_co_zone_report(BlockDriverState *bs, int64_t offset, int64_t 
> len,

coroutine_fn

> +                              int64_t *nr_zones,
> +                              struct BlockZoneDescriptor *zones) {
> +    BDRVRawState *s = bs->opaque;
> +    struct blk_zone_report *rep;
> +    struct BlockZoneDescriptor bzd;
> +    struct blk_zone *blkz;
> +    int64_t zone_size_mask, end, rep_size, nrz;
> +    int ret, n = 0, i = 0;
> +
> +    printf("%s\n", "start to report zone");

This looks like debug output. QEMU has a tracing system that you can
use. See docs/devel/tracing.rst.

> +    nrz = *nr_zones;
> +    rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> blk_zone);
> +    rep = (struct blk_zone_report *)malloc(rep_size);

Please use g_autofree and g_new(). QEMU avoids direct use of malloc(3)/free(3).

> +    if (!rep) {
> +        return -1;
> +    }
> +
> +    zone_size_mask = zone_start_sector - 1;
> +    /* align up */
> +    end = ((offset + len + zone_size_mask) & (~zone_size_mask))
> +            >> BDRV_SECTOR_BITS;

More readable:

  end = DIV_ROUND_UP(offset + len, BDRV_SECTOR_SIZE);

> +
> +    blkz = (struct blk_zone *)(rep + 1);
> +    while (offset < end) {
> +        memset(rep, 0, rep_size);
> +        rep->sector = offset;
> +        rep->nr_zones = nrz;
> +
> +        ret = ioctl(s->fd, BLKREPORTZONE, rep);

Can this ioctl() block? It seems likely. If yes, then the call needs to
be made from the thread pool to avoid blocking the current thread. See
raw_thread_pool_submit().

> +        if (ret != 0) {
> +            ret = -errno;
> +            error_report("%d: ioctl BLKREPORTZONE at %ld failed %d",
> +                         s->fd, offset, errno);
> +            free(rep);
> +            return ret;
> +        }
> +
> +        if (!rep->nr_zones) {
> +            break;
> +        }
> +
> +        for (i = 0; i < rep->nr_zones; i++) {
> +            parse_zone(&bzd, &blkz[i]);
> +            if (zones) {
> +                memcpy(&zones[n], &bzd, sizeof(bzd));

n is never incremented so this always overwrites the first element.

> +            }
> +        }
> +
> +        offset = blkz[i].start + blkz[i].len;
> +    }
> +

Does this function need to update *nr_zones = n before returning? How does
the caller know how many zones were returned?

> +    return ret;
> +}
> +
> +/**
> + * zone management operations - Execute an operation on a zone
> + */
> +static int raw_co_zone_mgmt(BlockDriverState *bs, enum zone_op op,
> +        int64_t offset, int64_t len) {
> +    BDRVRawState *s = bs->opaque;
> +    struct blk_zone_range range;
> +    const char *ioctl_name;
> +    uint64_t ioctl_op;

ioctl()'s second argument is unsigned long request. Please use unsigned
long to keep the types consistent.

> +    int64_t zone_size_mask, end;
> +    int ret;
> +
> +    zone_size_mask = zone_start_sector - 1;
> +    /* align up */
> +    end = ((offset + len + zone_size_mask) & (~zone_size_mask))
> +            >> BDRV_SECTOR_BITS;
> +    offset = (offset & (~zone_size_mask)) >> BDRV_SECTOR_BITS; /* align down 
> */
> +
> +    switch (op) {
> +    case zone_open:
> +        ioctl_name = "BLKOPENZONE";
> +        ioctl_op = BLKOPENZONE;
> +        break;
> +    case zone_close:
> +        ioctl_name = "BLKCLOSEZONE";
> +        ioctl_op = BLKCLOSEZONE;
> +        break;
> +    case zone_finish:
> +        ioctl_name = "BLKFINISHZONE";
> +        ioctl_op = BLKFINISHZONE;
> +        break;
> +    case zone_reset:
> +        ioctl_name = "BLKRESETZONE";
> +        ioctl_op = BLKRESETZONE;
> +        break;
> +    default:
> +        error_report("Invalid zone operation 0x%x", op);
> +        errno = -EINVAL;
> +        return -1;
> +    }
> +
> +    /* Execute the operation */
> +    range.sector = offset;
> +    range.nr_sectors = end - offset;
> +    ret = ioctl(s->fd, ioctl_op, &range);
> +    if (ret != 0) {
> +        error_report("ioctl %s failed %d",
> +                         ioctl_name, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>  static int fd_open(BlockDriverState *bs)
>  {
>      BDRVRawState *s = bs->opaque;
> @@ -3324,6 +3455,9 @@ BlockDriver bdrv_file = {
>      .bdrv_abort_perm_update = raw_abort_perm_update,
>      .create_opts = &raw_create_opts,
>      .mutable_opts = mutable_opts,
> +
> +    .bdrv_co_zone_report = raw_co_zone_report,
> +    .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
>  };
>  
>  /***********************************************/
> @@ -3703,6 +3837,53 @@ static BlockDriver bdrv_host_device = {
>  #endif
>  };
>  
> +static BlockDriver bdrv_zoned_host_device = {
> +        .format_name = "zoned_host_device",
> +        .protocol_name = "zoned_host_device",
> +        .instance_size = sizeof(BDRVRawState),
> +        .bdrv_needs_filename = true,
> +        .bdrv_probe_device  = hdev_probe_device,
> +        .bdrv_parse_filename = hdev_parse_filename,
> +        .bdrv_file_open     = hdev_open,
> +        .bdrv_close         = raw_close,
> +        .bdrv_reopen_prepare = raw_reopen_prepare,
> +        .bdrv_reopen_commit  = raw_reopen_commit,
> +        .bdrv_reopen_abort   = raw_reopen_abort,
> +        .bdrv_co_create_opts = bdrv_co_create_opts_simple,
> +        .create_opts         = &bdrv_create_opts_simple,
> +        .mutable_opts        = mutable_opts,
> +        .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
> +        .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes,
> +
> +        .bdrv_co_preadv         = raw_co_preadv,
> +        .bdrv_co_pwritev        = raw_co_pwritev,
> +        .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> +        .bdrv_co_pdiscard       = hdev_co_pdiscard,
> +        .bdrv_co_copy_range_from = raw_co_copy_range_from,
> +        .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> +        .bdrv_refresh_limits = raw_refresh_limits,
> +        .bdrv_io_plug = raw_aio_plug,
> +        .bdrv_io_unplug = raw_aio_unplug,
> +        .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> +
> +        .bdrv_co_truncate       = raw_co_truncate,
> +        .bdrv_getlength = raw_getlength,
> +        .bdrv_get_info = raw_get_info,
> +        .bdrv_get_allocated_file_size
> +                            = raw_get_allocated_file_size,
> +        .bdrv_get_specific_stats = hdev_get_specific_stats,
> +        .bdrv_check_perm = raw_check_perm,
> +        .bdrv_set_perm   = raw_set_perm,
> +        .bdrv_abort_perm_update = raw_abort_perm_update,
> +        .bdrv_probe_blocksizes = hdev_probe_blocksizes,
> +        .bdrv_probe_geometry = hdev_probe_geometry,
> +        .bdrv_co_ioctl = hdev_co_ioctl,
> +
> +        /* zone management operations */
> +        .bdrv_co_zone_report = raw_co_zone_report,
> +        .bdrv_co_zone_mgmt = raw_co_zone_mgmt,
> +};
> +
>  #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
>  static void cdrom_parse_filename(const char *filename, QDict *options,
>                                   Error **errp)
> @@ -3964,6 +4145,7 @@ static void bdrv_file_init(void)
>  #if defined(HAVE_HOST_BLOCK_DEVICE)
>      bdrv_register(&bdrv_host_device);
>  #ifdef __linux__
> +    bdrv_register(&bdrv_zoned_host_device);
>      bdrv_register(&bdrv_host_cdrom);
>  #endif
>  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
> diff --git a/block/io.c b/block/io.c
> index 789e6373d5..3e8bb83cc3 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3258,6 +3258,29 @@ out:
>      return co.ret;
>  }
>  
> +int bdrv_co_zone_report(BlockDriverState *bs,
> +                        int64_t offset, int64_t len, int64_t *nr_zones,
> +                        struct BlockZoneDescriptor *zones)
> +{
> +    int ret = 0;
> +    if (!bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones)) {
> +        return -ENOTSUP;
> +    }

This returns -ENOTSUP any time ->bdrv_co_zone_report() returns 0. Should
this be:

  if (!bs->drv->bdrv_co_zone_report) {
      return -ENOTSUP;
  }

  return bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones);

?

> +
> +    return ret;
> +}
> +
> +int bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op,
> +        int64_t offset, int64_t len)
> +{
> +    int ret = 0;
> +    if (!bs->drv->bdrv_co_zone_mgmt(bs, op, offset, len)) {
> +        return -ENOTSUP;
> +    }
> +
> +    return ret;
> +}
> +
>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>  {
>      IO_CODE();
> diff --git a/include/block/block-common.h b/include/block/block-common.h
> index fdb7306e78..24c468d8ad 100644
> --- a/include/block/block-common.h
> +++ b/include/block/block-common.h
> @@ -23,7 +23,7 @@
>   */
>  #ifndef BLOCK_COMMON_H
>  #define BLOCK_COMMON_H
> -
> +#include <linux/blkzoned.h>

Linux-specific code must be #ifdef __linux__ to avoid compilation errors
on non-Linux hosts.

>  #include "block/aio.h"
>  #include "block/aio-wait.h"
>  #include "qemu/iov.h"
> @@ -48,6 +48,47 @@
>  typedef struct BlockDriver BlockDriver;
>  typedef struct BdrvChild BdrvChild;
>  typedef struct BdrvChildClass BdrvChildClass;
> +enum zone_type {

Please use "typedef enum { ... } BdrvZoneType" so that enums follow
QEMU's coding style.

> +    BLK_ZT_CONV = BLK_ZONE_TYPE_CONVENTIONAL,
> +    BLK_ZT_SWR = BLK_ZONE_TYPE_SEQWRITE_REQ,
> +    BLK_ZT_SWP = BLK_ZONE_TYPE_SEQWRITE_PREF,
> +};
> +
> +enum zone_cond {
> +    BLK_ZS_NOT_WP = BLK_ZONE_COND_NOT_WP,
> +    BLK_ZS_EMPTY = BLK_ZONE_COND_EMPTY,
> +    BLK_ZS_IOPEN = BLK_ZONE_COND_IMP_OPEN,
> +    BLK_ZS_EOPEN = BLK_ZONE_COND_EXP_OPEN,
> +    BLK_ZS_CLOSED = BLK_ZONE_COND_CLOSED,
> +    BLK_ZS_RDONLY = BLK_ZONE_COND_READONLY,
> +    BLK_ZS_FULL = BLK_ZONE_COND_FULL,
> +    BLK_ZS_OFFLINE = BLK_ZONE_COND_OFFLINE,
> +};

This 1:1 correspondence with Linux constants could make the code a
little harder to port.

Maybe QEMU's block layer should define its own numeric constants so the
code doesn't rely on operating system-specific headers.
block/file-posix.c #ifdef __linux__ code can be responsible for
converting Linux-specific constants to QEMU constants (and the 1:1
mapping can be used there).

> +
> +enum zone_op {
> +    zone_open,
> +    zone_close,
> +    zone_finish,
> +    zone_reset,
> +};
> +
> +/*
> + * Zone descriptor data structure.
> + * Provide information on a zone with all position and size values in bytes.
> + */
> +typedef struct BlockZoneDescriptor {
> +    uint64_t start;
> +    uint64_t length;
> +    uint64_t cap;
> +    uint64_t wp;
> +    enum zone_type type;
> +    enum zone_cond cond;
> +} BlockZoneDescriptor;
> +
> +enum zone_model {
> +    BLK_Z_HM,
> +    BLK_Z_HA,
> +};
>  
>  typedef struct BlockDriverInfo {
>      /* in bytes, 0 if irrelevant */
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index 62c84f0519..deb8902684 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -80,6 +80,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void 
> *buf);
>  /* Ensure contents are flushed to disk.  */
>  int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
>  
> +/* Report zone information of zone block device. */
> +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                                     int64_t len, int64_t *nr_zones,
> +                                     struct BlockZoneDescriptor *zones);
> +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op,
> +        int64_t offset, int64_t len);
> +
>  int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
>  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
>  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
> @@ -290,6 +297,12 @@ bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector 
> *qiov, int64_t pos);
>  int generated_co_wrapper
>  bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
>  
> +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
> +                                         int64_t len, int64_t *nr_zones,
> +                                         struct BlockZoneDescriptor *zones);
> +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, enum zone_op op,
> +        int64_t offset, int64_t len);
> +
>  /**
>   * bdrv_parent_drained_begin_single:
>   *
> diff --git a/include/block/block_int-common.h 
> b/include/block/block_int-common.h
> index 8947abab76..4d0180a7da 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -63,6 +63,7 @@
>  #define BLOCK_OPT_EXTL2             "extended_l2"
>  
>  #define BLOCK_PROBE_BUF_SIZE        512
> +#define zone_start_sector           512
>  
>  enum BdrvTrackedRequestType {
>      BDRV_TRACKED_READ,
> @@ -94,6 +95,19 @@ typedef struct BdrvTrackedRequest {
>      struct BdrvTrackedRequest *waiting_for;
>  } BdrvTrackedRequest;
>  
> +/**
> + * Zone device information data structure.
> + * Provide information on a device.
> + */
> +typedef struct zbd_dev {
> +    enum zone_model model;
> +    uint32_t block_size;

How does this relate to QEMU's BlockLimits?

> +    uint32_t write_granularity;

Same. Maybe this belongs in BlockLimits?

> +    uint32_t nr_zones;

Should this really be limited to 32-bit? For example, take 256 MB zones,
then the max nr_zones * 256 MB is much smaller than a uint64_t capacity
value. It seems safer to make this 64-bit, but maybe Dmitry, Hannes, or
Damien can tell us what to do here.

> +    struct BlockZoneDescriptor *zones; /* array of zones */

When is this used?

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]