

From: Stefan Hajnoczi
Subject: Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
Date: Thu, 2 Jun 2022 09:05:39 +0100

On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilikerun@gmail.com> wrote:
>
> Hi Stefan,
>
> Stefan Hajnoczi <stefanha@gmail.com> wrote on Wed, Jun 1, 2022 at 19:43:
> >
> > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal
> > <damien.lemoal@opensource.wdc.com> wrote:
> > >
> > > On 6/1/22 11:57, Sam Li wrote:
> > > > Hi Stefan,
> > > >
> > > > Stefan Hajnoczi <stefanha@gmail.com> wrote on Mon, May 30, 2022 at 19:19:
> > > >
> > > >
> > > >>
> > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilikerun@gmail.com> wrote:
> > > >>>
> > > >>> Hi everyone,
> > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned
> > > >>> device support to QEMU's virtio-blk emulation.
> > > >>>
> > > >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD
> > > >>> ioctls, I think the naive approach would be to introduce a new stable
> > > >>> struct zbd_zone descriptor for the library function interface. More
> > > >>> specifically, what I'd like to add to the BlockDriver struct are:
> > > >>> 1. zbd_info as zone block device information: includes the number
> > > >>> of zones and the logical and physical block sizes.
> > > >>> 2. zbd_zone_type and zbd_zone_state
> > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd
> > > >>> With those basic structs, we can start to implement new functions as
> > > >>> bdrv_*() APIs for the BLK*ZONE ioctls.
> > > >>>
> > > >>> I'll start to finish this task based on the above description. If
> > > >>> there is any problem or something I may miss in the design, please let
> > > >>> me know.
> > > >>
> > > >> Hi Sam,
> > > >> Can you propose function prototypes for the new BlockDriver callbacks
> > > >> needed for zoned devices?
> > > >
> > > > I have made some modifications based on Damien's advice in design part
> > > > 1 and added the function prototypes in design part 2. If there is any
> > > > problem or part I missed, please let me know.
> > > >
> > > > Design of Block Layer APIs in BlockDriver:
> > > > 1. introduce a new stable struct zbd_zone descriptor for the library
> > > > function interface.
> > > >   a. zbd_info as zone block device information: includes the number
> > > > of zones, block size, and write granularity in bytes (minimal write
> > > > size and alignment)
> > > >     - write granularity: 512e SMR drives write in units of the
> > > > physical block size (4096 bytes); for NVMe ZNS, the write granularity
> > > > equals the block size.
> > > >     - zone descriptor: start, length, capacity, write pointer, zone type
> > > >   b. zbd_zone_type
> > > >     - zone type: conventional, sequential write required, sequential
> > > > write preferred
> > > >   c. zbd_dev_model: host-managed zbd, host-aware zbd
> > >
> > > This explanation is a little hard to understand. It seems to be mixing up
> > > device-level information and per-zone information. I think it would be a
> > > lot simpler to write a struct definition to directly illustrate what you
> > > are planning.
> > >
> > > Is it something like this?
> > >
> > > struct zbd_zone {
> > >         enum zone_type  type;
> > >         enum zone_cond  cond;
> > >         uint64_t        start;
> > >         uint32_t        length;
> > >         uint32_t        cap;
> > >         uint64_t        wp;
> > > };
> > >
> > > struct zbd_dev {
> > >         enum zone_model model;
> > >         uint32_t        block_size;
> > >         uint32_t        write_granularity;
> > >         uint32_t        nr_zones;
> > >         struct zbd_zone *zones; /* array of zones */
> > > };
> > >
> > > If yes, then my comments are as follows.
> > >
> > > For the device struct: It may be good to also have the maximum number of
> > > open zones and the maximum number of active zones.
> > >
> > > For the zone struct: You may need to add a read-write lock per zone to be
> > > able to write-lock zones to ensure a sequential write pattern (virtio
> > > devices can be multi-queue and so writes may be coming in from different
> > > contexts) and to correctly emulate zone append operations with an atomic
> > > update of the wp field.
> > >
> > > These need to be integrated into the generic block driver interface in
> > > include/block/block_int-common.h or include/block/block-common.h.
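> > >
> > > For illustration, those two additions could look like this (a sketch
> > > only; CoRwlock is QEMU's coroutine-aware read-write lock, and the limit
> > > field names are placeholders):
> > >
> > > struct zbd_zone {
> > >         enum zone_type  type;
> > >         enum zone_cond  cond;
> > >         uint64_t        start;
> > >         uint32_t        length;
> > >         uint32_t        cap;
> > >         uint64_t        wp;
> > >         CoRwlock        lock; /* write-lock the zone to keep writes
> > >                                  sequential and wp updates atomic */
> > > };
> > >
> > > struct zbd_dev {
> > >         enum zone_model model;
> > >         uint32_t        block_size;
> > >         uint32_t        write_granularity;
> > >         uint32_t        nr_zones;
> > >         uint32_t        max_open_zones;   /* device limit on open zones */
> > >         uint32_t        max_active_zones; /* device limit on active zones */
> > >         struct zbd_zone *zones; /* array of zones */
> > > };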
> >
> > QEMU's block layer has a few ways of exposing information about block 
> > devices:
> >
> >     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
> >     ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
> > Error **errp);
> >
> > These fetch information from the BlockDriver and are good when a small
> > amount of data is reported occasionally and consumed by the caller.
> >
> > For data that is continuously accessed or that could be large, it may
> > be necessary for the data to reside inside BlockDriverState so that it
> > can be accessed in place (without copying):
> >
> >     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);
> >
> > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that
> > is continuously accessed by the block layer while processing I/O
> > requests. The "refresh" function updates the data in case the
> > underlying storage device has changed somehow. If no update function
> > is necessary then data can simply be populated during .bdrv_open() and
> > no new BlockDriver callback needs to be added.
> >
> > So in the simplest case BlockDriverState can be extended with a struct
> > zbd_dev field that is populated during .bdrv_open(). If the
> > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0
> > or the model field indicates that this is not a zoned storage device.
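> >
> > A minimal sketch of probing that during .bdrv_open() on Linux
> > (BLKGETNRZONES/BLKGETZONESZ are real ioctls from <linux/blkzoned.h>;
> > where the result is stored is an assumption):
> >
> > #include <stdint.h>
> > #include <sys/ioctl.h>
> > #include <linux/blkzoned.h>
> >
> > /* Returns 0 on success; leaves *nr_zones == 0 for non-zoned devices. */
> > static int zbd_probe(int fd, uint32_t *nr_zones, uint32_t *zone_sectors)
> > {
> >     *nr_zones = 0;
> >     if (ioctl(fd, BLKGETNRZONES, nr_zones) < 0) {
> >         return 0; /* not zoned (or old kernel): treat as a regular device */
> >     }
> >     /* zone size is reported in 512-byte sectors */
> >     return ioctl(fd, BLKGETZONESZ, zone_sectors);
> > }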
> >
> > However, a BlockBackend (not BlockDriverState!) API will be needed to
> > expose this data to users like the hw/block/virtio-blk.c emulation
> > code or the qemu-io-cmds.c utility that can be used for testing. A
> > BlockBackend has a root pointer to a BlockDriverState graph (for
> > example, qcow2 on top of file-posix). It will be necessary to
> > propagate zoned storage information from the leaf BlockDriverState all
> > the way up to the BlockBackend. In simple cases the BB root points
> > directly to the file-posix BDS that has Linux ZBD support but the
> > design needs to account for additional BDS graph nodes.
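> >
> > One way to sketch that propagation (hypothetical helper; it assumes the
> > zoned BlockDriver callback ends up being named bdrv_co_zone_mgmt):
> >
> > /* Walk down the graph to the node that actually implements zones. */
> > static BlockDriverState *bdrv_find_zoned_bs(BlockDriverState *bs)
> > {
> >     while (bs && !bs->drv->bdrv_co_zone_mgmt) {
> >         bs = bs->file ? bs->file->bs : NULL;
> >     }
> >     return bs;
> > }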
>
> I think a simple way to think about the BlockBackend APIs is the
> following call chain: blk_aio_zone_mgmt() -> blk_aio_prwv() +
> blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt().
> The last function call will call bdrv_co_zone_mgmt() in
> block/file-posix.c (see the sketch below). If I understand the
> additional case correctly, the BlockBackend API can then expose the
> zone information to the virtio-blk emulation.
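>
> A rough sketch of that chain, following the existing blk_aio_flush()
> pattern in block-backend.c (BlkAioEmAIOCB, BlkRwCo, blk_aio_prwv() and
> blk_aio_complete() are real internals there; the zone_mgmt names and
> argument lists are placeholders):
>
> static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
> {
>     BlkAioEmAIOCB *acb = opaque;
>     BlkRwCo *rwco = &acb->rwco;
>
>     /* does the real work and ends up calling bdrv_co_zone_mgmt() */
>     rwco->ret = blk_co_do_zone_mgmt(rwco->blk, rwco->offset, rwco->iobuf);
>     blk_aio_complete(acb);
> }
>
> BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, int64_t offset,
>                               BlockCompletionFunc *cb, void *opaque)
> {
>     return blk_aio_prwv(blk, offset, 0, NULL, blk_aio_zone_mgmt_entry,
>                         0, cb, opaque);
> }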

Yes!

block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by
calling bdrv_co_zone_mgmt(bs->file, ...). This is because the
raw-format.c driver usually sits on top of file-posix.c and has to
pass through requests.
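
A minimal sketch of that passthrough (assuming a BlockZoneOp enum and an
offset/length interface; the final signature is still up for discussion):

static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
                                         int64_t offset, int64_t len)
{
    /* raw-format adds nothing here; forward to the underlying node */
    return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
}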

There are filter block drivers like block/throttle.c,
block/blkdebug.c, etc. (git grep is_filter block/) that will also need
to be modified to pass through requests in the same way.

Based on what I've read in Dmitry's virtio-blk spec proposal, the
BlockBackend API could look something like:

typedef struct { ...model, zone_sectors, max_open_zones, etc... } BlockZoneInfo;
void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info);

virtio-blk.c calls this to fill out configuration space fields and
determine whether the BlockBackend is a zoned device.
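
For instance (a sketch; the zoned config fields come from Dmitry's
proposal and the model constant is a placeholder):

BlockZoneInfo info;

blk_get_zone_info(s->blk, &info);
if (info.model != BLK_ZONE_MODEL_NONE) {
    /* advertise VIRTIO_BLK_F_ZONED in get_features() and fill config */
    virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors, info.zone_sectors);
    virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones, info.max_open_zones);
    virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones, info.max_active_zones);
}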

Then there are 3 commands that happen in the I/O code path:

typedef struct { ... } BlockZoneDescriptor;
BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
                                BlockZoneDescriptor *zones, size_t max_zones,
                                BlockCompletionFunc *cb, void *opaque);

typedef enum { ... } BlockZoneMgmtCmd;
BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset,
                                   BlockZoneMgmtCmd cmd, bool all,
                                   BlockCompletionFunc *cb, void *opaque);

typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret,
                                           int64_t new_wp);
BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset,
                                QEMUIOVector *qiov,
                                BlockZoneAppendCompletionFunc *cb,
                                void *opaque);
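
As a usage example, zone append reports back the position the device
actually wrote to (sketch; the callback and variable names are made up):

static void zone_append_cb(void *opaque, int ret, int64_t new_wp)
{
    int64_t *assigned_wp = opaque;

    /* on success, new_wp is where the data landed within the zone */
    *assigned_wp = ret < 0 ? ret : new_wp;
}

blk_aio_zone_append(blk, zone_start, &qiov, zone_append_cb, &assigned_wp);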

> Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c,
> which leads to include/block/block-io.h, we may need to consider the
> case of calling block layer APIs from a non-coroutine context.
> Meanwhile, using bdrv_co_writev()/bdrv_co_readv() instead of a
> read-write lock per zone may be an option too.

Yes, device emulation code usually uses the aio versions of the
BlockBackend I/O functions (read, write, flush). The QEMU block layer
runs the aio I/O request inside a coroutine and usually also exposes
coroutine versions of the same functions. For example, block jobs
(e.g. storage mirroring, backup, and migration background tasks)
usually call the coroutine versions of the BlockBackend APIs instead
of the aio ones.

qemu-io-cmds.c will want synchronous versions of the aio commands
(blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that
block until the command completes. This is because the qemu-io utility
typically executes one command at a time and is written mostly in a
blocking style rather than with async callbacks or coroutines.
docs/devel/block-coroutine-wrapper.rst describes how to generate
synchronous versions of coroutine functions.
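
Concretely, the declarations could use the generated_co_wrapper marker so
that scripts/block-coroutine-wrapper.py emits the blocking version (a
sketch; it assumes a coroutine_fn blk_co_zone_report() exists):

int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset,
                                         BlockZoneDescriptor *zones,
                                         size_t max_zones);
int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
                                    BlockZoneDescriptor *zones,
                                    size_t max_zones);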

Do you want to start implementing blk_get_zone_info()? This will
require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c)
functions.

Stefan


