qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Na


From: Klaus Jensen
Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
Date: Tue, 29 Sep 2020 18:36:47 +0200

On Sep 29 15:43, Dmitry Fomichev wrote:
> > -----Original Message-----
> > From: Qemu-block <qemu-block-
> > bounces+dmitry.fomichev=wdc.com@nongnu.org> On Behalf Of Klaus
> > Jensen
> > Sent: Tuesday, September 29, 2020 6:47 AM
> > To: Damien Le Moal <Damien.LeMoal@wdc.com>
> > Cc: Fam Zheng <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> > block@nongnu.org; Niklas Cassel <Niklas.Cassel@wdc.com>; Klaus Jensen
> > <k.jensen@samsung.com>; qemu-devel@nongnu.org; Alistair Francis
> > <Alistair.Francis@wdc.com>; Keith Busch <kbusch@kernel.org>; Philippe
> > Mathieu-Daudé <philmd@redhat.com>; Matias Bjorling
> > <Matias.Bjorling@wdc.com>
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > and Zoned Namespace Command Set
> > 
> > On Sep 28 22:54, Damien Le Moal wrote:
> > > On 2020/09/29 6:25, Keith Busch wrote:
> > > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > > >>> You are making it sound like the entire WDC series relies on this
> > approach.
> > > >>> Actually, the persistency is introduced in the second to last patch 
> > > >>> in the
> > > >>> series and it only adds a couple of lines of code in the i/o path to 
> > > >>> mark
> > > >>> zones dirty. This is possible because of using mmap() and I find the 
> > > >>> way
> > > >>> it is done to be quite elegant, not ugly :)
> > > >>>
> > > >>
> > > >> No, I understand that your implementation works fine without
> > > >> persistance, but persistance is key. That is why my series adds it in
> > > >> the first patch. Without persistence it is just a toy. And the QEMU
> > > >> device is not just an "NVMe-version" of null_blk.
> > > >
> > > > I really think we should be a bit more cautious of commiting to an
> > > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > > state feels a bit ad-hoc, and with all the other knobs provided, it
> > > > looks too easy to have out-of-sync states, or just not being able to
> > > > boot at all if a qemu versions have different on-disk formats.
> > > >
> > > > Is anyone really considering zone emulation for production level stuff
> > > > anyway? I can't imagine a real scenario where you'd want put yourself
> > > > through that: you are just giving yourself all the downsides of a zoned
> > > > block device and none of the benefits. AFAIK, this is provided as a
> > > > development vehicle, closer to a "toy".
> > > >
> > > > I think we should consider trimming this down to a more minimal set that
> > > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > > bells & whistles and flush out the meta data's data marshalling scheme
> > > > for persistence later.
> > >
> > > +1 on this. Removing the persistence also removes the debate on
> > endianess. With
> > > that out of the way, it should be straightforward to get agreement on a
> > series
> > > that can be merged quickly to get developers started with testing ZNS
> > software
> > > with QEMU. That is the most important goal here. 5.9 is around the corner,
> > we
> > > need something for people to get started with ZNS quickly.
> > >
> > 
> > Wait. What. No. Stop!
> > 
> > It is unmistakably clear that you are invalidating my arguments about
> > portability and endianness issues by suggesting that we just remove
> > persistent state and deal with it later, but persistence is the killer
> > feature that sets the QEMU emulated device apart from other emulation
> > options. It is not about using emulation in production (because yeah,
> > why would you?), but persistence is what makes it possible to develop
> > and test "zoned FTLs" or something that requires recovery at power up.
> > This is what allows testing of how your host software deals with opened
> > zones being transitioned to FULL on power up and the persistent tracking
> > of LBA allocation (in my series) can be used to properly test error
> > recovery if you lost state in the app.
> > 
> > Please, work with me on this instead of just removing such an essential
> > feature. Since persistence seems to be the only thing we are really
> > discussing, we should have plenty of time until the soft-freeze to come
> > up with a proper solution on that.
> > 
> > I agree that my version had a format that was pretty ad-hoc and that
> > won't fly - it needs magic and version capabilities like in Dmitry's
> > series, which incidentially looks a lot like what we did in the
> > OpenChannel implementation, so I agree with the strategy.
> 
> Are you insinuating that I somehow took stuff from OCSSD code and try
> to claim priority this way? I am not at all that familiar with that code.
> And I've already sent you the link to tcmu-runner code that served me
> as an inspiration for implementing persistence in WDC patchset.
> That code has been around for years, uses mmap, works great and has
> nothing to do with you.
> 

No. I am not insinuating anything. The OpenChannel device also used a
blockdev, but, yes, incidentially (and sorry, I should not have used
that word), it looked like how we did it there and I noted that I agreed
with the strategy.

> > 
> > ZNS-wise, the only thing my implementation stores is the zone
> > descriptors (in spec-native little-endian format) and the zone
> > descriptor extensions. So there are no endian issues with those. The
> > allocation tracking bitmap is always stored in little endian, but
> > converted to big-endian if running on a big-endian host.
> > 
> > Let me just conjure something up.
> > 
> >     #define NVME_PSTATE_MAGIC ...
> >     #define NVME_PSTATE_V1    1
> > 
> >     typedef struct NvmePstateHeader {
> >         uint32_t magic;
> >         uint32_t version;
> > 
> >         uint64_t blk_len;
> > 
> >         uint8_t  lbads;
> >         uint8_t  iocs;
> > 
> >         uint8_t  rsvd18[3054];
> > 
> >         struct {
> >             uint64_t zsze;
> >             uint8_t  zdes;
> >         } QEMU_PACKED zns;
> > 
> >         uint8_t  rsvd3089[1007];
> >     } QEMU_PACKED NvmePstateHeader;
> > 
> 
> Why conjure something that already exists in WDC patchset? And that part
> has been published in the very first version of our patches, weeks before
> your entire ZNS series was posted. Add an rsvd[] here and there and that code
> can be as suitable to achieve the stated goals as what you have above.
> 

Yes, I read your code. I know you have a header and I also noted above
that "it needs magic and version capabilities like in Dmitry's series".

> > series,
> > With such a header we have all we need. We can bail out if any
> > parameters do not match and similar to nvme data structures it contains
> > reserved areas for future use. I'll be posting a v2 with this. If this
> > still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
> > feature.
> > 
> > I can agree that we drop other optional features like zone excursions
> > and the reset/finish recommended limit simulation, but PLEASE DO NOT
> > remove persistence and upstream a half-baked version when we are so
> > close and have time to get it right.
> 
> One important thing IMO is to reduce future need for metadata versioning.
> This requires a really good effort to design and review the proper metadata
> format that would become stable for some time. Think about portability.
> To get out something "conjured up" now and then have to move to V2
> metadata in the next release is even worse than no persistence at all.
> So maybe is makes sense to go with Keith's suggestion.

As I've said, we have time until the soft-freeze to get this right. I
"conjured" something up to show a point. The reason we review and
iterate is to NOT upstream something thats is conjured up.

But we gotta start somewhere, no? So what is so bad about me posting a
v2?

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]