qemu-devel
Re: [RFC] vhost-vdpa-net: add vhost-vdpa-net host device support


From: Jason Wang
Subject: Re: [RFC] vhost-vdpa-net: add vhost-vdpa-net host device support
Date: Mon, 20 Dec 2021 10:48:09 +0800

On Fri, Dec 17, 2021 at 4:35 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Dec 17, 2021 at 12:26:53PM +0800, Jason Wang wrote:
>
> Dave: You created the VIRTIO vmstate infrastructure in QEMU. Please see
> the bottom of this email about moving to a standard VIRTIO device
> save/load format defined by the VIRTIO spec in the future.
>
> > On Thu, Dec 16, 2021 at 5:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Dec 16, 2021 at 11:01:40AM +0800, Jason Wang wrote:
> > > > On Wed, Dec 15, 2021 at 6:07 PM Stefan Hajnoczi <stefanha@redhat.com> 
> > > > wrote:
> > > > >
> > > > > On Wed, Dec 15, 2021 at 11:18:05AM +0800, Jason Wang wrote:
> > > > > > On Tue, Dec 14, 2021 at 9:11 PM Stefan Hajnoczi 
> > > > > > <stefanha@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Dec 14, 2021 at 10:22:53AM +0800, Jason Wang wrote:
> > > > > > > > On Mon, Dec 13, 2021 at 11:14 PM Stefan Hajnoczi 
> > > > > > > > <stefanha@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Dec 13, 2021 at 10:47:00AM +0800, Jason Wang wrote:
> > > > > > > > > > On Sun, Dec 12, 2021 at 5:30 PM Michael S. Tsirkin 
> > > > > > > > > > <mst@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Dec 11, 2021 at 03:00:27AM +0000, Longpeng (Mike, 
> > > > > > > > > > > Cloud Infrastructure Service Product Dept.) wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> > > > > > > > > > > > > Sent: Thursday, December 9, 2021 5:17 PM
> > > > > > > > > > > > > To: Longpeng (Mike, Cloud Infrastructure Service 
> > > > > > > > > > > > > Product Dept.)
> > > > > > > > > > > > > <longpeng2@huawei.com>
> > > > > > > > > > > > > Cc: jasowang@redhat.com; mst@redhat.com; 
> > > > > > > > > > > > > parav@nvidia.com;
> > > > > > > > > > > > > xieyongji@bytedance.com; sgarzare@redhat.com; Yechuan 
> > > > > > > > > > > > > <yechuan@huawei.com>;
> > > > > > > > > > > > > Gonglei (Arei) <arei.gonglei@huawei.com>; 
> > > > > > > > > > > > > qemu-devel@nongnu.org
> > > > > > > > > > > > > Subject: Re: [RFC] vhost-vdpa-net: add vhost-vdpa-net 
> > > > > > > > > > > > > host device support
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Dec 08, 2021 at 01:20:10PM +0800, 
> > > > > > > > > > > > > Longpeng(Mike) wrote:
> > > > > > > > > > > > > > From: Longpeng <longpeng2@huawei.com>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi guys,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch introduces the vhost-vdpa-net device,
> > > > > > > > > > > > > > which is inspired by vhost-user-blk and the proposed
> > > > > > > > > > > > > > vhost-vdpa-blk device [1].
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've tested this patch on Huawei's offload card:
> > > > > > > > > > > > > > ./x86_64-softmmu/qemu-system-x86_64 \
> > > > > > > > > > > > > >     -device 
> > > > > > > > > > > > > > vhost-vdpa-net-pci,vdpa-dev=/dev/vhost-vdpa-0
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For virtio hardware offloading, the most important
> > > > > > > > > > > > > > requirement for us is to support live migration
> > > > > > > > > > > > > > between offloading cards from different vendors. The
> > > > > > > > > > > > > > combination of netdev and virtio-net seems too heavy;
> > > > > > > > > > > > > > we prefer a lightweight way.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Maybe we could support both in the future? Such as:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Lightweight
> > > > > > > > > > > > > >  Net: vhost-vdpa-net
> > > > > > > > > > > > > >  Storage: vhost-vdpa-blk
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Heavy but more powerful
> > > > > > > > > > > > > >  Net: netdev + virtio-net + vhost-vdpa
> > > > > > > > > > > > > >  Storage: bdrv + virtio-blk + vhost-vdpa
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] 
> > > > > > > > > > > > > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg797569.html
> > > > > > > > > > > > >
> > > > > > > > > > > > > Stefano presented a plan for vdpa-blk at KVM Forum 
> > > > > > > > > > > > > 2021:
> > > > > > > > > > > > > https://kvmforum2021.sched.com/event/ke3a/vdpa-blk-unified-hardware-and-software-offload-for-virtio-blk-stefano-garzarella-red-hat
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's closer to today's virtio-net + vhost-net 
> > > > > > > > > > > > > approach than the
> > > > > > > > > > > > > vhost-vdpa-blk device you have mentioned. The idea is 
> > > > > > > > > > > > > to treat vDPA as
> > > > > > > > > > > > > an offload feature rather than a completely separate 
> > > > > > > > > > > > > code path that
> > > > > > > > > > > > > needs to be maintained and tested. That way QEMU's 
> > > > > > > > > > > > > block layer features
> > > > > > > > > > > > > and live migration work with vDPA devices and re-use 
> > > > > > > > > > > > > the virtio-blk
> > > > > > > > > > > > > code. The key functionality that has not been 
> > > > > > > > > > > > > implemented yet is a "fast
> > > > > > > > > > > > > path" mechanism that allows the QEMU virtio-blk 
> > > > > > > > > > > > > device's virtqueue to be
> > > > > > > > > > > > > offloaded to vDPA.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The unified vdpa-blk architecture should deliver the 
> > > > > > > > > > > > > same performance
> > > > > > > > > > > > > as the vhost-vdpa-blk device you mentioned but with 
> > > > > > > > > > > > > more features, so I
> > > > > > > > > > > > > wonder what aspects of the vhost-vdpa-blk idea are 
> > > > > > > > > > > > > important to you?
> > > > > > > > > > > > >
> > > > > > > > > > > > > QEMU already has vhost-user-blk, which takes a 
> > > > > > > > > > > > > similar approach as the
> > > > > > > > > > > > > vhost-vdpa-blk device you are proposing. I'm not 
> > > > > > > > > > > > > against the
> > > > > > > > > > > > > vhost-vdpa-blk approach in principle, but would like
> > > > > > > > > > > > > to understand your
> > > > > > > > > > > > > requirements and see if there is a way to collaborate 
> > > > > > > > > > > > > on one vdpa-blk
> > > > > > > > > > > > > implementation instead of dividing our efforts 
> > > > > > > > > > > > > between two.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > We prefer a simple approach for the virtio hardware
> > > > > > > > > > > > offloading case: it reduces our maintenance workload,
> > > > > > > > > > > > since we no longer need to maintain virtio-net, netdev,
> > > > > > > > > > > > virtio-blk, bdrv and so on. And if we need to support
> > > > > > > > > > > > other vdpa devices (such as virtio-crypto or virtio-fs)
> > > > > > > > > > > > in the future, would we also have to maintain the
> > > > > > > > > > > > corresponding device emulation code?
> > > > > > > > > > > >
> > > > > > > > > > > > For the virtio hardware offloading case we usually use
> > > > > > > > > > > > the vfio-pci framework, which saves a lot of maintenance
> > > > > > > > > > > > work in QEMU because we don't need to touch the device
> > > > > > > > > > > > types. Inspired by Jason, what we really prefer is a
> > > > > > > > > > > > "vhost-vdpa-pci/mmio" device used in place of vfio-pci:
> > > > > > > > > > > > it could provide the same performance as vfio-pci, but
> > > > > > > > > > > > it would also make it *possible* to live migrate between
> > > > > > > > > > > > offloading cards from different vendors.
> > > > > > > > > > >
> > > > > > > > > > > OK, so the features you are dropping would be migration
> > > > > > > > > > > between vdpa, vhost and virtio backends. I think given
> > > > > > > > > > > vhost-vdpa-blk it seems fair enough... What do others think?
> > > > > > > > > >
> > > > > > > > > > I think it should be fine, and it would be even better to 
> > > > > > > > > > make it not
> > > > > > > > > > specific to device type.
> > > > > > > > >
> > > > > > > > > That's an interesting idea. A generic vDPA VirtIODevice could
> > > > > > > > > be exposed as
> > > > > > > > >
> > > > > > > > >   --device vhost-vdpa-pci,
> > > > > > > > >            [vhostfd=FD,|
> > > > > > > > >             vhostpath=/dev/vhost-vdpa-N]
> > > > > > > > >
> > > > > > > > > (and for virtio-mmio and virtio-ccw too).
> > > > > > > > >
> > > > > > > > > I don't think this is possible yet because the vhost_vdpa 
> > > > > > > > > ioctls are
> > > > > > > > > missing some introspection functionality. Here is what I 
> > > > > > > > > found:
> > > > > > > > > - Device ID: ok, use VHOST_VDPA_GET_DEVICE_ID
> > > > > > > > > - Device feature bits: ok, use VHOST_GET_BACKEND_FEATURES
> > > > > > > > > - Configuration space size: missing, need ioctl for 
> > > > > > > > > ops->get_config_size()
> > > > > > > >
> > > > > > > > Any specific reason that we need this, considering we already
> > > > > > > > have VHOST_VDPA_GET_CONFIG and we do the size validation there?
> > > > > > >
> > > > > > > QEMU's virtio_init() takes a size_t config_size argument. We need 
> > > > > > > to
> > > > > > > determine the size of the vhost_vdpa's configuration space in 
> > > > > > > order to
> > > > > > > create the VirtIODevice in QEMU.
> > > > > > >
> > > > > > > Do you mean probing by checking for the VHOST_VDPA_GET_CONFIG 
> > > > > > > -E2BIG
> > > > > > > return value? It's hacky but I guess it's possible to do a binary 
> > > > > > > search
> > > > > > > that calls VHOST_VDPA_GET_CONFIG each iteration and reduces the 
> > > > > > > size if
> > > > > > > -E2BIG is returned or increases the size otherwise.
> > > > > > >
> > > > > > > Or do you mean re-writing QEMU's hw/virtio/virtio.c to allow the
> > > > > > > VirtIODevice to override the size and we pass accesses through to
> > > > > > > vhost_vdpa? That way it might be possible to avoid fetching the
> > > > > > > configuration space size at startup, but I'm not sure this will 
> > > > > > > work
> > > > > > > because QEMU might depend on knowing the exact size (e.g. live
> > > > > > > migration).
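
FWIW the hacky probing could look roughly like the following (untested
sketch; it relies on the current kernel rejecting an oversized read with
-E2BIG, which is not a documented contract):

    #include <errno.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /* Binary search for the largest config length the device accepts. */
    static size_t probe_config_size(int vdpa_fd, size_t max_size)
    {
        size_t lo = 0, hi = max_size;

        while (lo < hi) {
            size_t len = (lo + hi + 1) / 2;
            struct vhost_vdpa_config *c = calloc(1, sizeof(*c) + len);
            int ret, err;

            c->off = 0;
            c->len = len;
            ret = ioctl(vdpa_fd, VHOST_VDPA_GET_CONFIG, c);
            err = errno;
            free(c);

            if (ret == 0) {
                lo = len;            /* len bytes readable: size >= len */
            } else if (err == E2BIG) {
                hi = len - 1;        /* too big: size < len */
            } else {
                break;               /* unexpected error, give up */
            }
        }
        return lo;
    }
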
> > > > > >
> > > > > > Good point, so looking at virtio-blk it has:
> > > > > >
> > > > > >     virtio_blk_set_config_size(s, s->host_features);
> > > > > >     virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> > > > > >
> > > > > > I think virtio-blk/net should check the vhost-vdpa features here
> > > > > > and fail if they are not the same?
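
Something along these lines, I guess (rough sketch only, not actual QEMU
code; dev_features stands for whatever the device model computed from the
command-line features):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /*
     * Sketch: return 0 if every feature bit the device model wants
     * (dev_features, derived from the CLI) is also offered by the
     * vhost-vdpa backend; otherwise -1 so realize() can fail instead
     * of silently overriding the config size.
     */
    static int vdpa_check_features(int vdpa_fd, uint64_t dev_features)
    {
        uint64_t backend_features;

        if (ioctl(vdpa_fd, VHOST_GET_FEATURES, &backend_features) < 0) {
            return -1;
        }
        return (dev_features & ~backend_features) ? -1 : 0;
    }
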
> > > > >
> > > > > The vhost feature bit code in QEMU is complicated and I can't respond
> > > > > without investing too much time studying it :).
> > > > >
> > > > > > This looks better than overriding the config_size with what 
> > > > > > vhost-vdpa
> > > > > > provides since it can override the features that the cli tries to
> > > > > > enable.
> > > > >
> > > > > I'm thinking about the generic --device vhost-vdpa idea. QEMU should 
> > > > > not
> > > > > require knowledge of the device feature bits in that case, so it 
> > > > > cannot
> > > > > calculate the configuration space size.
> > > >
> > > > In this case, it looks to me the config size could be deduced from
> > > > VHOST_VDPA_GET_FEATURES?
> > >
> > > I think we're talking about different things, see below...
> > >
> > > > >
> > > > > > >
> > > > > > > > > - Max virtqueue size: ok, VHOST_VDPA_GET_VRING_NUM
> > > > > > > > > - Number of virtqueues: probe using VHOST_GET_VRING_BASE?
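
To make the missing pieces concrete, this is roughly what a generic device
could already query today (untested sketch; the virtqueue count probe just
walks VHOST_GET_VRING_BASE until the kernel refuses an index, which current
vhost-vdpa does with -ENOBUFS, and relying on that is exactly why a
dedicated ioctl would be nicer):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    static void vdpa_probe(int fd)
    {
        uint32_t dev_id = 0;                   /* e.g. VIRTIO_ID_NET */
        uint64_t features = 0;                 /* virtio feature bits */
        uint16_t max_vring_size = 0;
        struct vhost_vring_state s = { 0 };
        unsigned int nvqs;

        ioctl(fd, VHOST_VDPA_GET_DEVICE_ID, &dev_id);
        ioctl(fd, VHOST_GET_FEATURES, &features);
        ioctl(fd, VHOST_VDPA_GET_VRING_NUM, &max_vring_size);

        /* No ioctl for the number of virtqueues: probe until one fails. */
        for (nvqs = 0; ; nvqs++) {
            s.index = nvqs;
            if (ioctl(fd, VHOST_GET_VRING_BASE, &s) < 0) {
                break;
            }
        }

        printf("device id %u, features 0x%llx, max vring size %u, %u vqs\n",
               dev_id, (unsigned long long)features,
               (unsigned)max_vring_size, nvqs);
    }
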
> > > > > > > >
> > > > > > > > I'm not sure whether we need this; it seems unnecessary since
> > > > > > > > it can be deduced from the config space and features.
> > > > > > >
> > > > > > > It can only be deduced in a device-specific way (net, blk, etc). 
> > > > > > > I can't
> > > > > > > think of a way to detect the number of virtqueues for an arbitrary
> > > > > > > VIRTIO device from the features bits and configuration space 
> > > > > > > contents.
> > > > > >
> > > > > > Yes, I'm not against this idea but it looks to me it works even 
> > > > > > without this.
> > > > > >
> > > > > > Modern PCI has num_queues but we don't have things like this in MMIO
> > > > > > and legacy PCI.
> > > > >
> > > > > Even if the VIRTIO hardware interface doesn't expose this information 
> > > > > to
> > > > > the guest, QEMU's VirtIODevice API needs it. Device emulation code 
> > > > > must
> > > > > call virtio_add_queue() to expose virtqueues to the guest.
> > > >
> > > > We don't need this for current multiqueue virtio-net with vhost-vdpa,
> > > > since the queue num is deduced from VHOST_VDPA_GET_CONFIG during the
> > > > initialization of the vhost-vdpa backend.
> > > >
> > > > If we are talking about generic vhost-vdpa-pci, we don't need
> > > > virtio_add_queue() in this case.
> > >
> > > When I say --device vhost-vdpa I mean a VirtIODevice in QEMU that takes
> > > any /dev/vhost-vdpa-N and exposes the device to the guest (over
> > > virtio-pci, virtio-mmio, or virtio-ccw). It's generic because it has no
> > > knowledge of specific device types. This means new device types can be
> > > added without modifying QEMU.
> > >
> > > I think the model you are describing is not generic because it relies on
> > > knowledge of specific device types (net, blk, scsi, etc) so it can
> > > interpret feature bits and configuration space fields.
> >
> > Yes, but what I meant is that in this case qemu can simply relay the
> > set/get config to vhost-vdpa, and the guest driver can enumerate the
> > number of queues correctly based on its own knowledge.
>
> That requires changes to how virtqueues are managed by
> hw/virtio/virtio.c because today the code assumes QEMU knows the number
> of virtqueues. virtio_add_queue() must be called by device emulation
> before the guest driver can configure a virtqueue.

Right.
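
Something like the following would be needed in the realize() of a generic
device (hand-waving sketch; the type and the vq handler are made up, and it
assumes device_id, config_size and num_queues can somehow be obtained from
the backend, which is the introspection gap discussed above):

    /* Hypothetical generic device, not existing QEMU code. */
    static void vhost_vdpa_device_realize(DeviceState *dev, Error **errp)
    {
        VirtIODevice *vdev = VIRTIO_DEVICE(dev);
        VHostVdpaDevice *s = VHOST_VDPA_DEVICE(dev);   /* made-up type */
        int i;

        /* device_id and config_size would have to come from the backend. */
        virtio_init(vdev, "vhost-vdpa", s->device_id, s->config_size);

        for (i = 0; i < s->num_queues; i++) {
            /* The queues exist only so the transport can expose them;
             * the data path is handled by the vDPA device itself. */
            virtio_add_queue(vdev, s->queue_size, vhost_vdpa_device_handle_vq);
        }
    }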

>
> > >
> > > When you originally said "it would be even better to make it not
> > > specific to device type" I thought you meant a generic --device
> > > vhost-vdpa and that's what I've been describing, but in your recent
> > > replies I guess you have a different model in mind.
> > >
> > > Are there reasons why the generic model won't work?
> >
> > I think not.
> >
> > One thing that comes to my mind is that since we provide num_queues via
> > modern virtio-pci, that is probably another argument for having the API
> > you described.
> >
> > For the general vhost-vdpa backend, the only thing that may block us
> > is migration. If we want to make vhost-vdpa type-independent, we first
> > need to investigate a device-independent migration facility in the
> > virtio spec, which is still an open question.
>
> Yes, definitely.
>
> Another challenge with migration is that the generic vhost-vdpa vmstate
> probably won't be compatible with QEMU's virtio-net/blk/scsi/etc
> vmstates. It would be nice if it was possible to migrate between QEMU
> and vDPA device models since they both implement the same device types.
>
> Maybe the solution is for QEMU's virtio device models to switch to the
> new VIRTIO save/load data format once that has been defined in the spec.
> Then the QEMU VirtIODevice vmstate would be:
> 1. QEMU-specific VirtIODevice state (virtqueue state, etc)
> 2. VIRTIO standard device save/load data (virtio-net mac table, etc)

Right. The question is whether we expect the exact byte stream format to
be defined in the spec. It looks to me that it's sufficient to define each
piece of state that is required for live migration and leave the byte
stream format implementation specific. If we manage to do this, there's
still a chance that we can live migrate between those two.

>
> It's still not clear to me how much of the VIRTIO device save/load data
> is implementation-specific. I think the next step forward is to review
> the QEMU vmstates for virtio-net, virtio-gpu, etc to figure out whether
> we can really standardize the save/load data.

Yes, and it should not be hard to have a general load and save based on
key/value pairs which could be defined in the spec. Ideally, it should be
more than enough to enumerate the keys based on the negotiated features.
(But as discussed, virtio-fs and other stateful devices seem more
complicated, and a lot of spec work looks like a prerequisite before we
can support them.)
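
E.g. something as simple as a list of (key, length, value) records might
do (purely illustrative, nothing like this exists in the spec today):

    #include <stdint.h>

    /*
     * Hypothetical device-independent state record. The set of valid
     * keys for a device type would be implied by the negotiated
     * features, e.g. a made-up VIRTIO_NET_STATE_MAC_TABLE key would
     * only exist when the corresponding feature was negotiated.
     */
    struct virtio_state_field {
        uint32_t key;      /* what this record contains */
        uint32_t len;      /* length of value[] in bytes */
        uint8_t  value[];  /* payload, layout defined per key */
    };

Whether the container is exactly such a TLV layout or left implementation
specific is secondary; the interesting part is which keys exist, and that
can be feature-gated.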

Thanks

>
> Stefan



