Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator
From: Eugenio Perez Martin
Subject: Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator
Date: Fri, 30 Aug 2024 16:50:19 +0200
On Fri, Aug 30, 2024 at 3:52 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
>
>
> On 8/30/24 4:05 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >>
> >> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer <jonah.palmer@oracle.com>
> >>> wrote:
> >>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>
> >>>> A new API function vhost_iova_tree_insert() is also created to add a
> >>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>
> >>> I think this is a good first iteration but we can take steps to
> >>> simplify it. Also, it is great to be able to make points on real code
> >>> instead of designs on the air :).
> >>>
>
> I can add more comments in the code if this is what you mean, no problem!
>
No action needed for this feedback :). I just meant that it will be
easier to iterate on code than to keep designing just by talking at
this stage.
> >>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>> similar. Similarly, a vhost_iova_tree_remove and
> >>> vhost_iova_tree_remove_gpa would be needed.
> >>>
> >>> The first one is used for regions that don't exist in the guest, like
> >>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>> memory listener to map the guest regions into the vdpa device.
> >>>
> >>> Implementation wise, only two trees are actually needed:
> >>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>> seen by the device, so both allocation functions can work on a single
> >>> tree. The function iova_tree_find_iova keeps using this one, so the
> >> I thought we had thorough discussion about this and agreed upon the
> >> decoupled IOVA allocator solution.
> >
> > My interpretation of it is to leave the allocator as the current one,
> > and create a new tree with GPA which is guaranteed to be unique. But
> > we can talk over it of course.
> >
>
> So you mean keep the full IOVA->HVA tree but also have a GPA->IOVA tree
> as well for guest memory regions, correct?
>
Right.
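
Something like this layout is what I have in mind, just as a sketch
(iova_gpa_map is a placeholder name, not something from this series):

struct VhostIOVATree {
    /* First addressable IOVA of the device */
    hwaddr iova_first;
    /* Last addressable IOVA of the device */
    hwaddr iova_last;

    /* IOVA -> HVA translations as the device sees them (SVQ + guest mem) */
    IOVATree *iova_taddr_map;

    /* Guest memory only, to locate / remove allocated ranges by GPA */
    IOVATree *iova_gpa_map;
};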
> >> But maybe I missed something earlier,
> >> I am not clear how come this iova_tree_find_iova function could still
> >> work with the full IOVA->HVA tree when it comes to aliased memory or
> >> overlapped HVAs? Granted, for the memory map removal in the
> >> .region_del() path, we could rely on the GPA tree to locate the
> >> corresponding IOVA, but how come the translation path could figure out
> >> which IOVA range to return when the vaddr happens to fall in an
> >> overlapped HVA range?
> >
> > That is not a problem, as they both translate to the same address at the
> > device.
> >
> > The most complicated situation is where we have a region contained in
> > another region, and the requested buffer crosses them. If the IOVA
> > tree returns the inner region, it will return the buffer chained with
> > the rest of the content in the outer region. Not optimal, but solved
> > either way.
> >
> > The only problem that comes to my mind is the case where the inner
> > region is RO and it is a write command, but I don't think we have this
> > case in a sane guest. A malicious guest cannot do any harm this way
> > anyway.
> >
> >> Do we still assume some overlapping order so we
> >> always return the first match from the tree? Or we expect every current
> >> user of iova_tree_find_iova should pass in GPA rather than HVA and use
> >> the vhost_iova_xxx_gpa API variant to look up IOVA?
> >>
> >
> > No, iova_tree_find_iova should keep asking for vaddr, as the result is
> > guaranteed to be there. Users of VhostIOVATree only need to modify how
> > they add or remove regions, knowing if they come from the guest or
> > not. As shown by this series, it is easier to do in that place than in
> > translation.
> >
> >> Thanks,
> >> -Siwei
> >>
> >>> user does not need to know if the address is from the guest or only
> >>> exists in QEMU by using RAMBlock etc. All insert and remove functions
> >>> use this tree.
> >>> * A new tree that relates IOVA to GPA, that only
> >>> vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
> >>>
> >>> The ideal case is that the key in this new tree is the GPA and the
> >>> value is the IOVA. But IOVATree's DMA is named the reverse: iova is
> >>> the key and translated_addr is the vaddr. We can create a new tree
> >>> struct for that, use GTree directly, or translate the reverse
> >>> linearly. As memory add / remove should not be frequent, I think the
> >>> simpler is the last one, but I'd be ok with creating a new tree.
> >>>
>
> Is the concern here that making the gpa_map (GPA->IOVA tree) of type
> IOVATree can be confusing for users from an API perspective?
>
> In other words, IOVATree users should always use its iova member as the
> key and the translated_addr member as the value (and thus IOVATree
> gpa_map feels out of place since it uses the translated_addr as the key
> and iova as the value)?
>
Totally right.
> Also, could you elaborate a bit more on "translate the reverse
> linearly"? Do you mean to create an IOVA->GPA tree but always search the
> tree using the GPA (e.g. via iova_tree_find_iova)?
>
Yes, that's the other option. I'm comparing the two options here:
* IOVA->GPA has the advantage of reusing IOVATree, but lookups by GPA are O(N).
* GPA->IOVA is more natural, but we cannot reuse IOVATree in a simple way.
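
To make the first option concrete, the linear lookup could simply reuse
iova_tree_find_iova, storing the GPA in translated_addr. A sketch, with
the iova_gpa_map placeholder from before:

static const DMAMap *vhost_iova_tree_find_gpa(const VhostIOVATree *tree,
                                              const DMAMap *map)
{
    /*
     * map->translated_addr carries the GPA here. iova_tree_find_iova()
     * compares translated_addr on every node, so this lookup is O(N),
     * which should be fine since memory add / remove is not a hot path.
     */
    return iova_tree_find_iova(tree->iova_gpa_map, map);
}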
> >>> vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
> >>> also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
> >>> this tree, and only remove the associated DMAMap in iova_taddr_map
> >>> that matches the IOVA.
> >>>
> >>> Does it make sense to you?
>
> Would using a name like vhost_iova_tree_map_alloc_gpa seem a bit
> misleading given that we're already allocating the IOVA range in
> vhost_iova_tree_map_alloc?
With vhost_iova_tree_map_alloc_gpa there is no need to call
vhost_iova_tree_map_alloc. It merges the two operations, so the caller
does not need to know the internals.
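
Just to illustrate the shape of it, a sketch (the gpa parameter and the
iova_gpa_map member are assumptions for the example, not final names):

int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map,
                                  hwaddr gpa)
{
    DMAMap gpa_map;
    int r;

    /* Allocate the IOVA range and add the IOVA->HVA entry as today */
    r = vhost_iova_tree_map_alloc(tree, map);
    if (unlikely(r != IOVA_OK)) {
        return r;
    }

    /* Record the same IOVA range against the GPA in the second tree */
    gpa_map = *map;
    gpa_map.translated_addr = gpa;
    return iova_tree_insert(tree->iova_gpa_map, &gpa_map);
}

The memory listener would then call only this function from
.region_add(), while SVQ vrings and CVQ buffers keep using
vhost_iova_tree_map_alloc.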
> It seems this would be more of an insertion
> rather than an allocation when adding a map to the GPA->IOVA tree.
>
Let's put it another way: why complicate IOVATree or VhostIOVATree by
integrating the GPA->IOVA or IOVA->GPA tree if we could simply store
that information in the caller and make each struct simpler?
If we abstract away the two trees under a coherent API, the struct
makes sense. If not, it would be better to let the caller handle the
information.
> Also, are you saying that vhost_iova_tree_remove_gpa only removes the
> DMAMap in the IOVA->HVA tree or should it also remove the corresponding
> mapping in the GPA->IOVA tree?
>
It should remove it in both.
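
Roughly like this (a sketch; it assumes the caller passes the GPA in
map.translated_addr, and the iova_gpa_map placeholder again):

void vhost_iova_tree_remove_gpa(VhostIOVATree *tree, DMAMap map)
{
    const DMAMap *gpa_entry;

    /* Locate the allocated IOVA range by GPA... */
    gpa_entry = iova_tree_find_iova(tree->iova_gpa_map, &map);
    if (!gpa_entry) {
        return;
    }
    map.iova = gpa_entry->iova;

    /* ...and drop the matching DMAMap from both trees */
    iova_tree_remove(tree->iova_gpa_map, map);
    iova_tree_remove(tree->iova_taddr_map, map);
}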
> >>>
> >>>> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com>
> >>>> ---
> >>>> hw/virtio/vhost-iova-tree.c | 38 ++++++++++++++++++++++++++++++++-----
> >>>> hw/virtio/vhost-iova-tree.h | 1 +
> >>>> hw/virtio/vhost-vdpa.c | 31 ++++++++++++++++++++++++------
> >>>> net/vhost-vdpa.c | 13 +++++++++++--
> >>>> 4 files changed, 70 insertions(+), 13 deletions(-)
> >>>>
> >>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> >>>> index 3d03395a77..32c03db2f5 100644
> >>>> --- a/hw/virtio/vhost-iova-tree.c
> >>>> +++ b/hw/virtio/vhost-iova-tree.c
> >>>> @@ -28,12 +28,17 @@ struct VhostIOVATree {
> >>>>
> >>>> /* IOVA address to qemu memory maps. */
> >>>> IOVATree *iova_taddr_map;
> >>>> +
> >>>> + /* IOVA tree (IOVA allocator) */
> >>>> + IOVATree *iova_map;
> >>>> };
> >>>>
> >>>> /**
> >>>> - * Create a new IOVA tree
> >>>> + * Create a new VhostIOVATree with a new set of IOVATree's:
> >>> s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
> >>> implementation detail.
> >>>
>
> Gotcha. Would you like me to remove the other comments then?
>
Yes, it is better to remove the ones that expose internals. We should
be able to change these kinds of details without the users of
VhostIOVATree noticing.
> >>>> + * - IOVA allocator (iova_map)
> >>>> + * - IOVA->HVA tree (iova_taddr_map)
> >>>> *
> >>>> - * Returns the new IOVA tree
> >>>> + * Returns the new VhostIOVATree
> >>>> */
> >>>> VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr
> >>>> iova_last)
> >>>> {
> >>>> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first,
> >>>> hwaddr iova_last)
> >>>> tree->iova_last = iova_last;
> >>>>
> >>>> tree->iova_taddr_map = iova_tree_new();
> >>>> + tree->iova_map = iova_tree_new();
> >>>> return tree;
> >>>> }
> >>>>
> >>>> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first,
> >>>> hwaddr iova_last)
> >>>> void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
> >>>> {
> >>>> iova_tree_destroy(iova_tree->iova_taddr_map);
> >>>> + iova_tree_destroy(iova_tree->iova_map);
> >>>> g_free(iova_tree);
> >>>> }
> >>>>
> >>>> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree,
> >>>> DMAMap *map)
> >>>> /* Some vhost devices do not like addr 0. Skip first page */
> >>>> hwaddr iova_first = tree->iova_first ?:
> >>>> qemu_real_host_page_size();
> >>>>
> >>>> - if (map->translated_addr + map->size < map->translated_addr ||
> >>> Why remove this condition? If the request is invalid we still need to
> >>> return an error here.
> >>>
> >>> Maybe we should move it to iova_tree_alloc_map though.
> >>>
>
> This series decoupled the IOVA allocator from also adding a mapping to
> the IOVA->HVA tree and instead added IOVA ranges only to an IOVA-only
> tree. So no value existed under translated_addr for these mappings
> specifically.
>
> This check was moved to vhost_iova_tree_insert since that function
> covered adding the host-only memory mappings to the IOVA->SVQ HVA tree.
>
Ok, I missed that then. Can you extract that into a separate patch? I
think it makes sense by itself.
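
For reference, the check we are talking about is only the size-overflow
guard, something as small as this (a sketch, not the actual patch):

/* Reject maps whose size wraps around the end of the address space */
static bool dma_map_size_overflows(const DMAMap *map)
{
    return map->translated_addr + map->size < map->translated_addr;
}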
> >>>> - map->perm == IOMMU_NONE) {
> >>>> + if (map->perm == IOMMU_NONE) {
> >>>> return IOVA_ERR_INVALID;
> >>>> }
> >>>>
> >>>> /* Allocate a node in IOVA address */
> >>>> - return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> >>>> + return iova_tree_alloc_map(tree->iova_map, map, iova_first,
> >>>> tree->iova_last);
> >>>> }
> >>>>
> >>>> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree,
> >>>> DMAMap *map)
> >>>> void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
> >>>> {
> >>>> iova_tree_remove(iova_tree->iova_taddr_map, map);
> >>>> + iova_tree_remove(iova_tree->iova_map, map);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * Insert a new mapping to the IOVA->HVA tree
> >>>> + *
> >>>> + * @tree: The VhostIOVATree
> >>>> + * @map: The iova map
> >>>> + *
> >>>> + * Returns:
> >>>> + * - IOVA_OK if the map fits in the container
> >>>> + * - IOVA_ERR_INVALID if the map does not make sense (like size
> >>>> overflow)
> >>>> + * - IOVA_ERR_OVERLAP if the IOVA range overlaps with an existing range
> >>>> + */
> >>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
> >>>> +{
> >>>> + if (map->translated_addr + map->size < map->translated_addr ||
> >>>> + map->perm == IOMMU_NONE) {
> >>>> + return IOVA_ERR_INVALID;
> >>>> + }
> >>>> +
> >>>> + return iova_tree_insert(iova_tree->iova_taddr_map, map);
> >>>> }
> >>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> >>>> index 4adfd79ff0..8bf7b64786 100644
> >>>> --- a/hw/virtio/vhost-iova-tree.h
> >>>> +++ b/hw/virtio/vhost-iova-tree.h
> >>>> @@ -23,5 +23,6 @@ const DMAMap *vhost_iova_tree_find_iova(const
> >>>> VhostIOVATree *iova_tree,
> >>>> const DMAMap *map);
> >>>> int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
> >>>> void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map);
> >>>> +int vhost_iova_tree_insert(VhostIOVATree *iova_tree, DMAMap *map);
> >>>>
> >>>> #endif
> >>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>>> index 3cdaa12ed5..6702459065 100644
> >>>> --- a/hw/virtio/vhost-vdpa.c
> >>>> +++ b/hw/virtio/vhost-vdpa.c
> >>>> @@ -361,10 +361,10 @@ static void
> >>>> vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>> if (s->shadow_data) {
> >>>> int r;
> >>>>
> >>>> - mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,
> >>>> mem_region.size = int128_get64(llsize) - 1,
> >>>> mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> >>>>
> >>>> + /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>> r = vhost_iova_tree_map_alloc(s->iova_tree, &mem_region);
> >>>> if (unlikely(r != IOVA_OK)) {
> >>>> error_report("Can't allocate a mapping (%d)", r);
> >>>> @@ -372,6 +372,14 @@ static void
> >>>> vhost_vdpa_listener_region_add(MemoryListener *listener,
> >>>> }
> >>>>
> >>>> iova = mem_region.iova;
> >>>> +
> >>>> + /* Add mapping to the IOVA->HVA tree */
> >>>> + mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr;
> >>>> + r = vhost_iova_tree_insert(s->iova_tree, &mem_region);
> >>>> + if (unlikely(r != IOVA_OK)) {
> >>>> + error_report("Can't add listener region mapping (%d)", r);
> >>>> + goto fail_map;
> >>>> + }
> >>> I'd say it is not intuitive for the caller code.
> >>>
>
> Sorry, I'm not sure what you mean by this here. Would you mind
> elaborating a bit more?
>
If VhostIOVATree gets reused, it is easy to miss the
vhost_iova_tree_insert call after the allocation.
If we're going to expose this second GPA tree, I think it would be
better to place it directly in VhostShadowVirtqueue. All the
conditionals fit better that way and we don't make VhostIOVATree /
IOVATree harder to reuse. However, I keep thinking it is easy enough
to hide in VhostIOVATree.
> >>>> }
> >>>>
> >>>> vhost_vdpa_iotlb_batch_begin_once(s);
> >>>> @@ -1142,19 +1150,30 @@ static void vhost_vdpa_svq_unmap_rings(struct
> >>>> vhost_dev *dev,
> >>>> *
> >>>> * @v: Vhost-vdpa device
> >>>> * @needle: The area to search iova
> >>>> + * @taddr: The translated address (SVQ HVA)
> >>>> * @errorp: Error pointer
> >>>> */
> >>>> static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, DMAMap
> >>>> *needle,
> >>>> - Error **errp)
> >>>> + hwaddr taddr, Error **errp)
> >>>> {
> >>>> int r;
> >>>>
> >>>> + /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>> r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
> >>>> if (unlikely(r != IOVA_OK)) {
> >>>> error_setg(errp, "Cannot allocate iova (%d)", r);
> >>>> return false;
> >>>> }
> >>>>
> >>>> + /* Add mapping to the IOVA->HVA tree */
> >>>> + needle->translated_addr = taddr;
> >>>> + r = vhost_iova_tree_insert(v->shared->iova_tree, needle);
> >>>> + if (unlikely(r != IOVA_OK)) {
> >>>> + error_setg(errp, "Cannot add SVQ vring mapping (%d)", r);
> >>>> + vhost_iova_tree_remove(v->shared->iova_tree, *needle);
> >>>> + return false;
> >>>> + }
> >>>> +
> >>>> r = vhost_vdpa_dma_map(v->shared, v->address_space_id,
> >>>> needle->iova,
> >>>> needle->size + 1,
> >>>> (void *)(uintptr_t)needle->translated_addr,
> >>>> @@ -1192,11 +1211,11 @@ static bool vhost_vdpa_svq_map_rings(struct
> >>>> vhost_dev *dev,
> >>>> vhost_svq_get_vring_addr(svq, &svq_addr);
> >>>>
> >>>> driver_region = (DMAMap) {
> >>>> - .translated_addr = svq_addr.desc_user_addr,
> >>>> .size = driver_size - 1,
> >>>> .perm = IOMMU_RO,
> >>>> };
> >>>> - ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> >>>> + ok = vhost_vdpa_svq_map_ring(v, &driver_region,
> >>>> svq_addr.desc_user_addr,
> >>>> + errp);
> >>>> if (unlikely(!ok)) {
> >>>> error_prepend(errp, "Cannot create vq driver region: ");
> >>>> return false;
> >>>> @@ -1206,11 +1225,11 @@ static bool vhost_vdpa_svq_map_rings(struct
> >>>> vhost_dev *dev,
> >>>> addr->avail_user_addr = driver_region.iova + avail_offset;
> >>>>
> >>>> device_region = (DMAMap) {
> >>>> - .translated_addr = svq_addr.used_user_addr,
> >>>> .size = device_size - 1,
> >>>> .perm = IOMMU_RW,
> >>>> };
> >>>> - ok = vhost_vdpa_svq_map_ring(v, &device_region, errp);
> >>>> + ok = vhost_vdpa_svq_map_ring(v, &device_region,
> >>>> svq_addr.used_user_addr,
> >>>> + errp);
> >>>> if (unlikely(!ok)) {
> >>>> error_prepend(errp, "Cannot create vq device region: ");
> >>>> vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
> >>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> >>>> index 03457ead66..81da956b92 100644
> >>>> --- a/net/vhost-vdpa.c
> >>>> +++ b/net/vhost-vdpa.c
> >>>> @@ -512,15 +512,24 @@ static int vhost_vdpa_cvq_map_buf(struct
> >>>> vhost_vdpa *v, void *buf, size_t size,
> >>>> DMAMap map = {};
> >>>> int r;
> >>>>
> >>>> - map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>> map.size = size - 1;
> >>>> map.perm = write ? IOMMU_RW : IOMMU_RO,
> >>>> +
> >>>> + /* Allocate IOVA range and add the mapping to the IOVA tree */
> >>>> r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &map);
> >>>> if (unlikely(r != IOVA_OK)) {
> >>>> - error_report("Cannot map injected element");
> >>>> + error_report("Cannot allocate IOVA range for injected element");
> >>>> return r;
> >>>> }
> >>>>
> >>>> + /* Add mapping to the IOVA->HVA tree */
> >>>> + map.translated_addr = (hwaddr)(uintptr_t)buf;
> >>>> + r = vhost_iova_tree_insert(v->shared->iova_tree, &map);
> >>>> + if (unlikely(r != IOVA_OK)) {
> >>>> + error_report("Cannot map injected element into IOVA->HVA tree");
> >>>> + goto dma_map_err;
> >>>> + }
> >>>> +
> >>>> r = vhost_vdpa_dma_map(v->shared, v->address_space_id, map.iova,
> >>>> vhost_vdpa_net_cvq_cmd_page_len(), buf,
> >>>> !write);
> >>>> if (unlikely(r < 0)) {
> >>>> --
> >>>> 2.43.5
> >>>>
> >>
> >
>