qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ran


From: Tian, Kevin
Subject: Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
Date: Fri, 27 Apr 2018 23:37:24 +0000

> From: Peter Xu [mailto:address@hidden
> Sent: Friday, April 27, 2018 7:40 PM
> 
> On Fri, Apr 27, 2018 at 05:55:27PM +0800, Peter Xu wrote:
> > On Fri, Apr 27, 2018 at 07:44:07AM +0000, Tian, Kevin wrote:
> > > > From: Peter Xu [mailto:address@hidden
> > > > Sent: Friday, April 27, 2018 3:28 PM
> > > >
> > > > On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
> > > > > > From: Jason Wang [mailto:address@hidden
> > > > > > Sent: Friday, April 27, 2018 2:08 PM
> > > > > >
> > > > > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > > > > For each VTDAddressSpace, now we maintain what IOVA ranges
> we
> > > > have
> > > > > > > mapped and what we have not.  With that information, now we
> only
> > > > > > send
> > > > > > > MAP or UNMAP when necessary.  Say, we don't send MAP
> notifies if
> > > > we
> > > > > > know
> > > > > > > we have already mapped the range, meanwhile we don't send
> > > > UNMAP
> > > > > > notifies
> > > > > > > if we know we never mapped the range at all.
> > > > > > >
> > > > > > > Signed-off-by: Peter Xu <address@hidden>
> > > > > > > ---
> > > > > > >   include/hw/i386/intel_iommu.h |  2 ++
> > > > > > >   hw/i386/intel_iommu.c         | 28
> ++++++++++++++++++++++++++++
> > > > > > >   hw/i386/trace-events          |  2 ++
> > > > > > >   3 files changed, 32 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/hw/i386/intel_iommu.h
> > > > > > b/include/hw/i386/intel_iommu.h
> > > > > > > index 486e205e79..09a2e94404 100644
> > > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > > @@ -27,6 +27,7 @@
> > > > > > >   #include "hw/i386/ioapic.h"
> > > > > > >   #include "hw/pci/msi.h"
> > > > > > >   #include "hw/sysbus.h"
> > > > > > > +#include "qemu/interval-tree.h"
> > > > > > >
> > > > > > >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> > > > > > >   #define INTEL_IOMMU_DEVICE(obj) \
> > > > > > > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> > > > > > >       QLIST_ENTRY(VTDAddressSpace) next;
> > > > > > >       /* Superset of notifier flags that this address space has */
> > > > > > >       IOMMUNotifierFlag notifier_flags;
> > > > > > > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> > > > > > >   };
> > > > > > >
> > > > > > >   struct VTDBus {
> > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > > index a19c18b8d4..8f396a5d13 100644
> > > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > > @@ -768,12 +768,37 @@ typedef struct {
> > > > > > >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > > > > > >                                vtd_page_walk_info *info)
> > > > > > >   {
> > > > > > > +    VTDAddressSpace *as = info->as;
> > > > > > >       vtd_page_walk_hook hook_fn = info->hook_fn;
> > > > > > >       void *private = info->private;
> > > > > > > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > > > > > > +                                   entry->iova + 
> > > > > > > entry->addr_mask);
> > > > > > >
> > > > > > >       assert(hook_fn);
> > > > > > > +
> > > > > > > +    /* Update local IOVA mapped ranges */
> > > > > > > +    if (entry->perm) {
> > > > > > > +        if (mapped) {
> > > > > > > +            /* Skip since we have already mapped this range */
> > > > > > > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > > > > >addr_mask,
> > > > > > > +                                             mapped->start, 
> > > > > > > mapped->end);
> > > > > > > +            return 0;
> > > > > > > +        }
> > > > > > > +        it_tree_insert(as->iova_tree, entry->iova,
> > > > > > > +                       entry->iova + entry->addr_mask);
> > > > > >
> > > > > > I was consider a case e.g:
> > > > > >
> > > > > > 1) map A (iova) to B (pa)
> > > > > > 2) invalidate A
> > > > > > 3) map A (iova) to C (pa)
> > > > > > 4) invalidate A
> > > > > >
> > > > > > In this case, we will probably miss a walk here. But I'm not sure it
> was
> > > > > > allowed by the spec (though I think so).
> > > > > >
> > > >
> > > > Hi, Kevin,
> > > >
> > > > Thanks for joining the discussion.
> > > >
> > > > >
> > > > > I thought it was wrong in a glimpse, but then changed my mind after
> > > > > another thinking. As long as device driver can quiescent the device
> > > > > to not access A (iova) within above window, then above sequence
> > > > > has no problem since any stale mappings (A->B) added before step 4)
> > > > > won't be used and then will get flushed after step 4). Regarding to
> > > > > that actually the 1st invalidation can be skipped:
> > > > >
> > > > > 1) map A (iova) to B (pa)
> > > > > 2) driver programs device to use A
> > > > > 3) driver programs device to not use A
> > > > > 4) map A (iova) to C (pa)
> > > > >       A->B may be still valid in IOTLB
> > > > > 5) invalidate A
> > > > > 6) driver programs device to use A
> > > >
> > > > Note that IMHO this is a bit different from Jason's example, and it'll
> > > > be fine.  Current code should work well with this scenario since the
> > > > emulation code will not aware of the map A until step (5).  Then we'll
> > > > have the correct mapping.
> > >
> > > you are right. we still need the extra PSI otherwise the 1st mapping
> > > is problematic for use. So back to Jason's example.
> > >
> > > >
> > > > While for Jason's example it's exactly the extra PSI that might cause
> > > > stale mappings (though again I think it's still problematic...).
> > >
> > > problematic in software side (e.g. that way IOMMU core relies on
> > > device drivers which conflict with the value of using IOMMU) but
> > > it is OK from hardware p.o.v. btw the extra PSI itself doesn't cause
> > > stale mapping. Instead it is device activity after that PSI may cause it.
> > >
> > > >
> > > > Actually I think I can just fix up the code even if Jason's case
> > > > happens by unmapping that first then remap:
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 31e9b52452..2a9584f9d8 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -778,13 +778,21 @@ static int
> vtd_page_walk_one(IOMMUTLBEntry
> > > > *entry, int level,
> > > >      /* Update local IOVA mapped ranges */
> > > >      if (entry->perm) {
> > > >          if (mapped) {
> > > > -            /* Skip since we have already mapped this range */
> > > > -            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > > >addr_mask,
> > > > -                                             mapped->start, 
> > > > mapped->end);
> > > > -            return 0;
> > > > +            int ret;
> > > > +            /* Cache the write permission */
> > > > +            IOMMUAccessFlags flags = entry->perm;
> > > > +
> > > > +            /* UNMAP the old first then remap.  No need to touch IOVA
> tree */
> > > > +            entry->perm = IOMMU_NONE;
> > > > +            ret = hook_fn(entry, private);
> > > > +            if (ret) {
> > > > +                return ret;
> > > > +            }
> > > > +            entry->perm = flags;
> > > > +        } else {
> > > > +            it_tree_insert(as->iova_tree, entry->iova,
> > > > +                           entry->iova + entry->addr_mask);
> > > >          }
> > > > -        it_tree_insert(as->iova_tree, entry->iova,
> > > > -                       entry->iova + entry->addr_mask);
> > > >      } else {
> > > >          if (!mapped) {
> > > >              /* Skip since we didn't map this range at all */
> > > >
> > > > If we really think it necessary, I can squash this in, though this is
> > > > a bit ugly.  But I just want to confirm whether this would be anything
> > > > we want...
> > > >
> > >
> > > I didn’t look into your actual code yet. If others think above
> > > change is OK then it's definitely good as we conform to hardware
> > > behavior here. Otherwise if there is a way to detect such unusual
> > > usage and then adopt some action (say kill the VM), it's also fine
> > > since user knows he runs a bad OS which is not supported by
> > > Qemu. It's just not good if such situation is not handled, which
> > > leads to some undefined behavior which nobody knows the reason
> > > w/o hard debug into. :-)
> >
> > Yeah, then let me do this. :)
> >
> > Jason, would you be good with above change squashed?
> 
> Self NAK on this...
> 
> More than half of the whole series tries to solve the solo problem
> that we unmapped some pages that were already mapped, which proved
> to
> be wrong.  Now if we squash the change we will do the same wrong
> thing, so we'll still have a very small window that the remapped page
> be missing from a device's POV.
> 
> Now to solve this I suppose we'll need to cache every translation then
> we can know whether a mapping has changed, and we only remap when it
> really has changed.  But I'm afraid that can be a big amount of data
> for nested guests.  For a most common 4G L2 guest, I think the worst
> case (e.g., no huge page at all, no continuous pages) is 4G/4K=1M
> entries in that tree.

I think one key factor to think about is the effect of PSI. From
VT-d spec, all internal caches (IOTLB entries, paging-structure
cache entries, etc.) associated with specified address range
must be flushed accordingly, i.e. no cache on stale mapping.
Now per-device iova ranges is new type of cache introduced
by Qemu VT-d. It doesn't cache actual mapping but its purpose
is to decide whether to notify VFIO for updated mapping. In
this manner if we don't differentiate whether an entry is
for stale mapping, looks the definition of PSI is broken.

ask another question. In reality how much improvement this
patch can bring, i.e. is it usual to see guest map on an already
mapped region, or unmap an already unmapped region?

> 
> Is it really worth it to solve this possibly-buggy-guest-OS problem
> with such a overhead?  I don't know..

If adding overhead removes the benefit of this patch, then 
definitely not a good thing.

> 
> I'm not sure whether it's still acceptable that we put this issue
> aside.  We should know that normal OSs should not do this, and if they
> do, IMHO it proves a buggy OS already (so even from hardware POV we
> allow this, from software POV it should still be problematic), then
> it'll have problem for sure, but only within the VM itself, and it
> won't affect other VMs or the host.  That sounds still reasonable to
> me so far.

As said earlier, what I'm worried is whether there is a way to
detect such case when your assumption is violated. usually
emulation can choose to not implement all the features which
are supported on the real device, but it's done in a way that
non-supported features/behaviors can be reported to guest
(based on spec definition) thus guest knows the expectation
from the emulated device...

> 
> Of course I'm always willing to listen to more advice on this.
> 
> Thanks,
> 
> --
> Peter Xu

reply via email to

[Prev in Thread] Current Thread [Next in Thread]