From: Peter Xu
Subject: Re: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Date: Thu, 17 May 2018 10:45:44 +0800
User-agent: Mutt/1.9.3 (2018-01-21)

On Wed, May 16, 2018 at 09:57:40PM +0800, Jason Wang wrote:
> 
> 
> On 2018/05/16 14:30, Peter Xu wrote:
> > On Fri, May 04, 2018 at 11:08:01AM +0800, Peter Xu wrote:
> > > v2:
> > > - fix patchew code style warnings
> > > - interval tree: postpone malloc when inserting; simplify node
> > >    removal a bit where appropriate [Jason]
> > > - fix up comment and commit message for iommu lock patch [Kevin]
> > > - protect context cache too using the iommu lock [Kevin, Jason]
> > > - add a detailed comment in patch 8 to explain the modify-PTE problem
> > >    [Jason, Kevin]
> > We can hold off a bit on reviewing this series.  Jintack reported a
> > scp DMAR issue that can happen even with an L1 guest with this
> > series: scp can stall randomly after copying tens or hundreds of
> > MBs.  I'm still investigating the problem.  It should be related to
> > the deferred flushing in the VT-d kernel driver, since the problem
> > goes away with "intel_iommu=on,strict".  However, I'm still trying
> > to figure out what is going on behind the scenes with that deferred
> > flushing feature.
> 
> I vaguely remember that recent upstream vfio supports delayed flush;
> maybe it's related.

I'm a bit confused about why vfio is related to the deferred flushing.
Could you provide a pointer for this?

> 
> > 
> > Meanwhile, during the investigation I found another "possibly valid"
> > use case for the modify-PTE problem that Jason has mentioned, when
> > deferred flushing is in use:
> > 
> >        vcpu1                              vcpu2
> >     map page A
> >     explicitly send PSI for A
> >     queue unmap page A [1]
> >                                      map the same page A [2]
> >                                      explicitly send PSI for A [3]
> >     flush unmap page A [4]
> > 
> > Due to deferred flushing, the UNMAP PSI might be postponed (or end
> > up as a DSI) from step [1] to step [4].  If we allocate the same
> > page somewhere else, we can trigger the modify-PTE case at [2],
> > since we haven't yet received the deferred PSI that unmaps A from
> > vcpu1.
> >
> > Note that this will not happen with the latest upstream Linux, since
> > the IOVA allocation algorithm in the current kernel makes sure the
> > IOVA range won't be freed until [4], so we can't really allocate the
> > same page address at [2].
> 
> Yes, so the vfio + vIOMMU work will probably uncover more bugs in the
> IOMMU driver (especially CM mode).  I suspect CM mode does not have
> sufficient test coverage (since it probably wasn't used in any
> production environment before the vfio + vIOMMU work).

Yes, maybe.  I may continue investigating some of this after this
"functional" series.  So my plan is to first make it work functionally
(we can tolerate some trivial bugs, but the scp error doesn't count as
one), and then consider other things.
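
For reference, the deferred flushing behaviour behind the diagram
quoted above can be sketched roughly like this (simplified,
hypothetical C; the real Linux VT-d driver is more involved, with
per-cpu queues and a timer, and the names below are made up):

    #include <stdint.h>

    #define FLUSH_QUEUE_SIZE 256

    struct flush_entry {
        uint64_t iova;      /* start of the unmapped range  */
        uint64_t npages;    /* length of the range in pages */
    };

    static struct flush_entry flush_queue[FLUSH_QUEUE_SIZE];
    static unsigned int flush_count;

    /* Stub: emit the actual PSI (or a DSI) to the IOMMU. */
    static void send_invalidation(uint64_t iova, uint64_t npages)
    {
        (void)iova;
        (void)npages;
    }

    /* Step [4] in the diagram: the queued invalidations finally go out. */
    static void flush_all(void)
    {
        for (unsigned int i = 0; i < flush_count; i++) {
            send_invalidation(flush_queue[i].iova, flush_queue[i].npages);
        }
        flush_count = 0;
    }

    /*
     * Step [1] in the diagram: the unmap is only queued.  Until
     * flush_all() runs, the vIOMMU still caches the old mapping, which
     * is the window in which another vcpu can remap the same IOVA
     * (the modify-PTE case at step [2]).
     */
    static void queue_unmap(uint64_t iova, uint64_t npages)
    {
        flush_queue[flush_count].iova = iova;
        flush_queue[flush_count].npages = npages;
        if (++flush_count == FLUSH_QUEUE_SIZE) {
            flush_all();
        }
    }

With "intel_iommu=on,strict" the guest sends the invalidation
immediately instead of queueing it, which is why the scp stall goes
away in that configuration.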

> 
> >    However, this makes me tend to agree with
> > Jason's and Kevin's worry about potential future issues if this can
> > be triggered easily by common guest kernel bugs.  So now I'm
> > considering dropping my mergeable interval tree in favour of a
> > simpler tree that caches everything, including translated addresses.
> > The metadata would possibly take 2% of managed memory with that
> > approach.
> > 
> 
> Good to know.  We can start with the approach we know is correct for
> sure, then add optimizations on top.

Yes.
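
As a rough sanity check on the 2% estimate quoted above, the per-page
cache node could look something like this (purely illustrative; the
structure and field names are not the actual QEMU code):

    #include <stdint.h>

    /* One cache entry per 4 KiB of mapped IOVA space. */
    typedef struct VTDCachedEntry {
        uint64_t iova;        /* guest IOVA, 4 KiB aligned        */
        uint64_t translated;  /* cached translation result (GPA)  */
        uint32_t perm;        /* access permissions for the entry */
        void *parent, *left, *right;   /* tree linkage            */
    } VTDCachedEntry;

    /*
     * On a 64-bit host this is around 48 bytes (plus allocator and
     * tree overhead) per 4096 bytes of mapped memory, i.e. on the
     * order of 1-2% - in the same ballpark as the figure above.
     */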

Btw, it's still not really "correct" here AFAIU - the correct thing
would be to "modify" the PTE without any invalid DMA window.  Even if
I switch to the other method, we still need to unmap and remap, so we
will still have an invalid window for that DMA range.  For example, if
the Linux kernel freed IOVA ranges right after queueing the unmap in
the current VT-d driver (again, it does not do that today; I'm just
assuming), then my current series would be affected while the other
way would not.  I'll still add a comment there to mark this as a TODO.
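
To make the invalid window concrete, here is a minimal sketch of the
unmap-then-remap sequence (the notifier helpers are stand-ins for
illustration, not the exact QEMU notifier API):

    #include <stdint.h>

    /* Stand-ins for the vIOMMU shadow-page-table notifications. */
    static void notify_unmap(uint64_t iova, uint64_t size)
    {
        (void)iova;
        (void)size;
    }

    static void notify_map(uint64_t iova, uint64_t size, uint64_t gpa)
    {
        (void)iova;
        (void)size;
        (void)gpa;
    }

    /*
     * Replaying a modified PTE: the mapping cannot be switched
     * atomically, so the old entry is torn down before the new one is
     * installed.
     */
    static void shadow_sync_modified_pte(uint64_t iova, uint64_t size,
                                         uint64_t new_gpa)
    {
        notify_unmap(iova, size);        /* stale mapping removed  */
        /*
         * <-- invalid DMA window: DMA to this IOVA range faults
         *     here, even though the guest never intended the range
         *     to become unusable.
         */
        notify_map(iova, size, new_gpa); /* new mapping installed  */
    }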

Thanks,

-- 
Peter Xu


