Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary


From: Igor Mammedov
Subject: Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
Date: Mon, 28 Jun 2021 17:21:50 +0200

On Mon, 28 Jun 2021 14:43:48 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/28/21 2:25 PM, Igor Mammedov wrote:
> > On Wed, 23 Jun 2021 14:07:29 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 6/23/21 1:09 PM, Igor Mammedov wrote:  
> >>> On Wed, 23 Jun 2021 10:51:59 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>     
> >>>> On 6/23/21 10:03 AM, Igor Mammedov wrote:    
> >>>>> On Tue, 22 Jun 2021 16:49:00 +0100
> >>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>>>       
> >>>>>> It is assumed that the whole GPA space is available to be
> >>>>>> DMA addressable, within a given address space limit. Since
> >>>>>> kernel v5.4 that is no longer true, and VFIO will validate
> >>>>>> whether the selected IOVA is indeed valid, i.e. not reserved
> >>>>>> by the IOMMU on behalf of specific devices or the platform.
> >>>>>>
> >>>>>> AMD systems with an IOMMU are examples of such platforms and
> >>>>>> particularly may export only these ranges as allowed:
> >>>>>>
> >>>>>>        0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >>>>>>        00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >>>>>>        0000010000000000 - ffffffffffffffff (1Tb    .. 16EB)
> >>>>>>
> >>>>>> We already account for the 4G hole, but if the guest is big
> >>>>>> enough we will fail to allocate more than 1010G of RAM, given
> >>>>>> the ~12G hole at the 1Tb boundary reserved for HyperTransport.
> >>>>>>
> >>>>>> When creating the region above 4G, take into account which
> >>>>>> IOVAs are allowed by defining the known allowed ranges and
> >>>>>> searching for the next free IOVA range. When an invalid IOVA
> >>>>>> is found, mark it as reserved and proceed to the next allowed
> >>>>>> IOVA region.
> >>>>>>
> >>>>>> After accounting for the 1Tb hole on AMD hosts, mtree should
> >>>>>> look like:
> >>>>>>
> >>>>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >>>>>>        alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >>>>>> 0000010000000000-000001037fffffff (prio 0, i/o):
> >>>>>>        alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff    
> >>>>>>   
> >>>>>
> >>>>> You are talking here about GPA, which is a guest-specific thing,
> >>>>> and then somehow it becomes tied to the host. For bystanders it's
> >>>>> not clear from the above commit message how the two are related.
> >>>>> I'd add an explicit explanation of how an AMD host relates to GPAs,
> >>>>> and clarify where you are talking about the guest vs. the host side.
> >>>>>       
> >>>> OK, makes sense.
> >>>>
> >>>> Perhaps using IOVA makes it easier to understand. I said GPA because
> >>>> there's a 1:1 mapping between GPA and IOVA (if you're not using
> >>>> vIOMMU).
> >>>
> >>> IOVA may be too broad a term; maybe explain it in terms of GPA and HPA
> >>> and why it matters on each side (host/guest).
> >>>     
> >>
> >> I used the term IOVA specifically because it applies to both host IOVA
> >> and guest IOVA (the same rules apply, as this is not special-cased for
> >> VMs). So, regardless of whether we have guest-mode page tables or just
> >> host IOMMU page tables, this address range should be reserved and not used.
> > 
> > IOVA doesn't make it any clearer; on the contrary, it's more confusing.
> > 
> > And does the host's HPA matter at all? (If the host's firmware isn't
> > broken, it should never use nor advertise the 1Tb hole.)
> > So we are probably talking here about GPA only.
> >   
> For the case at hand in this series, yes, it's only GPA that we care about.
> 
> Perhaps I misunderstood your earlier comment where you said how HPAs were
> affected, so I was trying to frame the problem statement in a guest/host
> agnostic manner by using IOVA, given this is all related to IOMMU reserved
> ranges. I'll stick to GPA to avoid any confusion -- as that's what matters
> for this series.

Even better would be to add a reference to the spec where it says so.
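
(For reference, a rough sketch of how the split in the mtree output quoted
above could be expressed with the memory API -- purely illustrative, not the
actual patch; the helper name, the AMD_* constants and the 2G of RAM below
4G are assumptions of the example.)

/* Illustrative sketch only: place the above-4G RAM as two pc.ram aliases
 * around the AMD HyperTransport hole, matching the mtree output above. */

#include "qemu/osdep.h"
#include "exec/memory.h"

#define AMD_HT_START    0xfd00000000ULL   /* ~1012G, start of the HT hole */
#define AMD_ABOVE_1TB   0x10000000000ULL  /* 1Tb, first allowed GPA again */
#define ABOVE_4G_BASE   0x100000000ULL    /* 4G                           */
#define BELOW_4G_RAM    0x80000000ULL     /* assume 2G of RAM below 4G    */

static void sketch_pc_memory_above_4g(MemoryRegion *system_memory,
                                      MemoryRegion *pc_ram,
                                      uint64_t above_4g_mem_size)
{
    /* How much of the above-4G RAM fits below the start of the HT hole */
    uint64_t below_ht = MIN(above_4g_mem_size, AMD_HT_START - ABOVE_4G_BASE);
    MemoryRegion *ram_above_4g = g_new(MemoryRegion, 1);

    memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g", pc_ram,
                             BELOW_4G_RAM, below_ht);
    memory_region_add_subregion(system_memory, ABOVE_4G_BASE, ram_above_4g);

    if (above_4g_mem_size > below_ht) {
        /* Remainder is aliased above the 1Tb boundary, skipping the hole */
        MemoryRegion *ram_above_1t = g_new(MemoryRegion, 1);

        memory_region_init_alias(ram_above_1t, NULL, "ram-above-1t", pc_ram,
                                 BELOW_4G_RAM + below_ht,
                                 above_4g_mem_size - below_ht);
        memory_region_add_subregion(system_memory, AMD_ABOVE_1TB,
                                    ram_above_1t);
    }
}

The point being that pc.ram itself stays contiguous; only the aliases into
it are placed around the hole.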

> 
> >>>>> Also, what about these use cases:
> >>>>>  * start QEMU with an Intel CPU model on an AMD host with Intel's IOMMU
> >>>>
> >>>> In principle it would be less likely to occur. But you would still need
> >>>> to mark the same range as reserved. The limitation is on DMA occurring
> >>>> on those IOVAs (host or guest) coinciding with that range, so you would
> >>>> want to inform the guest that at least those should be avoided.
> >>>>    
> >>>>>  * start QEMU with an AMD CPU model and AMD's IOMMU on an Intel host
> >>>>
> >>>> Here you would probably only mark the range, solely to honor how
> >>>> hardware is usually represented. But really, on Intel, nothing stops
> >>>> you from exposing the aforementioned range as RAM.
> >>>>    
> >>>>>  * start QEMU in TCG mode on an AMD host (mostly from a qtest point of view)
> >>>>>       
> >>>> This one is tricky. Because you can hotplug a VFIO device later on,
> >>>> I opted for always marking the reserved range. If you don't use VFIO
> >>>> you're good, but otherwise you would still need the range reserved.
> >>>> But I am not sure how qtest is used today for testing huge guests.
> >>> I do not know if there are VFIO tests in qtest (probably not, since that
> >>> would require a host configured for it), but we can add a test
> >>> for this memory quirk (assuming phys-bits won't get in the way).
> >>>     
> >>
> >>    Joao
> >>  
> >   
> 
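
Just to sketch what such a qtest could look like -- entirely illustrative,
not an existing test; it assumes '-cpu EPYC' provides enough phys-bits, that
the (mostly untouched) 1100G allocation is acceptable on the test host, and
that grepping HMP "info mtree" output for the high alias is good enough:

/* Illustrative only: boot a guest large enough to cross the 1Tb boundary
 * and check that the high RAM alias shows up in the memory tree. */

#include "qemu/osdep.h"
#include "libqtest.h"

static void test_amd_1tb_hole(void)
{
    QTestState *qts = qtest_init("-machine q35 -cpu EPYC -m 1100G");
    char *mtree = qtest_hmp(qts, "info mtree");

    /* RAM above the HyperTransport hole should be a separate alias */
    g_assert(strstr(mtree, "ram-above-1t"));

    g_free(mtree);
    qtest_quit(qts);
}

int main(int argc, char **argv)
{
    g_test_init(&argc, &argv, NULL);
    qtest_add_func("/x86/amd/1tb-hole", test_amd_1tb_hole);
    return g_test_run();
}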



