Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable


From: Alex Williamson
Subject: Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
Date: Fri, 25 Feb 2022 09:15:23 -0700

On Fri, 25 Feb 2022 12:36:24 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/24/22 21:40, Alex Williamson wrote:
> > On Thu, 24 Feb 2022 20:34:40 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >> Of all those cases I would feel the machine property is better,
> >> and more flexible, than having VFIO/VDPA deal with a bad memory layout and
> >> discovering at a late stage that the user is doing something wrong (and thus
> >> failing the DMA_MAP operation for those who do check invalid IOVAs)
> > 
> > The trouble is that anything we can glean from the host system where we
> > instantiate the VM is mostly meaningless relative to data center
> > orchestration.  We're relatively insulated from these sorts of issues
> > on x86 (apparently aside from this case); AIUI ARM is even worse about
> > having arbitrary reserved ranges within its IOVA space.
> >   
> On the multi-socket ARM servers we have, I haven't seen many
> issues /yet/ with VFIO. I only have this reserved region:
> 
> 0x0000000008000000 0x00000000080fffff msi
> 
> But of course ARM servers aren't very good representatives of the
> shifting nature of other ARM machine models. ISTR some thread about GIC ITS
> ranges being reserved by the IOMMU in some hardware. Perhaps that's what you
> might be referring to:
> 
> https://lore.kernel.org/qemu-devel/1510622154-17224-1-git-send-email-zhuyijun@huawei.com/


Right, and notice there also that the msi range is different.  On x86
the msi block is defined by the processor, not the platform, and we have
commonality between Intel and AMD on that range.  We emulate the same
range in the guest, so for any x86 guest running on an x86 host, the
msi range is a non-issue because they overlap due to the architectural
standards.

How do you create an ARM guest that reserves a block at both 0x8000000
for your host and 0xc6000000 for the host in the above link?  Whatever
solution we develop to resolve that issue should equally apply to the
AMD reserved block:

0x000000fd00000000 0x000000ffffffffff reserved
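
(As a rough sketch of the check this relocation is about -- illustrative C
only, not the series' actual code, and the helper names here are made up:
if the guest's above-4G RAM block would intersect the AMD HyperTransport
window listed above, start it at 1T instead.)

    #include <stdbool.h>
    #include <stdint.h>

    #define AMD_HT_START 0xfd00000000ULL   /* first byte of the reserved window */
    #define AMD_HT_END   0xffffffffffULL   /* last byte, just below 1T */
    #define ABOVE_1T     0x10000000000ULL  /* 1 TiB */

    /* True if [base, base + size) intersects [AMD_HT_START, AMD_HT_END]. */
    static bool ram_hits_ht_window(uint64_t base, uint64_t size)
    {
        return base <= AMD_HT_END && base + size > AMD_HT_START;
    }

    /* Hypothetical helper: pick the base of the above-4G RAM block, bumping
     * it up to 1T if it would otherwise run into the HT window. */
    static uint64_t pick_above_4g_base(uint64_t default_base, uint64_t size)
    {
        return ram_hits_ht_window(default_base, size) ? ABOVE_1T : default_base;
    }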

> > For a comprehensive solution, it's not a machine accelerator property
> > or an enable-such-and-such-functionality flag; it's the ability to specify
> > a VM memory map on the QEMU command line and data center orchestration
> > tools gaining insight across all their systems to specify a memory
> > layout that can work regardless of how a VM might be migrated.
> > Maybe there's a "host" option to that memory map command line option that
> > would take into account the common case of a static host or at least
> > homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
> > to generate a least common denominator set such that the VM is
> > compatible to any host in the cluster.
> >   
> 
> I remember you iterated over such an idea on the initial RFC. I do like that
> option of adjusting the memory map... should any new restrictions appear in
> the IOVA space, as opposed to having to change the machine code every time
> that happens.
> 
> 
> I am trying to approach this iteratively, starting by fixing AMD 1T+ guests
> with something that hopefully is less painful to bear and unbreaks users doing
> multi-TB guests on kernels >= 5.4, while for < 5.4 it would no longer wrongly
> DMA-map bad IOVAs that may lead to the guest's own spurious failures.
> For the long term, QEMU would need some sort of handling of a configurable
> sparse map of all guest RAM, which currently does not exist (and it's stuffed
> inside on a per-machine basis, as you're aware). What I am unsure about is the
> churn associated with it (compat, migration, mem-hotplug, nvdimms,
> memory-backends) versus the benefit if it's "just" one class of x86 platforms
> (Intel not affected) -- which is what I find attractive about the past 2
> revisions via a smaller change.
> 
> > On the device end, I really would prefer not to see device-driver-specific
> > enables, where we simply cannot hot-add a device of the given
> > type without a pre-enabled VM.  Give the user visibility into and
> > configurability of the issue and simply fail the device add (ideally
> > with a useful error message) if the device IOVA space cannot support
> > the VM memory layout (this is what vfio already does afaik).
> > 
> > When we have iommufd support common to vfio and vdpa, hopefully we'll
> > also be able to recommend a common means for learning about system and
> > IOMMU restrictions on IOVA spaces.
> 
> Perhaps even advertising platform-wide regions (without a domain allocated)
> that are common to any protection domain (for example, on x86 this is one
> such case, where MSI/HT ranges are hardcoded by Intel/AMD).
> 
> > For now we have reserved_regions
> > reported in sysfs per IOMMU group:
> > 
> >  $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u | grep -v direct-relaxable
> 
> And hopefully iommufd might not want to allow iommu_map() on those reserved
> IOVA regions, as opposed to letting that go through -- essentially what VFIO
> does. Unless, of course, there's actually a case where iommu_map() of reserved
> regions is required (which I don't know of).
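
(As a rough illustration of the reserved_regions format quoted above and of
the kind of validation being discussed -- a sketch only, not vfio or iommufd
code: each line of /sys/kernel/iommu_groups/<N>/reserved_regions reads
"<start> <end> <type>", and a mapping request that intersects any such range
would be refused.)

    #include <glob.h>
    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Return true if [iova, iova + size) overlaps any reserved region
     * reported in sysfs.  Sketch only: real code would look at the groups
     * the device actually belongs to and handle errors properly. */
    static bool iova_range_is_reserved(uint64_t iova, uint64_t size)
    {
        glob_t g;
        bool hit = false;

        if (glob("/sys/kernel/iommu_groups/*/reserved_regions", 0, NULL, &g)) {
            return false;
        }
        for (size_t i = 0; i < g.gl_pathc && !hit; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            uint64_t start, end;
            char type[32];

            if (!f) {
                continue;
            }
            /* Each line reads "<start> <end> <type>", e.g.
             * "0x000000fd00000000 0x000000ffffffffff reserved". */
            while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s",
                          &start, &end, type) == 3) {
                if (iova <= end && iova + size > start) {
                    hit = true;
                    break;
                }
            }
            fclose(f);
        }
        globfree(&g);
        return hit;
    }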

iommufd is being designed to support a direct replacement for the
vfio-specific type1 IOMMU backend, so it will need to have this feature.
Allowing userspace to create invalid mappings would be irresponsible.

I'd tend to agree with MST's recommendation for a more piece-wise
solution: tie the memory map to the vCPU vendor, rather than to some
property of the host, to account for this reserved range on AMD.  Thanks,

Alex
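
(Purely as an illustration of that last suggestion -- a sketch, not the
series' code, and the function name is made up: key the relocation off the
vendor of the guest CPU model rather than off any property of the host.)

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical helper: apply the 1T relocation only when the *guest*
     * CPU is an AMD model, since the HT window below 1T is an AMD-specific
     * reservation. */
    static bool guest_needs_1t_relocation(const char *vcpu_vendor,
                                          bool ram_reaches_ht_window)
    {
        return ram_reaches_ht_window &&
               strcmp(vcpu_vendor, "AuthenticAMD") == 0;
    }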



