From: Alex Williamson
Subject: Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
Date: Thu, 24 Feb 2022 14:40:53 -0700

On Thu, 24 Feb 2022 20:34:40 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/24/22 20:12, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:  
> >> On 2/24/22 19:54, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:  
> >>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:  
> >>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:  
> >>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:  
> >>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:  
> >>>>>>>> On 2/23/22 23:35, Joao Martins wrote:  
> >>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:  
> >>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>>>>>> +
> >>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>>>>> +        return;
> >>>>>>>>>>> +    }  
> >>>>>>>>>>
> >>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>>>>>> It's really about what we present to the guest though,
> >>>>>>>>>> isn't it?  
> >>>>>>>>>
> >>>>>>>>> It was the easier catch-all to use cpuid without going into
> >>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there; it is only
> >>>>>>>>> needed for systems with an IOMMU present.
> >>>>>>>>>  
> >>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>>>>>  
> >>>>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
> >>>>>>>>> in util/vfio-helpers.c which checks if there are any sysfs child entries
> >>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region has
> >>>>>>>>> been hardcoded in the IOMMU reserved regions since >=4.11 (to latest),
> >>>>>>>>> I don't think it's even worth checking that the range exists in:
> >>>>>>>>>
> >>>>>>>>>     /sys/kernel/iommu_groups/0/reserved_regions
> >>>>>>>>>
> >>>>>>>>> (Also that sysfs ABI is >= 4.11 only)  
> >>>>>>>>
> >>>>>>>> Here's what I have staged in local tree, to address your comment.
> >>>>>>>>
> >>>>>>>> Naturally the first chunk is what's affected by this patch; the rest is a
> >>>>>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). It seems to
> >>>>>>>> pass all the tests and whatnot.
> >>>>>>>>
> >>>>>>>> I am not entirely sure this is the right place to put such a helper; open
> >>>>>>>> to suggestions. Regarding the naming of the helper, I tried to follow the
> >>>>>>>> rest of the file's style.
> >>>>>>>>
> >>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>>>>>> --- a/hw/i386/pc.c
> >>>>>>>> +++ b/hw/i386/pc.c
> >>>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>                                            uint64_t pci_hole64_size)
> >>>>>>>>  {
> >>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>> -    uint32_t eax, vendor[3];
> >>>>>>>>
> >>>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>>>>>          return;
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>>>>>> --- a/include/qemu/osdep.h
> >>>>>>>> +++ b/include/qemu/osdep.h
> >>>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>>>>>   */
> >>>>>>>>  size_t qemu_get_host_physmem(void);
> >>>>>>>>
> >>>>>>>> +/**
> >>>>>>>> + * qemu_amd_iommu_is_present:
> >>>>>>>> + *
> >>>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>>>>>> + * is present.
> >>>>>>>> + *
> >>>>>>>> + */
> >>>>>>>> +bool qemu_amd_iommu_is_present(void);
> >>>>>>>> +
> >>>>>>>>  /*
> >>>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>>>>>   * for the current thread.
> >>>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>>>>> index f2be7321c59f..54cef21217c4 100644
> >>>>>>>> --- a/util/oslib-posix.c
> >>>>>>>> +++ b/util/oslib-posix.c
> >>>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>>>>>  #endif
> >>>>>>>>      return 0;
> >>>>>>>>  }
> >>>>>>>> +
> >>>>>>>> +bool qemu_amd_iommu_is_present(void)
> >>>>>>>> +{
> >>>>>>>> +    bool found = false;
> >>>>>>>> +#ifdef CONFIG_LINUX
> >>>>>>>> +    struct dirent *entry;
> >>>>>>>> +    char *path;
> >>>>>>>> +    DIR *dir;
> >>>>>>>> +
> >>>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>>>>>> +    dir = opendir(path);
> >>>>>>>> +    if (!dir) {
> >>>>>>>> +        g_free(path);
> >>>>>>>> +        return found;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    do {
> >>>>>>>> +            entry = readdir(dir);
> >>>>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>>>>>> +                found = true;
> >>>>>>>> +                break;
> >>>>>>>> +            }
> >>>>>>>> +    } while (entry);
> >>>>>>>> +
> >>>>>>>> +    g_free(path);
> >>>>>>>> +    closedir(dir);
> >>>>>>>> +#endif
> >>>>>>>> +    return found;
> >>>>>>>> +}  
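
For reference, the same sysfs check could also be written with GLib's directory
helpers, which QEMU uses heavily elsewhere. This is only a sketch of an
equivalent approach, not part of the posted patch, and the function name is
illustrative:

    /* Sketch: detect an AMD IOMMU by looking for ivhd* entries in sysfs. */
    #include <glib.h>
    #include <stdbool.h>

    static bool amd_iommu_present_sketch(void)
    {
        bool found = false;
    #ifdef CONFIG_LINUX
        GDir *dir = g_dir_open("/sys/class/iommu", 0, NULL);
        const char *name;

        if (!dir) {
            return false;
        }
        /* AMD-Vi units are exposed as ivhd<N> children of /sys/class/iommu */
        while ((name = g_dir_read_name(dir)) != NULL) {
            if (g_str_has_prefix(name, "ivhd")) {
                found = true;
                break;
            }
        }
        g_dir_close(dir);
    #endif
        return found;
    }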
> >>>>>>>
> >>>>>>> why are we checking whether an AMD IOMMU is present on the host?
> >>>>>>> Isn't what we care about whether it's present in the VM?
> >>>>>>> What we are reserving, after all, is a range of GPAs, not actual host PAs ...
> >>>>>>>  
> >>>>>> Right.
> >>>>>>
> >>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >>>>>> so we need to avoid mapping that portion of the address space entirely
> >>>>>> and use some other portion of the GPA space. This has to do with the
> >>>>>> host IOMMU, which is where the DMA mapping of guest PAs takes place.
> >>>>>> So, if you do not have a host IOMMU, you can't service guest DMA/PCIe
> >>>>>> via VFIO through the host IOMMU, and therefore you don't have this problem.
> >>>>>>
> >>>>>> Whether one uses a guest IOMMU or not does not affect the result;
> >>>>>> that would be more a case of mimicking real hardware than of fixing
> >>>>>> the issue this series addresses.
> >>>>>
> >>>>>
> >>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> >>>>> And ideally move the logic reporting reserved ranges there too.
> >>>>> However, I think vdpa has the same issue too.
> >>>>> CC Jason for an opinion.  
> >>>>
> >>>> It does have the same problem.
> >>>>
> >>>> This is not specific to VFIO, it's to anything that uses the iommu.  
> >>>
> >>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> >>> that we don't have a global "enable-vfio" flag since vfio devices
> >>> can be hot-plugged. I think this is not the first time we wanted
> >>> something like this, right Alex?


I would very much NOT like to see such a flag ever existing.


> >>>> It's just that VFIO doesn't let you shoot yourself in the foot :)
> >>>>
> >>>> vDPA would need to validate IOVA ranges as well, to prevent mapping on
> >>>> reserved IOVAs, similar to commit 9b77e5c7984 ("vfio/type1: check dma
> >>>> map request is within a valid iova range"). Now you could argue that
> >>>> it's an IOMMU core problem, maybe.
> >>>>
> >>>> But I see this as a separate problem, because regardless of said
> >>>> validation, your guest provisioning is still going to fail for guests
> >>>> with >=1010G of max GPA, and even if it doesn't fail you shouldn't be
> >>>> letting it DMA map those ranges in the first place (e.g. you could see
> >>>> spurious INVALID_DEVICE_REQUEST fault events from the IOMMU if DMA is
> >>>> attempted over the first megabyte of that 1T hole).
> >>>>  
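
The validation mentioned above amounts to rejecting any DMA map request that
overlaps a reserved IOVA range such as the HT hole. A rough sketch of that
check, with hypothetical type and function names (the actual vfio/type1 code
tracks the list of valid ranges instead):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct ReservedRange {
        uint64_t start;
        uint64_t end;    /* inclusive */
    } ReservedRange;

    /* Return true if [iova, iova + size) avoids every reserved range. */
    static bool dma_map_request_is_valid(const ReservedRange *resv, size_t nr,
                                         uint64_t iova, uint64_t size)
    {
        uint64_t last = iova + size - 1;

        for (size_t i = 0; i < nr; i++) {
            /* e.g. the AMD HT hole at 1012G-1024G would be one such range */
            if (iova <= resv[i].end && last >= resv[i].start) {
                return false;
            }
        }
        return true;
    }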
> >>>
> >>> I wonder what's the status on a system without an IOMMU.
> >>>  
> >> In theory you should be OK. Also it's worth keeping in mind that AMD machines
> >> with >=1T have this 12G hole marked as Reserved, so even DMA, at least when
> >> the CPU is the initiator, should be fine in the worst-case scenario.
> > 
> > Not sure I understand this last sentence.
> >   
> I was trying to say that the E820 from hardware is marked as Reserved and any DMA
> from/to endpoints from kernel-supplied CPU addresses will use those reserved ranges.
> 
> >>>>> Also, some concerns
> >>>>> - I think this changes memory layout for working VMs not using VFIO.
> >>>>>   Better preserve the old layout for old machine types?
> >>>>>  
> >>>> Oh definitely, and I do that in this series. See the last commit, and
> >>>> in the past it was also a machine-property exposed to userspace.
> >>>> Otherwise I would be breaking (badly) compat/migration.
> >>>>
> >>>> I would like to emphasize that these memory layout changes are *exclusive*
> >>>> and *only* for hosts on AMD with guest memory bigger than ~950G-~1010G.
> >>>> It's not every guest, but a fairly limited subset of big-memory guests
> >>>> that are not the norm.
> >>>>
> >>>> That said, the phys-bits property errors might give us a bit more trouble,
> >>>> so it might help to be more conservative.
> >>>>  
> >>>>> - You mention the phys bits issue very briefly, and it's pretty
> >>>>>   concerning.  Do we maybe want to also disable the workaround if phys
> >>>>>   bits is too small?
> >>>>
> >>>> We are doing that here (well, in v4), as suggested by Igor. Note that in
> >>>> this series it's a warning, but in v4 I have it as an error, along with the
> >>>> 32-bit issues that I found through qtest.
> >>>>
> >>>> I share the same concern as you over making this an error because of
> >>>> compatibility. Perhaps we could go back to the previous version, turn it
> >>>> back into a warning, and additionally even disable the relocation altogether.
> >>>> Or, if desired, even make it depend on a machine property to enable it.
> >>>
> >>> I would be inclined to disable the relocation. And maybe block vfio? I'm
> >>> not 100% sure about that last one, but this can be a vfio decision to
> >>> make.
> >>>  
> >> I don't see this as a VFIO problem (VFIO is actually being a good citizen
> >> and doing the right thing :)). From my perspective this fix is also useful
> >> to vDPA (which we also care about), and thus will also let us avoid DMA
> >> mapping in these GPA ranges.
> >>
> >> The reason I mention VFIO in the cover letter is that it's one visible UAPI
> >> change where users will notice that suddenly their 1TB+ VMs break.
> > 
> > What I mean is that most VMs don't use either VFIO or VDPA.
> > If we had VFIO/VDPA as an accelerator that needs to be listed
> > upfront when qemu is started, we could solve this with
> > a bit less risk by not changing anything for VMs without these two.
> >   
> That wouldn't work for the vfio/vdpa hotplug case (which we do use),
> and that's part of the reason I chose to always avoid the 1TB hole. VFIO even
> tells you what the allowed IOVA ranges are when you create the container.
> 
> The machine property, though, could communicate /the intent/ to add
> any VFIO or vDPA devices in the future and maybe cover for that. That
> lets one avoid any 'accelerator-only' problems and restricts the issues
> of this series to VMs with VFIO/VDPA, if the risk is a concern.
> 
> > Alex what do you think about adding this?
> > 
> > Given we do not have such a thing right now, one can get
> > into a situation where phys bits limit the CPU, then
> > more memory is supplied than would fit above the reserved region, then
> > we stick the memory over the reserved region, and finally
> > a vfio device is added.
> >   
> The current code considers the maximum possible address, taking into account
> memory hotplug, the PCI hole64, etc. So we would need to worry about whether
> VFIO or VDPA is going to be hotplugged at some point in the future during the
> guest lifecycle, to decide whether to alter the memory layout at guest
> provisioning.
> 
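
To make the trade-off above concrete: the series decides at machine-build time,
from the highest GPA the guest could ever use (RAM above 4G, the hotplug area,
the PCI hole64), whether ram-above-4g must be relocated to 1T. The sketch below
illustrates the idea only; the constants and function names are illustrative,
not the actual QEMU code:

    #include <stdbool.h>
    #include <stdint.h>

    #define AMD_HT_HOLE_START  0xfd00000000ULL   /* 1012 GiB, start of the HT hole */
    #define ABOVE_1T_START     0x10000000000ULL  /* 1 TiB, relocation target */

    /* Highest GPA the guest can use, considering all configured regions. */
    static uint64_t max_used_gpa(uint64_t above_4g_start, uint64_t above_4g_size,
                                 uint64_t hotplug_size, uint64_t pci_hole64_size)
    {
        return above_4g_start + above_4g_size + hotplug_size + pci_hole64_size - 1;
    }

    /* Relocate ram-above-4g to 1 TiB only when the layout would reach the hole. */
    static bool should_relocate_above_1t(uint64_t max_gpa)
    {
        return max_gpa >= AMD_HT_HOLE_START;
    }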
> > In this last configuration, should the vfio device add fail?
> > Or just warn and try to continue? I think we can code this last
> > decision as part of the vfio code and then Alex will get to decide ;)
> > And yes, a similar thing with vdpa.
> >   
> 
> Of all those cases I feel the machine property is better, and more flexible,
> than having VFIO/VDPA deal with a bad memory layout and discovering at a late
> stage that the user is doing something wrong (and thus failing the DMA_MAP
> operation for those who do check for invalid IOVAs).

The trouble is that anything we can glean from the host system where we
instantiate the VM is mostly meaningless relative to data center
orchestration.  We're relatively insulated from these sorts of issues
on x86 (apparently aside from this case); AIUI ARM is even worse about
having arbitrary reserved ranges within its IOVA space.

For a comprehensive solution, it's not a machine accelerator property
or enable such-and-such functionality flag, it's the ability to specify
a VM memory map on the QEMU command line and data center orchestration
tools gaining insight across all their systems to specify a memory
layout that can work regardless of how a VM might be migrated.  Maybe
there's a "host" option to that memory map command line option that
would take into account the common case of a static host or at least a
homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
to generate a least common denominator set such that the VM is
compatible to any host in the cluster.

On the device end, I really would prefer not to see device-driver-specific
enables where we simply cannot hot-add a device of the given type without a
pre-enabled VM.  Give the user visibility into and configurability of the
issue and simply fail the device add (ideally
with a useful error message) if the device IOVA space cannot support
the VM memory layout (this is what vfio already does afaik).

When we have iommufd support common for vfio and vdpa, hopefully we'll
also be able to recommend a common means for learning about system and
IOMMU restrictions to IOVA spaces.  For now we have reserved_regions
reported in sysfs per IOMMU group:

 $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u | grep -v direct-relaxable

Thanks,

Alex
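
As a supplement to the grep above: each reserved_regions file is a set of
"start end type" lines, so a userspace consumer could parse it along these
lines. This is a hedged sketch with an illustrative function name and minimal
error handling:

    #include <glib.h>
    #include <inttypes.h>
    #include <stdio.h>

    /* Print the reserved IOVA ranges of one IOMMU group,
     * e.g. /sys/kernel/iommu_groups/0 */
    static void dump_reserved_regions(const char *group_path)
    {
        g_autofree char *path = g_strdup_printf("%s/reserved_regions", group_path);
        char line[256];
        FILE *fp = fopen(path, "r");

        if (!fp) {
            return;
        }
        while (fgets(line, sizeof(line), fp)) {
            uint64_t start, end;
            char type[64];

            /* each line looks like "0x00000000fee00000 0x00000000feefffff msi" */
            if (sscanf(line, "%" SCNx64 " %" SCNx64 " %63s", &start, &end, type) == 3) {
                printf("reserved: 0x%" PRIx64 "-0x%" PRIx64 " (%s)\n",
                       start, end, type);
            }
        }
        fclose(fp);
    }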