Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable


From: Michael S. Tsirkin
Subject: Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
Date: Thu, 24 Feb 2022 15:12:39 -0500

On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
> 
> 
> On 2/24/22 19:54, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
> >> On 2/24/22 18:30, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> >>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> >>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> >>>>>> On 2/23/22 23:35, Joao Martins wrote:
> >>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>>>> +{
> >>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>>>> +
> >>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>>> +        return;
> >>>>>>>>> +    }
> >>>>>>>>
> >>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>>>> It's really about what we present to the guest though,
> >>>>>>>> isn't it?
> >>>>>>>
> >>>>>>> Using cpuid was the easier catch-all, without going into Linux UAPI
> >>>>>>> specifics. But it doesn't have to tie in there; it is only relevant
> >>>>>>> for systems with an IOMMU present.
> >>>>>>>
> >>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>>>
> >>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
> >>>>>>> in util/vfio-helpers.c which checks whether there are any sysfs child
> >>>>>>> entries starting with ivhd in /sys/class/iommu/. Given that this HT
> >>>>>>> region is hardcoded in the iommu reserved regions since >=4.11 (to
> >>>>>>> latest), I don't think it's even worth checking that the range exists in:
> >>>>>>>
> >>>>>>>       /sys/kernel/iommu_groups/0/reserved_regions
> >>>>>>>
> >>>>>>> (Also that sysfs ABI is >= 4.11 only)
> >>>>>>
> >>>>>> Here's what I have staged in local tree, to address your comment.
> >>>>>>
> >>>>>> Naturally, the first chunk is what's affected by this patch; the rest is
> >>>>>> a predecessor patch to introduce qemu_amd_iommu_is_present(). It seems to
> >>>>>> pass all the tests and whatnot.
> >>>>>>
> >>>>>> I am not entirely sure this is the right place to put such a helper, so
> >>>>>> I'm open to suggestions. As for the naming of the helper, I tried to
> >>>>>> follow the rest of the file's style.
> >>>>>>
> >>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>>>> --- a/hw/i386/pc.c
> >>>>>> +++ b/hw/i386/pc.c
> >>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>                                            uint64_t pci_hole64_size)
> >>>>>>  {
> >>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>> -    uint32_t eax, vendor[3];
> >>>>>>
> >>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>>>          return;
> >>>>>>      }
> >>>>>>
> >>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>>>> --- a/include/qemu/osdep.h
> >>>>>> +++ b/include/qemu/osdep.h
> >>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>>>   */
> >>>>>>  size_t qemu_get_host_physmem(void);
> >>>>>>
> >>>>>> +/**
> >>>>>> + * qemu_amd_iommu_is_present:
> >>>>>> + *
> >>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>>>> + * is present.
> >>>>>> + *
> >>>>>> + */
> >>>>>> +bool qemu_amd_iommu_is_present(void);
> >>>>>> +
> >>>>>>  /*
> >>>>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>>>   * for the current thread.
> >>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>>> index f2be7321c59f..54cef21217c4 100644
> >>>>>> --- a/util/oslib-posix.c
> >>>>>> +++ b/util/oslib-posix.c
> >>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>>>  #endif
> >>>>>>      return 0;
> >>>>>>  }
> >>>>>> +
> >>>>>> +bool qemu_amd_iommu_is_present(void)
> >>>>>> +{
> >>>>>> +    bool found = false;
> >>>>>> +#ifdef CONFIG_LINUX
> >>>>>> +    struct dirent *entry;
> >>>>>> +    char *path;
> >>>>>> +    DIR *dir;
> >>>>>> +
> >>>>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>>>> +    dir = opendir(path);
> >>>>>> +    if (!dir) {
> >>>>>> +        g_free(path);
> >>>>>> +        return found;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    do {
> >>>>>> +            entry = readdir(dir);
> >>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>>>> +                found = true;
> >>>>>> +                break;
> >>>>>> +            }
> >>>>>> +    } while (entry);
> >>>>>> +
> >>>>>> +    g_free(path);
> >>>>>> +    closedir(dir);
> >>>>>> +#endif
> >>>>>> +    return found;
> >>>>>> +}
> >>>>>
> >>>>> why are we checking whether an AMD IOMMU is present
> >>>>> on the host? 
> >>>>> Isn't what we care about whether it's
> >>>>> present in the VM? What we are reserving after all
> >>>>> is a range of GPAs, not actual host PA's ...
> >>>>>
> >>>> Right.
> >>>>
> >>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >>>> so we need to avoid mapping that portion of the address space entirely
> >>>> and use some other portion of the GPA space. This has to do with the
> >>>> host IOMMU, which is where the DMA mapping of guest PAs takes place.
> >>>> So, if you do not have a host IOMMU, you can't service guest DMA/PCIe
> >>>> via VFIO through the host IOMMU, and therefore you don't have this
> >>>> problem.
> >>>>
> >>>> Whether one uses a guest IOMMU or not does not affect the result;
> >>>> that would be more a case of mimicking real hardware than of fixing
> >>>> the issue this series addresses.
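
For concreteness: the GPA range in question is, to my understanding, the AMD
HyperTransport reserved window just below 1T. The constants and helper below
are purely illustrative, not something taken from the patch:

#include <stdint.h>
#include <stdbool.h>

#define AMD_HT_START    0xFD00000000ULL   /* 1012 GiB */
#define AMD_HT_END      0xFFFFFFFFFFULL   /* 1 TiB - 1 */
#define ABOVE_1T_START  0x10000000000ULL  /* 1 TiB: where above-4G RAM restarts */

/* A guest RAM range [start, start + size) must not intersect the HT window,
 * since the host IOMMU refuses DMA mappings inside it. */
static inline bool gpa_range_hits_ht_hole(uint64_t start, uint64_t size)
{
    return size != 0 && start <= AMD_HT_END &&
           start + size - 1 >= AMD_HT_START;
}

Hence the series moves the start of the above-4G RAM region up to 1T when the
guest would otherwise spill into that window.
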
> >>>
> >>>
> >>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> >>> And ideally move the logic reporting reserved ranges there too.
> >>> However, I think vdpa has the same issue too.
> >>> CC Jason for an opinion.
> >>
> >> It does have the same problem.
> >>
> >> This is not specific to VFIO; it applies to anything that uses the IOMMU.
> > 
> > Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> > that we don't have a global "enable-vfio" flag since vfio devices
> > can be hot-plugged. I think this is not the first time we wanted
> > something like this, right Alex?
> > 
> >> It's just that VFIO doesn't let you shoot yourself in the foot :)
> >>
> >> vDPA would need to validate iova ranges as well to prevent mapping on
> >> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
> >> map request is within a valid iova range"). Now you could argue that
> >> it's an IOMMU core problem, maybe.
> >>
> >> But I see this as a separate problem, because regardless of said
> >> validation, your guest provisioning is still going to fail for guests
> >> with >=1010G of max GPA, and even if it doesn't fail you shouldn't be
> >> letting the guest DMA map those ranges in the first place (e.g. you
> >> could see spurious INVALID_DEVICE_REQUEST fault events from the IOMMU
> >> if DMA is attempted over the first megabyte of that 1T hole).
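
The gist of that kind of validation, sketched here in plain C rather than the
actual code from that commit, is to reject a DMA map request unless it falls
entirely within one of the allowed IOVA ranges:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct iova_range {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/* Sketch only: accept [iova, iova + size) iff it sits entirely inside one
 * of the valid (non-reserved) IOVA ranges advertised by the host IOMMU. */
static bool dma_map_in_valid_range(const struct iova_range *valid, size_t n,
                                   uint64_t iova, uint64_t size)
{
    uint64_t last = iova + size - 1;
    size_t i;

    if (size == 0 || last < iova) {     /* empty or wrapping request */
        return false;
    }
    for (i = 0; i < n; i++) {
        if (iova >= valid[i].start && last <= valid[i].end) {
            return true;
        }
    }
    return false;
}
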
> > 
> > I wonder what's the status on a system without an IOMMU.
> > 
> In theory you should be OK. Also, it's worth keeping in mind that AMD machines
> with >=1T have this 12G hole marked as Reserved, so even DMA, at least when the
> CPU is the initiator, should be fine in the worst-case scenario.

Not sure I understand this last sentence.

> >>> Also, some concerns
> >>> - I think this changes memory layout for working VMs not using VFIO.
> >>>   Better preserve the old layout for old machine types?
> >>>
> >> Oh definitely, and I do that in this series. See the last commit, and
> >> in the past it was also a machine-property exposed to userspace.
> >> Otherwise I would be breaking (badly) compat/migration.
> >>
> >> I would like to emphasize that these memory layout changes apply *exclusively*
> >> to hosts on AMD with guest memory bigger than ~950G-~1010G. It's not every
> >> guest, but a fairly limited subset of big-memory guests that are not the norm.
> >>
> >> That said, the phys-bits property errors might give us a bit more trouble,
> >> so it might help to be more conservative.
> >>
> >>> - You mention the phys bits issue very briefly, and it's pretty
> >>>   concerning. Do we maybe want to also disable the workaround if phys
> >>>   bits is too small?
> >>
> >> We are doing that here (well, in v4), as suggested by Igor. Note that in
> >> this series it's a warning, but in v4 I have it as an error, along with
> >> the 32-bit issues that I found through qtest.
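
To make the phys-bits constraint concrete (my reading, with illustrative
names): once above-4G RAM is relocated to start at 1T, the highest guest RAM
address must still fit below 1 << phys-bits, so phys-bits=40 leaves no room
at all above 1T.

#include <stdint.h>
#include <stdbool.h>

/* Sketch of the phys-bits sanity check; names are illustrative only. */
static bool max_gpa_fits_phys_bits(unsigned phys_bits, uint64_t max_used_gpa)
{
    uint64_t max_phys_addr = (1ULL << phys_bits) - 1;

    /* phys-bits = 40 gives max_phys_addr = 0xFF_FFFF_FFFF (1 TiB - 1),
     * so RAM relocated to start at 1 TiB is already out of reach. */
    return max_used_gpa <= max_phys_addr;
}
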
> >>
> >> I share the same concern as you over making this an error, because of
> >> compatibility. Perhaps we could go back to the previous version, turn it
> >> back into a warning, and additionally even disable the relocation
> >> altogether. Or, if desired, even depend on a machine property to enable it.
> > 
> > I would be inclined to disable the relocation. And maybe block vfio? I'm
> > not 100% sure about that last one, but this can be a vfio decision to
> > make.
> > 
> I don't see this as a VFIO problem (VFIO is actually being a good citizen and
> doing the right thing :)). From my perspective this fix is also useful to vDPA
> (which we also care about), and thus will also let us avoid DMA mapping in
> these GPA ranges.
> 
> The reason I mention VFIO in the cover letter is that it's the one visible
> UAPI where users will notice that suddenly their 1TB+ VMs break.

What I mean is that most VMs don't use either VFIO or VDPA.
If we had VFIO/VDPA as an accelerator that needs to be listed
upfront when qemu is started, we could solve this with a bit
less risk by not changing anything for VMs without these two.

Alex what do you think about adding this?

Given we do not have such a thing right now, one can get
into a situation where phys bits limit the CPU, more memory
is supplied than would fit above the reserved region, so we
stick the memory over the reserved region, and finally a
vfio device is added.

In this last configuration, should the vfio device add fail?
Or just warn and try to continue? I think we can code this last
decision as part of the vfio code and then Alex will get to decide ;)
And yes, a similar thing applies to vdpa.
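
A rough sketch of that decision point, in plain C with invented names (in
QEMU proper this would sit somewhere on the vfio/vdpa device-add path and
use the usual Error/warn_report machinery rather than stderr):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <inttypes.h>

/* Hypothetical policy check: fail or warn when a passthrough device is added
 * to a machine whose RAM already sits over the IOMMU reserved window. */
static bool passthrough_device_add_allowed(uint64_t max_used_gpa,
                                           uint64_t resv_start,
                                           uint64_t resv_end,
                                           bool strict)
{
    if (max_used_gpa < resv_start) {
        return true;    /* guest RAM never reaches the reserved window */
    }
    if (strict) {
        fprintf(stderr, "error: guest RAM overlaps IOMMU reserved range "
                "[0x%" PRIx64 ", 0x%" PRIx64 "]\n", resv_start, resv_end);
        return false;   /* fail the device add */
    }
    fprintf(stderr, "warning: guest RAM overlaps IOMMU reserved range "
            "[0x%" PRIx64 ", 0x%" PRIx64 "]; DMA maps there will fail\n",
            resv_start, resv_end);
    return true;        /* warn and continue */
}
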


> >>> Also, we'd need to test a bunch of old
> >>>   guests to see what happens. Which guests were tested? 
> >>>
> >> Do you envision that old guests would run into (or care about) this sort of
> >> 1TB config? I have mainly been testing more modern Linux guests (>= 4.14)
> >> over this last set of versions, and in the past Windows 2012+. Let me be
> >> extra extra pedantic on this part for the next submission (and report back
> >> if something odd happens).
> > 
> > Not 100% sure, but let's at least document it.
> > 



