Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable


From: Joao Martins
Subject: Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
Date: Thu, 24 Feb 2022 19:44:26 +0000

On 2/24/22 18:30, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>> +{
>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>> +
>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>
>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>> It's really about what we present to the guest though,
>>>>>> isn't it?
>>>>>
>>>>> Using cpuid was the easier catch-all, without going into
>>>>> Linux UAPI specifics. But it doesn't have to tie in there; it is only
>>>>> relevant for systems with an IOMMU present.
>>>>>
>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>
>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
>>>>> in util/vfio-helpers.c which checks if there are any sysfs child entries
>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest), I don't
>>>>> think it's even worth checking that the range exists in:
>>>>>
>>>>>   /sys/kernel/iommu_groups/0/reserved_regions
>>>>>
>>>>> (Also that sysfs ABI is >= 4.11 only)
>>>>
>>>> Here's what I have staged in local tree, to address your comment.
>>>>
>>>> Naturally the first chunk is what's affected by this patch; the rest is a
>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>> all the tests and whatnot.
>>>>
>>>> I am not entirely sure this is the right place to put such a helper; open
>>>> to suggestions. As for the naming of the helper, I tried to follow the
>>>> rest of the file's style.
>>>>
>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>> index a9be5d33a291..2ea4430d5dcc 100644
>>>> --- a/hw/i386/pc.c
>>>> +++ b/hw/i386/pc.c
>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>                                            uint64_t pci_hole64_size)
>>>>  {
>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> -    uint32_t eax, vendor[3];
>>>>
>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>> -    if (!IS_AMD_VENDOR(vendor)) {
>>>> +    if (!qemu_amd_iommu_is_present()) {
>>>>          return;
>>>>      }
>>>>
>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
>>>> --- a/include/qemu/osdep.h
>>>> +++ b/include/qemu/osdep.h
>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>>>>   */
>>>>  size_t qemu_get_host_physmem(void);
>>>>
>>>> +/**
>>>> + * qemu_amd_iommu_is_present:
>>>> + *
>>>> + * Operating system agnostic way of querying if an AMD IOMMU
>>>> + * is present.
>>>> + *
>>>> + */
>>>> +bool qemu_amd_iommu_is_present(void);
>>>> +
>>>>  /*
>>>>   * Toggle write/execute on the pages marked MAP_JIT
>>>>   * for the current thread.
>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>> index f2be7321c59f..54cef21217c4 100644
>>>> --- a/util/oslib-posix.c
>>>> +++ b/util/oslib-posix.c
>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>>>>  #endif
>>>>      return 0;
>>>>  }
>>>> +
>>>> +bool qemu_amd_iommu_is_present(void)
>>>> +{
>>>> +    bool found = false;
>>>> +#ifdef CONFIG_LINUX
>>>> +    struct dirent *entry;
>>>> +    char *path;
>>>> +    DIR *dir;
>>>> +
>>>> +    path = g_strdup_printf("/sys/class/iommu");
>>>> +    dir = opendir(path);
>>>> +    if (!dir) {
>>>> +        g_free(path);
>>>> +        return found;
>>>> +    }
>>>> +
>>>> +    do {
>>>> +        entry = readdir(dir);
>>>> +        if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
>>>> +            found = true;
>>>> +            break;
>>>> +        }
>>>> +    } while (entry);
>>>> +
>>>> +    g_free(path);
>>>> +    closedir(dir);
>>>> +#endif
>>>> +    return found;
>>>> +}
>>>
>>> Why are we checking whether an AMD IOMMU is present on the host?
>>> Isn't what we care about whether it's present in the VM? What we are
>>> reserving, after all, is a range of GPAs, not actual host PAs ...
>>>
>> Right.
>>
>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>> so we need to avoid mapping that portion of the address space entirely
>> and use some other portion of the GPA space. This has to do with the
>> host IOMMU, which is where the DMA mapping of guest PAs takes place.
>> So if you do not have a host IOMMU, you can't service guest DMA/PCIe
>> via VFIO through the host IOMMU, and therefore you don't have this
>> problem.
>>
>> Whether one uses a guest IOMMU or not does not affect the result; that
>> would be more a case of mimicking real hardware, not fixing the issue
>> this series addresses.
> 
> 
> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> And ideally move the logic reporting reserved ranges there too.
> However, I think vdpa has the same issue too.
> CC Jason for an opinion.

It does have the same problem.

This is not specific to VFIO; it applies to anything that uses the IOMMU.
It's just that VFIO doesn't let you shoot yourself in the foot :)

vDPA would need to validate IOVA ranges as well, to prevent mapping on
reserved IOVAs, similar to commit 9b77e5c7984 ("vfio/type1: check dma
map request is within a valid iova range"). Now you could argue that
it's an IOMMU core problem, maybe.

But I see this as a separate problem, because regardless of said
validation, your guest provisioning is still going to fail for guests
with >=1010G of max GPA, and even if it doesn't fail, you shouldn't be
letting it DMA map those in the first place (e.g. you could see spurious
INVALID_DEVICE_REQUEST fault events from the IOMMU if DMA is attempted
over the first megabyte of that 1T hole).
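
To illustrate the kind of validation I mean -- a minimal standalone
sketch of the check vfio/type1 does since the commit above, with
hypothetical names and types (not actual kernel or QEMU code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct iova_range {
    uint64_t start;
    uint64_t end;       /* inclusive */
};

/* A DMA map request is only accepted if it fits entirely inside one of
 * the valid IOVA windows, i.e. it overlaps no reserved region. */
static bool dma_map_in_valid_range(const struct iova_range *valid,
                                   size_t nr_ranges,
                                   uint64_t iova, uint64_t size)
{
    uint64_t last = iova + size - 1;
    size_t i;

    for (i = 0; i < nr_ranges; i++) {
        if (iova >= valid[i].start && last <= valid[i].end) {
            return true;
        }
    }
    return false;       /* would touch a reserved IOVA: reject the map */
}

vDPA would want an equivalent check before accepting a map request.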

> Also, some concerns
> - I think this changes memory layout for working VMs not using VFIO.
>   Better preserve the old layout for old machine types?
> 
Oh definitely, and I do that in this series. See the last commit; in the
past it was also a machine property exposed to userspace. Otherwise I
would be breaking (badly) compat/migration.
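
For reference, this follows the usual machine-class pattern -- a rough
sketch, with the flag name being illustrative rather than the series'
actual property:

/* New machine types get the relocation; the last released machine type
 * clears the flag in its options function, so existing VMs and their
 * migration streams keep the old layout. */
static void pc_q35_6_2_machine_options(MachineClass *m)
{
    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);

    pc_q35_7_0_machine_options(m);        /* inherit newer defaults */
    pcmc->enforce_amd_1tb_hole = false;   /* compat: keep old layout */
}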

I would like to emphasize that these memory layout changes are *only* for
hosts on AMD with guest memory bigger than ~950G-~1010G. It's not every
guest, but a fairly limited subset of big-memory guests that are not the
norm.
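
Concretely, the decision boils down to something like this -- a sketch
of the logic, with the HT window taken from AMD's documentation (the
last 12 GiB below 1 TiB) and illustrative names:

#define AMD_HT_START    0xfd00000000ULL     /* 1012 GiB */
#define ABOVE_1T_START  (1ULL << 40)        /* 1 TiB */

/* If the >4G RAM region plus the 64-bit PCI hole would reach into the
 * HyperTransport reserved window, start the >4G region at 1 TiB instead
 * of right above 4 GiB. */
static uint64_t sketch_above_4g_start(uint64_t start, uint64_t ram_size,
                                      uint64_t pci_hole64_size)
{
    if (start + ram_size + pci_hole64_size >= AMD_HT_START) {
        return ABOVE_1T_START;
    }
    return start;
}

Guests below that threshold keep the layout they have today.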

That said, the phys-bits property errors might give us a bit more trouble,
so it might help to be more conservative.

> - You mention the phys bits issue very briefly, and it's pretty
>   concerning.  Do we maybe want to also disable the work around if phys
>   bits is too small? 

We are doing that here (well, in v4), as suggested by Igor. Note that in
this series it's a warning, but in v4 I have it as an error, along with
the 32-bit issues that I found through qtest.

I share your concern over making this an error, because of compatibility.
Perhaps we could go back to the previous version, turn it back into a
warning, and additionally even disable the relocation altogether. Or, if
desired, even make it depend on a machine property to enable it.
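
As a sketch of the guard in question (v4 turns the warning into an
error; the helper and names here are illustrative):

#include "qemu/osdep.h"
#include "qapi/error.h"

/* If relocating RAM above the 1T hole produces a max GPA the vCPU's
 * physical address width cannot reach, fail early rather than boot a
 * guest that cannot address part of its own RAM. */
static void validate_phys_bits(uint64_t max_used_gpa,
                               unsigned cpu_phys_bits, Error **errp)
{
    uint64_t max_addressable = (1ULL << cpu_phys_bits) - 1;

    if (max_used_gpa > max_addressable) {
        error_setg(errp, "max used GPA 0x%" PRIx64 " exceeds %u-bit "
                   "physical address space (max 0x%" PRIx64 ")",
                   max_used_gpa, cpu_phys_bits, max_addressable);
    }
}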

> Also, we'd need to test a bunch of old
>   guests to see what happens. Which guests were tested? 
> 
Do you envision that old guests would run into (or care about) this sort
of 1TB config? I've mainly been testing more modern Linux guests (>= 4.14)
over this last set of versions, and in the past Windows 2012+. Let me be
extra pedantic on this part for the next submission (and report back if
something odd happens).


