Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI

From: David Hildenbrand
Subject: Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI
Date: Wed, 17 Nov 2021 19:08:28 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0

On 17.11.21 15:30, Jonathan Cameron wrote:
> On Tue, 16 Nov 2021 12:11:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>> like that yet, but it might just be a matter of time).  
>>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT
>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>> hotplug memory.

Hi Jonathan,

> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
> to Generic Initiator Affinity.  So no good for memory. This is meant for
> representation of accelerators / network cards etc so you can get the NUMA
> characteristics for them accessing Memory in other nodes.
> My understanding of 'traditional' memory hotplug is that typically the
> PA into which memory is hotplugged is known at boot time whether or not
> the memory is physically present.  As such, you present that in SRAT and rely
> on the EFI memory map / other information sources to know the memory isn't
> there.  When it is hotplugged later the address is looked up in SRAT to
> identify the NUMA node.

in virtualized environments we use the SRAT only to indicate the hotpluggable
region (-> indicate maximum possible PFN to the guest OS), the actual present
memory+PXM assignment is not done via SRAT. We differ quite a lot here from
actual hardware I think.

> That model is less useful for more flexible entities like virtio-mem or
> indeed physical hardware such as CXL type 3 memory devices which typically
> need their own nodes.
> For the CXL type 3 option, the current proposal is to use the CXL table entries
> representing Physical Address space regions to work out how many NUMA nodes
> are needed and just create extra ones at boot.
> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
> It's a heuristic as we might need more nodes to represent things well kernel
> side, but it's better than nothing and less effort than true dynamic node
> creation.
> If you chase through the earlier versions of Alison's patch you will find some
> discussion of that.
> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
> That would make all this stuff discoverable via PCI config space rather than 
> CDAT is at:
> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> (nothing stops others using it though AFAIK).
> However, then we'd actually need either dynamic node creation in the OS, or
> some sort of reserved pool of extra nodes.  Long term it may be the most
> flexible option.

I think for virtio-mem it's actually a bit simpler:

a) The user defined an empty node on the QEMU cmdline
b) The user assigned a virtio-mem device to a node, either when 
   coldplugging or hotplugging the device.

So we don't actually "hotplug" a new node, the (possible) node is already known
to QEMU right when starting up. It's just a matter of exposing that fact to the
guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
It seems to boil down to an ACPI limitation.

Conceptually, virtio-mem on an empty node in QEMU is not that different from
hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
an empty node. But I guess it all just doesn't work with QEMU as of now.

In current x86-64 code, we define the "hotpluggable region" in 
hw/i386/acpi-build.c via

        build_srat_memory(table_data, machine->device_memory->base,
                          hotpluggable_address_space_size, nb_numa_nodes - 1,

So we tell the guest OS "this range is hotpluggable" and "it belongs to
this node unless the device says something different". From both values we
can -- when under QEMU -- conclude the maximum possible PFN and the maximum
possible node. But the latter is not what Linux does: it simply maps the last
NUMA node (indicated in the memory entry) to a PXM
(-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).
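To make the "conclude the maximum possible PFN and the maximum possible node"
part concrete, here is a minimal standalone sketch. It is neither QEMU nor
Linux code: the struct, the helper names and the 4 KiB page size are
assumptions made for illustration, and only the flag bit positions are meant
to mirror the SRAT memory affinity flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; bit positions modelled on the SRAT memory affinity flags. */
#define SKETCH_MEM_ENABLED      (1u << 0)
#define SKETCH_MEM_HOTPLUGGABLE (1u << 1)
#define SKETCH_PAGE_SHIFT       12 /* assumes 4 KiB pages */

/* Simplified stand-in for one SRAT memory affinity entry. */
struct sketch_mem_affinity {
    uint64_t base;   /* start of the physical address range */
    uint64_t size;   /* length of the range in bytes */
    uint32_t node;   /* proximity domain the range is assigned to */
    uint32_t flags;  /* SKETCH_MEM_* bits */
};

/* Highest PFN covered by any hotpluggable entry, or 0 if there is none. */
static uint64_t sketch_max_possible_pfn(const struct sketch_mem_affinity *e,
                                        size_t n)
{
    uint64_t max_pfn = 0;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!(e[i].flags & SKETCH_MEM_HOTPLUGGABLE) || e[i].size == 0) {
            continue;
        }
        uint64_t last_pfn = (e[i].base + e[i].size - 1) >> SKETCH_PAGE_SHIFT;
        if (last_pfn > max_pfn) {
            max_pfn = last_pfn;
        }
    }
    return max_pfn;
}

/* Highest node named by any hotpluggable entry, or -1 if there is none. */
static int64_t sketch_max_possible_node(const struct sketch_mem_affinity *e,
                                        size_t n)
{
    int64_t max_node = -1;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((e[i].flags & SKETCH_MEM_HOTPLUGGABLE) &&
            (int64_t)e[i].node > max_node) {
            max_node = e[i].node;
        }
    }
    return max_node;
}
```

With a single hotpluggable entry assigned to the last node, both values fall
out of that one entry, which is the inference a guest could make when it
knows it runs under QEMU.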

I do wonder if we could simply expose the same hotpluggable range via multiple 

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index a3ad6abd33..6c0ab442ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2084,6 +2084,22 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
      * providing _PXM method if necessary.
      */
     if (hotpluggable_address_space_size) {
+        /*
+         * For the guest to "know" about possible nodes, we'll indicate the
+         * same hotpluggable region to all empty nodes.
+         */
+        for (i = 0; i < nb_numa_nodes - 1; i++) {
+            if (machine->numa_state->nodes[i].node_mem > 0) {
+                continue;
+            }
+            build_srat_memory(table_data, machine->device_memory->base,
+                              hotpluggable_address_space_size, i,
+                              MEM_AFFINITY_HOTPLUGGABLE | 
+        }
+        /*
+         * Historically, we always indicated all hotpluggable memory to the
+         * last node -- if it was empty or not.
+         */
         build_srat_memory(table_data, machine->device_memory->base,
                           hotpluggable_address_space_size, nb_numa_nodes - 1,

Of course, this won't make CPU hotplug to empty nodes happy if we don't have
memory hotplug enabled for a VM. I did not check in detail if that is valid
according to ACPI -- Linux might eat it (did not try yet, though).
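For reference, the node selection in the loop above can be modelled outside
of QEMU roughly as follows. This is only a sketch under assumptions: the
function name and the plain node_mem array are made up for illustration, and
it only computes which nodes would get an entry for the hotpluggable range
instead of building any ACPI table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: collect the node IDs that would get a memory
 * affinity entry covering the hotpluggable range.
 *
 * node_mem[i] is the boot memory size of node i. Every empty node except
 * the last one gets an entry, and the last node always gets one, whether
 * it is empty or not. 'out' must have room for nb_nodes IDs.
 * Returns the number of IDs written.
 */
static size_t sketch_hotplug_entry_nodes(const uint64_t *node_mem,
                                         size_t nb_nodes, uint32_t *out)
{
    size_t count = 0;

    if (nb_nodes == 0) {
        return 0;
    }
    assert(node_mem != NULL && out != NULL);

    /* Empty nodes before the last one. */
    for (size_t i = 0; i + 1 < nb_nodes; i++) {
        if (node_mem[i] > 0) {
            continue;
        }
        out[count++] = (uint32_t)i;
    }

    /* The last node, matching the historical single entry. */
    out[count++] = (uint32_t)(nb_nodes - 1);
    return count;
}
```

For four nodes where only nodes 1 and 2 are empty, this yields entries for
nodes 1, 2 and 3, so the same range would be described three times.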


David / dhildenb
