qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: cxl nvdimm Potential probe ordering issues.


From: Dan Williams
Subject: Re: cxl nvdimm Potential probe ordering issues.
Date: Fri, 20 Jan 2023 09:38:13 -0800

Gregory Price wrote:
> On Thu, Jan 19, 2023 at 03:04:49PM +0000, Jonathan Cameron wrote:
> > Gregory, would you mind checking if
> > cxl_nvb is NULL here...
> > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> > (printk before it is used should work).
> > 
> > Might also be worth checking cxl_nvd and cxl_ds
> > but my guess is cxl_nvb is our problem (it is when I deliberate change
> > the load order).
> > 
> > Jonathan
> > 
> 
> This is exactly the issue.  cxl_nvb is null, the rest appear fine.
> 
> Also, note, that weirdly the non-volatile bridge shows up when launching
> this in volatile mode, but no stack trace appears.
> 
> ¯\_(ツ)_/¯
> 
> After spending way too much time tracing through the current cxl driver
> code, i have only really determined that
> 
> 1) The code is very pmem oriented, and it's unclear to me how the driver
>    as-is differentiates a persistent device from a volatile device. That
>        code path still completely escapes me.  The only differentiating code
>        i see is in the memdev probe path that creates mem#/pmem and mem#/ram

Yes, pmem was the initial focus because it had the most dependency on
the OS to setup vs BIOS, but the ram enabling is at the top of the queue
now.

> 
> 2) The code successfully manages probe, enable, and mount a REAL device
>    - cxl memdev appears (/sys/bus/cxl/devices/mem0)
>        - a dax device appears (/sys/bus/dax/devices/)
>          This happens at boot, which I assume must be bios related

As it stands currently that dax device and the cxl device are not
related since a default dax-device is loaded just based on the presence
of an EFI_MEMORY_SP address range in the address map. With the new ram
enabling that default device will be elided and CXL will register a
dax-device parented by a cxl region.

>        - The memory *does not* auto-online, instead the dax device can be
>          onlined as system-ram *manually* via ndctl and friends

That *manually* part is the problem that needs distro help to solve. It
should be the case that by default all Linux distributions auto-online
all dax-devices. If that happens to online memory that is too slow for
general use, or too high-performance / precious for general purpose use
then the administrator can set policy after the fact. Unfortunately user
policy can not be applied if these memory ranges were onlined by the
kernel at boot , so that's why the kernel policy defaults to not-online.

In other words, there is no guarantee that memory that was assigned to
the general purpose pool at boot can be removed. The only guaranteed
behavior is to never give the memory to the core kernel in the first
instance and always let user policy route the memory.

> 3) The code creates an nvdimm_bridge IFF a CFMW is defined - regardless
>    of the type-3 device configuration (pmem-only or vmem-only)

Correct, the top-level bus code (cxl_acpi) and the endpoint code
(cxl_mem, cxl_port) need to handshake before establishing regions. For
pmem regions the platform needs to claim the availability of a pmem
capable CXL window.

> 
>    # CFMW defined
>    [root@fedora ~]# ls /sys/bus/cxl/devices/
>    decoder0.0  decoder2.0  mem0            port1
>    decoder1.0  endpoint2   nvdimm-bridge0  root0
> 
>    # CFMW not defined
>        [root@fedora ~]# ls /sys/bus/cxl/devices/
>    decoder1.0  decoder2.0  endpoint2  mem0  port1  root0
> 
> 4) As you can see above, multiple decoders are registered.  I'm not sure
>    if that's correct or not, but it does seem odd given there's only one
>        cxl type-3 device.  Odd that decoder0.0 shows up when CFMW is there,
>        but not when it isn't.

CXL windows are modeled as decoders hanging off the the CXL root device
(ACPI0017 on ACPI based platforms). An endpoint decoder can then map a
selection of that window.

> 
>        Note: All these tests have two root ports:
>        -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
>    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
>    -device cxl-rp,id=rp1,bus=cxl.0,chassis=0,port=1,slot=1 \
> 
> 
> Don't know why I haven't thought of this until now, but is the CFMW code
> reporting something odd about what's behind it?  Is it assuming the
> devices are pmem?

No, the cxl_acpi code is just advertising platform decode possibilities
independent of what devices show up. Think of this like the PCI MMIO
space that gets allocated to a root bridge at the beginning of time.
That space may or may not get consumed based on what devices show up
downstream.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]