qemu-devel

Re: cxl nvdimm Potential probe ordering issues.


From: Dan Williams
Subject: Re: cxl nvdimm Potential probe ordering issues.
Date: Fri, 20 Jan 2023 09:26:34 -0800

Jonathan Cameron wrote:
> On Thu, 19 Jan 2023 15:04:49 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Thu, 19 Jan 2023 12:42:44 +0000
> > Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
> > 
> > > On Wed, 18 Jan 2023 14:31:53 -0500
> > > Gregory Price <gregory.price@memverge.com> wrote:
> > >   
> > > > I apparently forgot an intro lol
> > > > 
> > > > I tested the DOE linux branch with the 2023-1-11 QEMU branch in
> > > > volatile, non-volatile, and "legacy" (pre-my-patch) non-volatile modes.
> > > > 
> > > > 1) *In volatile mode, there are no stack traces present (during boot)*
> > > > 
> > > > On Wed, Jan 18, 2023 at 02:22:08PM -0500, Gregory Price wrote:    
> > > > > 
> > > > > 1) No stack traces present
> > > > > 2) Device usage appears to work, but cxl-cli fails to create a
> > > > > region; I haven't checked why yet (also tried ndctl-75, same results)
> > > > > 3) There seems to be some other regression with the cxl_pmem_init
> > > > > routine, because I get a stack trace in this setup regardless of
> > > > > whether I apply the type-3 device commit.
> > > > > 
> > > > > 
> > > > > All tests below with the previously posted DOE linux branch.
> > > > > Base QEMU branch was Jonathan's 2023-1-11
> > > > > 
> > > > > 
> > > > > DOE Branch - 2023-1-11 (HEAD) (all commits)
> > > > > 
> > > > > QEMU Config:
> > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > > > -m 3G,slots=4,maxmem=8G \
> > > > > -smp 4 \
> > > > > -machine type=q35,accel=kvm,cxl=on \
> > > > > -enable-kvm \
> > > > > -nographic \
> > > > > -object memory-backend-ram,id=mem0,size=1G,share=on \
> > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > > -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
> > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > > 
> > > > > Result:  This worked, but cxl-cli could not create a region (will look
> > > > > into this further later).
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > When running with a persistent memory configuration, I'm seeing a
> > > > > kernel stack trace on cxl_pmem_init
> > > > > 
> > > > > Config:
> > > > > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> > > > > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
> > > > > -m 3G,slots=4,maxmem=4G \
> > > > > -smp 4 \
> > > > > -machine type=q35,accel=kvm,cxl=on \
> > > > > -enable-kvm \
> > > > > -nographic \
> > > > > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > > > > -device cxl-rp,port=0,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > > > > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/mem0,size=1G \
> > > > > -object memory-backend-file,id=cxl-lsa0,mem-path=/tmp/lsa0,size=1G \
> > > > > -device cxl-type3,bus=rp0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
> > > > > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G
> > > > > 
> > > > > 
> > > > > [   62.167518] BUG: kernel NULL pointer dereference, address: 
> > > > > 00000000000004c0
> > > > > [   62.185069] #PF: supervisor read access in kernel mode
> > > > > [   62.198502] #PF: error_code(0x0000) - not-present page
> > > > > [   62.211019] PGD 0 P4D 0
> > > > > [   62.220521] Oops: 0000 [#1] PREEMPT SMP PTI
> > > > > [   62.233457] CPU: 3 PID: 558 Comm: systemd-udevd Not tainted 
> > > > > 6.2.0-rc1+ #1
> > > > > [   62.252886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
> > > > > BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
> > > > > [   62.258432] Adding 2939900k swap on /dev/zram0.  Priority:100 
> > > > > extents:1 across:2939900k SSDscFS
> > > > > [   62.285513] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > > > [   62.285529] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 
> > > > > 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 
> > > > > 4c 89 0
> > > > > [   62.285531] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > > > [   62.285534] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 
> > > > > 0000000000000000
> > > > > [   62.285536] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 
> > > > > 00000000ffffffff
> > > > > [   62.285537] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 
> > > > > 0000000000000001
> > > > > [   62.285538] R10: 0000000000000001 R11: 0000000000000000 R12: 
> > > > > ffff97a8a37b8000
> > > > > [   62.285539] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 
> > > > > 0000000000000000
> > > > > [   62.285541] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) 
> > > > > knlGS:0000000000000000
> > > > > [   62.285542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [   62.285544] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 
> > > > > 00000000000006e0
> > > > > [   62.285653] Call Trace:
> > > > > [   62.285656]  <TASK>
> > > > > [   62.285660]  cxl_bus_probe+0x17/0x50
> > > > > [   62.285691]  really_probe+0xde/0x380
> > > > > [   62.285695]  ? pm_runtime_barrier+0x50/0x90
> > > > > [   62.285700]  __driver_probe_device+0x78/0x170
> > > > > [   62.285846]  driver_probe_device+0x1f/0x90
> > > > > [   62.285850]  __driver_attach+0xd2/0x1c0
> > > > > [   62.285853]  ? __pfx___driver_attach+0x10/0x10
> > > > > [   62.285856]  bus_for_each_dev+0x76/0xa0
> > > > > [   62.285860]  bus_add_driver+0x1b1/0x200
> > > > > [   62.285863]  driver_register+0x89/0xe0
> > > > > [   62.285868]  ? __pfx_init_module+0x10/0x10 [cxl_pmem]
> > > > > [   62.285874]  cxl_pmem_init+0x50/0xff0 [cxl_pmem]
> > > > > [   62.285880]  do_one_initcall+0x6e/0x330
> > > > > [   62.285888]  do_init_module+0x4a/0x200
> > > > > [   62.285892]  __do_sys_finit_module+0x93/0xf0
> > > > > [   62.285899]  do_syscall_64+0x5b/0x80
> > > > > [   62.285904]  ? do_syscall_64+0x67/0x80
> > > > > [   62.285906]  ? asm_exc_page_fault+0x22/0x30
> > > > > [   62.285910]  ? lockdep_hardirqs_on+0x7d/0x100
> > > > > [   62.285914]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > > > > [   62.285917] RIP: 0033:0x7f2619b0afbd
> > > > > [   62.285920] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e 
> > > > > fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 
> > > > > 08 0f 8
> > > > > [   62.285922] RSP: 002b:00007ffcc516bf58 EFLAGS: 00000246 ORIG_RAX: 
> > > > > 0000000000000139
> > > > > [   62.285924] RAX: ffffffffffffffda RBX: 00005557c0dcaa60 RCX: 
> > > > > 00007f2619b0afbd
> > > > > [   62.285925] RDX: 0000000000000000 RSI: 00007f261a18743c RDI: 
> > > > > 0000000000000006
> > > > > [   62.285926] RBP: 00007f261a18743c R08: 0000000000000000 R09: 
> > > > > 00007f261a17bb52
> > > > > [   62.285927] R10: 0000000000000006 R11: 0000000000000246 R12: 
> > > > > 0000000000020000
> > > > > [   62.285928] R13: 00005557c0dbbce0 R14: 0000000000000000 R15: 
> > > > > 00005557c0dc18a0
> > > > > [   62.285932]  </TASK>
> > > > > [   62.285933] Modules linked in: cxl_pmem(+) snd_pcm libnvdimm 
> > > > > snd_timer snd joydev bochs cxl_mem drm_vram_helper parport_pc 
> > > > > soundcore drm_ttm_g
> > > > > [   62.285954] CR2: 00000000000004c0
> > > > > [   62.288385] ---[ end trace 0000000000000000 ]---
> > > > > [   63.203514] RIP: 0010:cxl_nvdimm_probe+0x8d/0x130 [cxl_pmem]
> > > > > [   63.203562] Code: 85 c0 0f 85 90 00 00 00 f0 80 0c 24 40 f0 80 4c 
> > > > > 24 08 10 f0 80 4c 24 08 20 f0 80 4c 24 08 40 49 8d 84 24 b8 04 00 00 
> > > > > 4c 89 0
> > > > > [   63.203565] RSP: 0018:ffffacff0141fc38 EFLAGS: 00010202
> > > > > [   63.203570] RAX: ffff97a8a37b84b8 RBX: ffff97a8a37b8000 RCX: 
> > > > > 0000000000000000
> > > > > [   63.203572] RDX: 0000000000000001 RSI: ffff97a8a37b8000 RDI: 
> > > > > 00000000ffffffff
> > > > > [   63.203574] RBP: ffff97a8a37b8000 R08: 0000000000000001 R09: 
> > > > > 0000000000000001
> > > > > [   63.203576] R10: 0000000000000001 R11: 0000000000000000 R12: 
> > > > > ffff97a8a37b8000
> > > > > [   63.203577] R13: ffff97a982c3dc28 R14: 0000000000000000 R15: 
> > > > > 0000000000000000
> > > > > [   63.203580] FS:  00007f2619829580(0000) GS:ffff97a9bca00000(0000) 
> > > > > knlGS:0000000000000000
> > > > > [   63.203583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [   63.203585] CR2: 00000000000004c0 CR3: 00000001056a8000 CR4: 
> > > > > 00000000000006e0    
> > > 
> > > Possibly replicated.  What I did was stop cxl_pmem.ko from being probed
> > > automatically and added it manually later.  The trace that results is
> > > certainly similar to yours.
> > > 
> > > Now the MODULE_SOFTDEP() in drivers/cxl/acpi.c should stop that happening
> > > assuming you are letting autoloading run.
> > > I wonder if there is a path in which it doesn't?  
> > 
> > Gregory, would you mind checking if
> > cxl_nvb is NULL here...
> > https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/pmem.c#L67
> > (printk before it is used should work).
> > 
> > Might also be worth checking cxl_nvd and cxl_ds,
> > but my guess is that cxl_nvb is our problem (it is when I deliberately
> > change the load order).
> 
> Whilst I still have no idea if this is the same problem, I have identified
> what goes wrong if there is a module probe ordering issue.
> https://elixir.bootlin.com/linux/v6.2-rc4/source/drivers/cxl/core/pmem.c#L306
> 
>       /*
>        * The two actions below arrange for @cxl_nvd to be deleted when either
>        * the top-level PMEM bridge goes down, or the endpoint device goes
>        * through ->remove().
>        */
>       device_lock(&cxl_nvb->dev);
>       if (cxl_nvb->dev.driver)
>               rc = devm_add_action_or_reset(&cxl_nvb->dev, cxl_nvd_unregister,
>                                             cxl_nvd);
>       else
> // bridge driver not loaded, so we hit this path.
>               rc = -ENXIO;
>       device_unlock(&cxl_nvb->dev);
> 
>       if (rc)
> /// and this one
>               goto err_alloc;
> 
>       /* @cxlmd carries a reference on @cxl_nvb until cxlmd_release_nvdimm */
>       return devm_add_action_or_reset(&cxlmd->dev, cxlmd_release_nvdimm, 
> cxlmd);
> 
> err:
>       put_device(dev);
> err_alloc:
>       cxlmd->cxl_nvb = NULL;
>       cxlmd->cxl_nvd = NULL;
>       put_device(&cxl_nvb->dev);
> // whilst we scrub the pointers we don't actually get rid of the
> // cxl_nvd that we registered.  Hence later load of the driver tries to
> // attach to that and boom because we've scrubbed these pointers here.
> // A quick hack is to just call device_del(&cxl_nvd->dev) if rc == -ENXIO here.
> // There may well be races, though...
>       return rc;
> }
> EXPORT_SYMBOL_NS_GPL(devm_cxl_add_nvdimm, CXL);
> 
> 
> Of course this "fix" just stops things blowing up; it doesn't leave things
> in a remotely useful state.  If it's triggered because someone
> is messing with the load order, that's fine.  If the same issue
> is occurring for Gregory, not so much.

Apologies for not engaging on this sooner, I have been heads down on
trying to get pre-existing region enumeration on its feet.

One workaround for this is to just preclude the load-order scenario from
happening by making the pmem enabling part of the core module. There is
not much reason for it to exist independently.

...but I think it is useful to be able to have some policy for disabling
all pmem enabling in an emergency. So I think fixing the cxl_nvd
unregistration is the right way to go.


