

Re: [PATCH] ppc/spapr: Advertise StoreEOI for POWER10 compat guests

From: Daniel Henrique Barboza
Subject: Re: [PATCH] ppc/spapr: Advertise StoreEOI for POWER10 compat guests
Date: Thu, 17 Feb 2022 18:03:08 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

On 2/17/22 10:23, Cédric Le Goater wrote:
On 2/17/22 12:28, Daniel Henrique Barboza wrote:

On 2/14/22 11:11, Cédric Le Goater wrote:
When an interrupt has been handled, the OS notifies the interrupt
controller with an EOI sequence. On POWER9 and POWER10 systems using
the XIVE interrupt controller, this can be done with a load or a store
operation on the ESB interrupt management page of the interrupt. The
StoreEOI operation has less latency and improves interrupt handling
performance, but it was deactivated during the POWER9 DD2.0 timeframe
because of ordering issues, so POWER9 systems use LoadEOI instead.
Guests running in POWER10 compat mode should already include the fix
for the Load-after-Store ordering issue, so StoreEOI can be activated
for them.

To maintain performance, this ordering is only enforced for the
XIVE_ESB_SET_PQ_10 load operation. This operation can be used to
temporarily disable an interrupt source. If StoreEOI is active, a
source could be left enabled if the load and store operations come
out of order.
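
To illustrate this hazard, here is a small sketch, under the same assumed PQ encoding, that applies a StoreEOI followed by a SET_PQ_10 load either in program order or with the load overtaking the store. The function names are hypothetical and replay handling is left out:

```c
#include <assert.h>
#include <stdint.h>

#define ESB_RESET   0x0 /* P=0 Q=0: source enabled */
#define ESB_OFF     0x1 /* P=0 Q=1: source disabled */
#define ESB_PENDING 0x2 /* P=1 Q=0: new events are only queued */
#define ESB_QUEUED  0x3 /* P=1 Q=1 */

/* StoreEOI effect on the PQ bits (replay handling omitted). */
static void esb_store_eoi(uint8_t *pq)
{
    if (*pq == ESB_QUEUED) {
        *pq = ESB_PENDING;
    } else if (*pq == ESB_PENDING) {
        *pq = ESB_RESET;
    }                            /* RESET and OFF are unchanged */
}

/* Load at a SET_PQ offset: install the new PQ, return the old one. */
static uint8_t esb_set_pq(uint8_t *pq, uint8_t new_pq)
{
    uint8_t old = *pq;

    assert(new_pq <= ESB_QUEUED);
    *pq = new_pq;
    return old;
}

/* Program order is StoreEOI, then a SET_PQ_10 load to hold the source.
 * 'reordered' lets the load reach the controller before the store. */
static uint8_t eoi_then_hold(uint8_t start_pq, int reordered)
{
    uint8_t pq = start_pq;

    if (!reordered) {
        esb_store_eoi(&pq);
        esb_set_pq(&pq, ESB_PENDING);   /* SET_PQ_10 */
    } else {
        esb_set_pq(&pq, ESB_PENDING);   /* load overtakes the store */
        esb_store_eoi(&pq);
    }
    return pq;
}
```

Starting from a pending source, eoi_then_hold(ESB_PENDING, 0) ends in ESB_PENDING (the source is held), while eoi_then_hold(ESB_PENDING, 1) ends in ESB_RESET, i.e. the source is left enabled.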

Add a check in our XIVE emulation model for Load-after-Store when
StoreEOI is active. It should catch unreliable sequences. Other load
operations should be fine without it.
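
A check of this kind could look roughly as follows. This is a hypothetical sketch, not the patch itself: the offset values are assumed to match the Linux XIVE headers, it assumes the guest requests ordering by adding the `XIVE_ESB_LD_ST_MO` (0x40) offset to the load address, and `struct esb_source` and `esb_check_load` are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors XIVE_ESB_SET_PQ_10 (assumed value). */
#define ESB_SET_PQ_10 0xe00
/* Mirrors XIVE_ESB_LD_ST_MO: offset added to a load to request
 * load-after-store ordering (assumed value). */
#define ESB_LD_ST_MO  0x040

/* Hypothetical per-source state for this sketch. */
struct esb_source {
    bool store_eoi;   /* StoreEOI advertised to the guest */
    uint32_t irq;     /* source number, for the log message */
};

/* Flags a SET_PQ_10 load issued without the ordering offset while
 * StoreEOI is active; returns true when the sequence is unreliable. */
static bool esb_check_load(const struct esb_source *src, uint32_t offset)
{
    assert(src != NULL);

    bool is_set_pq_10 = (offset & 0xf00) == ESB_SET_PQ_10;
    bool ordered = (offset & ESB_LD_ST_MO) != 0;

    if (src->store_eoi && is_set_pq_10 && !ordered) {
        fprintf(stderr, "XIVE: load-after-store ordering not enforced "
                "with StoreEOI active for IRQ %u\n", (unsigned)src->irq);
        return true;
    }
    return false;
}
```

Only the SET_PQ_10 load is checked, matching the description above; other loads pass through silently.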

Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Unfortunately, this patch breaks migration under TCG because the XIVE
source flag is not updated on the target side. KVM is not impacted
because the emulated sources are not used. This needs to be addressed
in a v2.

That said, even without this patch, TCG migration is broken. Some CPUs
on the receive side are stalled with hard LOCKUPs. QEMU 6.2 is impacted,
so it has been a while :/

I've done a few tests and I can see hard lockups with TCG pseries migration
when using multiple CPUs (I used -smp 4 like you suggested in private), since
at least

This is hardly surprising since TCG migration isn't something that we ever
supported in a product or even in the community*. It would be good to
understand why and get it fixed, but for now we can take some comfort in
knowing that:

- it has been broken for a while (if it ever worked). If this were a recent
7.0 regression, we would need to solve it for this upcoming release;

- single CPU TCG migration seems to be working fine, so we can count on this
migration scenario for testing.

* I'm hoping David and Greg can push back on this if my assumption is wrong.



See below.


[   24.113608] watchdog: CPU 0 detected hard LOCKUP on other CPUs 1,3
[   24.116534] watchdog: CPU 0 TB:15585461459, last SMP heartbeat TB:7394335409 (15998ms ago)
[   24.117840] watchdog: CPU 1 Hard LOCKUP
[   24.117956] watchdog: CPU 1 TB:15587843000, last heartbeat TB:5355690415 (19984ms ago)
[   24.117999] Modules linked in:
[   24.118387] irq event stamp: 341399
[   24.118399] hardirqs last  enabled at (341399): [<c000000000caea2c>] 
[   24.118900] hardirqs last disabled at (341398): [<c000000000208b9c>] 
[   24.118943] softirqs last  enabled at (9798): [<c000000000f97dfc>] 
[   24.118971] softirqs last disabled at (9789): [<c0000000001b06f8>] 
[   24.119127] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.17.0-rc4-dirty #984
[   24.119293] NIP:  c000000000caea78 LR: c000000000caea38 CTR: c000000000cae990
[   24.119315] REGS: c0000000fff43d60 TRAP: 0100   Not tainted  
[   24.119352] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000228  XER: 00000006
[   24.119554] CFAR: c000000000caea98 IRQMASK: 0
[   24.119554] GPR00: c000000000caea2c c000000002bbbd80 c000000001c30b00 
[   24.119554] GPR04: 0000000000000006 0000000000000000 000000000000c800 
[   24.119554] GPR08: c000000002b5d500 0000000000000000 00000003a115ef39 
[   24.119554] GPR12: c000000000cae990 c0000000fffff300 0000000000000000 
[   24.119554] GPR16: 0000000000000000 0000000000000000 0000000000000000 
[   24.119554] GPR20: 0000000000000000 0000000000000000 0000000000000000 
[   24.119554] GPR24: c0000000ffa4fb48 000000059d7c5070 c000000001c78e48 
[   24.119554] GPR28: c000000001b3a660 c0000000015422e0 c0000000015422e8 
[   24.119845] NIP [c000000000caea78] snooze_loop+0xe8/0x290
[   24.119866] LR [c000000000caea38] snooze_loop+0xa8/0x290
[   24.119998] Call Trace:
[   24.120029] [c000000002bbbd80] [c000000000caea2c] snooze_loop+0x9c/0x290 
[   24.120097] [c000000002bbbdc0] [c000000000cab730] 
[   24.120119] [c000000002bbbe30] [c000000000cabbfc] cpuidle_enter+0x4c/0x70
[   24.120131] [c000000002bbbe70] [c000000000208d98] do_idle+0x328/0x450
[   24.120141] [c000000002bbbf00] [c00000000020926c] cpu_startup_entry+0x3c/0x40
[   24.120150] [c000000002bbbf30] [c00000000005e144] start_secondary+0x2a4/0x2b0
[   24.120161] [c000000002bbbf90] [c00000000000d054] 
[   24.120238] Instruction dump:
[   24.120320] e9280080 e8c7d148 3ce20005 71290004 38e7d138 7d4a3214 4082003c 
[   24.120357] 60000000 60420000 7c210b78 7ffffb78 <8927000c> 2c090000 41820010 
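
The "ms ago" figures in the watchdog lines are timebase (TB) deltas converted to milliseconds. A minimal sketch of that arithmetic, assuming the 512 MHz timebase frequency of QEMU's pseries machine; `tb_delta_ms` is a hypothetical helper, not the kernel's own code:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed pseries timebase frequency: 512 MHz. */
#define TB_FREQ_HZ 512000000ULL

/* Hypothetical helper: elapsed milliseconds between two TB readings. */
static uint64_t tb_delta_ms(uint64_t tb_now, uint64_t tb_then)
{
    assert(tb_now >= tb_then);
    return (tb_now - tb_then) * 1000ULL / TB_FREQ_HZ;
}
```

With the values from the log, tb_delta_ms(15585461459, 7394335409) gives 15998 and tb_delta_ms(15587843000, 5355690415) gives 19984, which is consistent with the assumed 512 MHz frequency.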
