Re: XIVE VFIO kernel resample failure in INTx mode under heavy load

From: Cédric Le Goater
Subject: Re: XIVE VFIO kernel resample failure in INTx mode under heavy load
Date: Thu, 14 Apr 2022 14:31:37 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0

Hello Alexey,

Thanks for taking over.

On 4/13/22 06:56, Alexey Kardashevskiy wrote:

On 3/17/22 06:16, Cédric Le Goater wrote:

On 3/16/22 17:29, Cédric Le Goater wrote:

I've been struggling for some time with what is looking like a
potential bug in QEMU/KVM on the POWER9 platform.  It appears that
in XIVE mode, when the in-kernel IRQ chip is enabled, an external
device that rapidly asserts IRQs via the legacy INTx level mechanism
will only receive one interrupt in the KVM guest.

Indeed. I could reproduce with a pass-through PCI adapter using
'pci=nomsi'. The virtio devices operate correctly but the network
adapter only receives one event (*):

$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 16:       2198       1378       1519       1216          0          0          0          0  XIVE-IPI    0 Edge      IPI-0
 17:          0          0          0          0       2003       1936       1335       1507  XIVE-IPI    1 Edge      IPI-1
 18:          0       6401          0          0          0          0          0          0  XIVE-IRQ 4609 Level     virtio3, virtio0, virtio2
 19:          0          0          0          0          0        204          0          0  XIVE-IRQ 4610 Level     virtio1
 20:          0          0          0          0          0          0          0          0  XIVE-IRQ 4608 Level     xhci-hcd:usb1
 21:          0          1          0          0          0          0          0          0  XIVE-IRQ 4612 Level     eth1 (*)
 23:          0          0          0          0          0          0          0          0  XIVE-IRQ 4096 Edge      RAS_EPOW
 24:          0          0          0          0          0          0          0          0  XIVE-IRQ 4592 Edge      hvc_console
 26:          0          0          0          0          0          0          0          0  XIVE-IRQ 4097 Edge      RAS_HOTPLUG

Changing any one of those items appears to avoid the glitch, e.g. XICS

XICS is very different from XIVE. The driver implements the previous
interrupt controller architecture (P5-P8) and the hypervisor mediates
the delivery to the guest. With XIVE, vCPUs are directly signaled by
the IC. Under KVM, we use a different KVM device for each mode :

* KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
   not using the XIVE native interface. RHEL7 for instance.
* KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
   using the XIVE native interface. Linux > 4.14.
* KVM XICS is for P8 hosts (no XIVE HW)
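
As a rough illustration of the list above, the choice of KVM device can be written as a small lookup. This is only a sketch, not QEMU code: `KvmIrqDevice` and `select_kvm_device` are hypothetical names, and the two boolean inputs are a simplification of what the host and guest actually negotiate.

```python
from enum import Enum


class KvmIrqDevice(Enum):
    """The three KVM interrupt controller devices named in the list (labels only)."""
    XICS = "KVM XICS"                # P8 hosts, no XIVE HW
    XICS_ON_XIVE = "KVM XIVE"        # P9/P10 host, guest not using the XIVE native interface
    XIVE_NATIVE = "KVM XIVE native"  # P9/P10 host, guest using the XIVE native interface


def select_kvm_device(host_has_xive, guest_uses_xive_native):
    """Hypothetical helper: map a host/guest combination to a KVM device."""
    if not host_has_xive:
        if guest_uses_xive_native:
            # The list names no KVM device for this combination.
            raise ValueError("no KVM device listed for a XIVE native guest without XIVE HW")
        return KvmIrqDevice.XICS
    if guest_uses_xive_native:
        return KvmIrqDevice.XIVE_NATIVE
    return KvmIrqDevice.XICS_ON_XIVE
```

For instance, `select_kvm_device(True, False)` corresponds to the RHEL7 guest case on a P9 host.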

VFIO adds some complexity with the source events. I think the problem
comes from the assertion state. I will talk about it later.

mode with the in-kernel IRQ chip works (all interrupts are passed

All interrupts are also passed through with XIVE. Run 'info pic' in
the monitor. On the host, check the IRQ mapping in the debugfs file :


and XIVE mode with the in-kernel IRQ chip disabled also works.

In that case, no KVM device backs the QEMU device and all state
is in one place.

are also not seeing any problems in XIVE mode with the in-kernel
chip from MSI/MSI-X devices.

Yes. Pass-through devices are expected to operate correctly :)

The device in question is a real time card that needs to raise an
interrupt every 1ms.  It works perfectly on the host, but fails in
the guest -- with the in-kernel IRQ chip and XIVE enabled, it
receives exactly one interrupt, at which point the host continues to
see INTx+ but the guest sees INTx-, and the IRQ handler in the guest
kernel is never reentered.

OK. Same symptom as in the scenario above.

We have also seen some very rare glitches where, over a long period
of time, we can enter a similar deadlock in XICS mode.

With the in-kernel XICS IRQ chip ?

the in-kernel IRQ chip in XIVE mode will also lead to the lockup
with this device, since the userspace IRQ emulation cannot keep up
with the rapid interrupt firing (measurements show around 100ms
required for processing each interrupt in the user mode).

MSI emulation in QEMU is indeed slower (35%). LSI is very slow because
it is handled as a special case in the device/driver. To maintain the
assertion state, all LSI handling is done with a special hcall,
H_INT_ESB, which is implemented in QEMU. This generates a lot of back
and forth in the KVM stack.

My understanding is the resample mechanism does some clever tricks
with level IRQs, but that QEMU needs to check if the IRQ is still
asserted by the device on guest EOI.

Yes. The problem is in that area.

Since a failure here would
explain these symptoms I'm wondering if there is a bug in either
QEMU or KVM for POWER / pSeries (sPAPR) where the IRQ is not
resampled and therefore not re-fired in the guest?

KVM I would say. The assertion state is maintained in KVM for the KVM
XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
device. These are good candidates. I will take a look.

All works fine with KVM_CAP_IRQFD_RESAMPLE=false in QEMU. Can you please
try this workaround for now ? I could reach 934 Mbits/sec on the passthru

I clearly overlooked that part and it has been 3 years.

Disabling KVM_CAP_IRQFD_RESAMPLE on XIVE-capable machines seems to be the right 
fix actually.

XIVE == baremetal/vm POWER9 and newer.
XICS == baremetal/vm POWER8 and older, or VMs on any POWER (backward compat.).

Yes. You can force XICS on POWER9 using 'max-cpu-compat' or 'ic-mode'.

Tested on POWER9 with a passed through XHCI host and "-append pci=nomsi" and 
"-machine pseries,ic-mode=xics,kernel_irqchip=on" (and s/xics/xive/).

ok. This is deactivating the default XIVE (P9+) mode at the platform level,
hence forcing the XICS (P8) mode in a POWER9 guest running on a POWER9 host.
It is also deactivating MSI, forcing INTx usage in the kernel and forcing
the use of the KVM irqchip device to make sure we are not emulating in QEMU.

We are far from the default scenario but this is it !

When it is XIVE-on-XIVE (host and guest are XIVE),

We call this the XIVE native, or exploitation, mode. Anyhow, it is always
XIVE under the hood on a POWER9/POWER10 box.

INTx is emulated in QEMU's H_INT_ESB handler

LSIs are indeed all handled at the QEMU level with the H_INT_ESB hcall.
If I remember correctly, this is because we wanted a simple way to synthesize
the interrupt trigger upon EOI when the level is still asserted. Doing it
this way is compatible with both kernel_irqchip=off/on modes because the
level is maintained in QEMU.
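
To make the "synthesize the trigger upon EOI" idea concrete, here is a minimal sketch of a single LSI source. It is a toy model, not the QEMU H_INT_ESB code: the `LsiSource` class and its fields are hypothetical, and real ESB state handling is reduced to two booleans.

```python
class LsiSource:
    """Toy model of one level-sensitive interrupt (LSI) source."""

    def __init__(self):
        self.asserted = False  # level currently driven by the device
        self.pending = False   # an event was sent and has not been EOI'd yet
        self.delivered = 0     # number of events sent to the guest

    def _trigger(self):
        self.pending = True
        self.delivered += 1

    def set_level(self, level):
        """The device raises or lowers its interrupt line."""
        self.asserted = level
        if level and not self.pending:
            self._trigger()

    def eoi(self):
        """Guest EOI: if the level is still asserted, synthesize a new trigger."""
        self.pending = False
        if self.asserted:
            self._trigger()
```

In this model, an EOI while the line is still high produces a second event, and an EOI after the device lowers the line produces none. Because the level is tracked in one place, the same logic applies whether or not an in-kernel irqchip is used.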

This is different for the other two XICS KVM devices which maintain the
assertion level in KVM.

and IRQFD_RESAMPLE is just useless in such a case (as it is designed to eliminate going to
userspace for the EOI->INTx unmasking) and there is no pathway to call the eventfd's
irqfd_resampler_ack() from QEMU. So the VM's XHCI device receives exactly 1 interrupt and
that is it. "kernel_irqchip=off" fixes it (obviously).
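
The "exactly 1 interrupt" symptom can be reproduced with a small simulation of the resample path. This is a sketch under stated assumptions, not kernel or QEMU code: `VfioIntxModel` and `count_guest_interrupts` are hypothetical names, the host is assumed to mask INTx when it forwards an interrupt and to unmask it only when the resample notification arrives, and the device is assumed to keep its line asserted throughout.

```python
class VfioIntxModel:
    """Toy model of INTx forwarding with a resample path (not kernel code)."""

    def __init__(self, eoi_notifies_resampler):
        # True models an EOI path that reaches the resampler ack;
        # False models an EOI path that never does.
        self.eoi_notifies_resampler = eoi_notifies_resampler
        self.device_asserting = False
        self.host_masked = False
        self.guest_interrupts = 0

    def _forward(self):
        # Forward only if the line is asserted and not masked on the host.
        if self.device_asserting and not self.host_masked:
            self.host_masked = True     # masked until the guest EOI is seen
            self.guest_interrupts += 1  # interrupt injected into the guest

    def device_assert(self):
        self.device_asserting = True
        self._forward()

    def guest_eoi(self):
        if self.eoi_notifies_resampler:
            self.host_masked = False    # resample notification unmasks INTx
            self._forward()             # line still asserted, so it fires again


def count_guest_interrupts(eoi_notifies_resampler, eois):
    """Assert the line once, then let the guest EOI `eois` times."""
    model = VfioIntxModel(eoi_notifies_resampler)
    model.device_assert()
    for _ in range(eois):
        model.guest_eoi()
    return model.guest_interrupts
```

With five EOIs, `count_guest_interrupts(True, 5)` returns 6, while `count_guest_interrupts(False, 5)` returns 1: when the EOI never reaches the resampler, the line stays masked after the first interrupt.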


When it is XICS-on-XIVE (host is XIVE and guest is XICS),

yes (FYI, we have similar glue in skiboot ...)

then the VM receives 100000 interrupts and then it gets frozen 
(__report_bad_irq() is called). Which happens because (unlike XICS-on-XICS), 
the host XIVE's xive_(rm|vm)_h_eoi() does not call irqfd_resampler_ack(). This 
fixes it:

diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
index b0015e05d99a..9f0d8e5c7f4b 100644
--- a/arch/powerpc/kvm/book3s_xive_template.c
+++ b/arch/powerpc/kvm/book3s_xive_template.c
@@ -595,6 +595,8 @@ X_STATIC int GLUE(X_PFX,h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr)
         xc->hw_cppr = xc->cppr;
         __x_writeb(xc->cppr, __x_tima + TM_QW1_OS + TM_CPPR);

+       kvm_notify_acked_irq(vcpu->kvm, 0, irq);
         return rc;

OK. XICS-on-XIVE is also broken then :/ What about XIVE-on-XIVE ?

The host's XICS does call kvm_notify_acked_irq() (I did not test that but the 
code seems to be doing so).

After re-reading what I just wrote, I am leaning towards disabling use of 
KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never since :)

and it would fix XIVE-on-XIVE.

Are you saying that passthru on POWER8 is broken ? Fully, or only INTx ?

Did I miss something in the picture (hey Cedric)?

You seem to have all the combinations in mind: host OS, KVM, QEMU, guest OS.

For the record, here is some documentation we wrote:


It might need some updates.


