qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip


From: Peter Xu
Subject: Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
Date: Thu, 27 Feb 2020 14:19:00 -0500

On Thu, Feb 27, 2020 at 07:22:08PM +0100, Auger Eric wrote:
> Hi Peter,
> 
> On 2/27/20 7:00 PM, Peter Xu wrote:
> > On Thu, Feb 27, 2020 at 06:42:09PM +0100, Auger Eric wrote:
> >> Hi Peter,
> >>
> >> On 2/27/20 6:00 PM, Peter Xu wrote:
> >>> This is majorly only for X86 because that's the only one that supports
> >>> split irqchip for now.
> >>>
> >>> When the irqchip is split, we face a dilemma that KVM irqfd will be
> >>> enabled, however the slow irqchip is still running in the userspace.
> >>> It means that the resamplefd in the kernel irqfds won't take any
> >>> effect and it can miss to ack INTx interrupts on EOIs.
> >> Won't it always fail to ack INTx? With the above sentence I understand
> >> it can work sometimes?
> > 
> > I wanted to mean that it will fail.  How about s/can/will/?  Or even
> > better wordings that you'd suggest?
> yes: s/can/will
> > 
> >>>
> >>> One example is split irqchip with VFIO INTx, which will break if we
> >>> use the VFIO INTx fast path.
> >>>
> >>> This patch can potentially supports the VFIO fast path again for INTx,
> >>> that the IRQ delivery will still use the fast path, while we don't
> >>> need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
> >>> callers of vfio_eoi() hook).  However the EOI of the INTx will still
> >>> need to be done from the userspace by caching all the resamplefds in
> >>> QEMU and kick properly for IOAPIC EOI broadcast.
> >> If I understand correctly this is a one way fast path? Fast path is on
> >> the trigger side only: VFIO -> KVM but not on the deactivation side,
> >> trapped by the userspace IOAPIC where you directly notify the UNMASK
> >> eventfd from userspace. Is that correct?
> > 
> > Right, the injection is still using the whole fast path.  However
> > AFAIU even for the EOI path it should still be faster than the pure
> > slow path of vfio INTx EIO.  From what I got from reading the code,
> > the slow path will conditionally unmap MMIO regions (with a timer to
> > delay the recovery) so all MMIOs will be slowed down.  For what this
> > patch is doing, it will need to exit to userspace for sure for each
> > EOI (after all IOAPIC is in userspace), however for the whole
> > lifecycle of the device, the MMIO regions should always be mapped so
> > no unwanted MMIO traps.
> Yes the EOI is trapped on IOAPIC side and not at the BAR level. So it
> should be more efficient and more precise.

Yes.

> > 
> >>>
> >>> When the userspace is responsible for the resamplefd kickup, don't
> >>> register it on the kvm_irqfd anymore, because on newer kernels (after
> >>> commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
> >>> irqchip and resamplefd.  This will make sure that the fast path will
> >>> work for all supported kernels.
> >>>
> >>> https://patchwork.kernel.org/patch/10738541/#22609933
> >>>
> >>> Suggested-by: Paolo Bonzini <address@hidden>
> >>> Signed-off-by: Peter Xu <address@hidden>
> >>> ---
> >>> v1.1 changelog:
> >>> - when resamplefd is going to be kicked from userspace, don't register
> >>>   it again in KVM_IRQFD.  Tested against upstream kernel.
> >>>
> >>>  accel/kvm/kvm-all.c    | 74 ++++++++++++++++++++++++++++++++++++++++--
> >>>  accel/kvm/trace-events |  1 +
> >>>  hw/intc/ioapic.c       | 11 +++++--
> >>>  include/sysemu/kvm.h   |  4 +++
> >>>  4 files changed, 86 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> >>> index d49b74512a..b766b6e93c 100644
> >>> --- a/accel/kvm/kvm-all.c
> >>> +++ b/accel/kvm/kvm-all.c
> >>> @@ -159,9 +159,62 @@ static const KVMCapabilityInfo 
> >>> kvm_required_capabilites[] = {
> >>>  static NotifierList kvm_irqchip_change_notifiers =
> >>>      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
> >>>  
> >>> +struct KVMResampleFd {
> >>> +    int gsi;
> >>> +    EventNotifier *resample_event;
> >>> +    QLIST_ENTRY(KVMResampleFd) node;
> >>> +};
> >>> +typedef struct KVMResampleFd KVMResampleFd;
> >>> +
> >>> +/*
> >>> + * Only used with split irqchip where we need to do the resample fd
> >>> + * kick for the kernel from userspace.
> >>> + */
> >>> +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
> >>> +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
> >>> +
> >>>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
> >>>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
> >>>  
> >>> +static inline void kvm_resample_fd_remove(int gsi)
> >>> +{
> >>> +    KVMResampleFd *rfd;
> >>> +
> >>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> >>> +        if (rfd->gsi == gsi) {
> >>> +            QLIST_REMOVE(rfd, node);
> >>> +            break;
> >>> +        }
> >>> +    }
> >>> +}
> >>> +
> >>> +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
> >>> +{
> >>> +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
> >>> +
> >>> +    rfd->gsi = gsi;
> >>> +    rfd->resample_event = event;
> >>> +
> >>> +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
> >>> +}
> >>> +
> >>> +void kvm_resample_fd_notify(int gsi)
> >>> +{
> >>> +    KVMResampleFd *rfd;
> >>> +
> >>> +    if (!kvm_irqchip_is_split()) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> >>> +        if (rfd->gsi == gsi) {
> >>> +            event_notifier_set(rfd->resample_event);
> >>> +            trace_kvm_resample_fd_notify(gsi);
> >>> +            break;
> >>> +        }
> >>> +    }
> >>> +}
> >>> +
> >>>  int kvm_get_max_memslots(void)
> >>>  {
> >>>      KVMState *s = KVM_STATE(current_accel());
> >>> @@ -1642,8 +1695,25 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, 
> >>> EventNotifier *event,
> >>>      };
> >>>  
> >>>      if (rfd != -1) {
> >>> -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> >>> -        irqfd.resamplefd = rfd;
> >>> +        assert(assign);
> >>> +        if (kvm_irqchip_is_split()) {
> >>> +            /*
> >>> +             * When the slow irqchip (e.g. IOAPIC) is in the
> >>> +             * userspace, resamplefd will not work because the EOI of
> >>> +             * the interrupt will be delivered to userspace instead,
> >> s/delivered to userspace/handled in userspace
> > 
> > It will be delivered to userspace by KVM_EXIT_IOAPIC_EOI, so
> > maybe... "delivered and handled"?
> ah ok. TBH I don't really know how the split irqchip works and that may
> explain below misunderstandings.
> > 
> >>> +             * the KVM resample fd kick is skipped.  The userspace
> >>> +             * needs to remember the resamplefd and kick it when we
> >>> +             * receive EOI of this IRQ.
> >> Practically we now talk about a VFIO ACTION_UNMASK classical eventfd
> >> As such isn't it a bit weird to handle those normal UNMASK eventfds in
> >> the KVM code?
> > 
> > I'm not sure I completely get the question, but this should be
> > something general to KVM resamplefd support.  In other words, this
> > should also fix other devices (besides VFIO) when they're using the
> > KVM resamplefd, because IMHO it's the resamplefd and split irqchip
> > which is really broken here.
> Here is my understanding (& memories): the KVM resamplefd is an eventfd
> you register to KVM so that KVM triggers the resamplefd when KVM traps
> the EOI. Here I understand this is the userspace IOAPIC that traps the
> EOI and not the in-kernel virtual interrupt controller. So I would have
> expected you just need to signal the VFIO UNMASK eventfd to re-enable
> the physical IRQ (which was automasked). This is no more a KVM
> resamplefd strictly speaking as KVM is not involved anymore in the
> deactivation process.

Yes KVM kernel side should not be involed when we're using split
irqchip in this case.  However it should still belongs to the work of
the userspace KVM module (kvm-all.c) so that it can still "mimic" the
resamplefd feature that KVM_IRQFD provides.

> > 
> > With that in mind, I think KVM should not need to even know what's
> > behind the resamplefd (in VFIO's case, it's the UNMASK eventfd).  It
> > just needs to kick it when IOAPIC EOI comes for the specific IRQ
> But above the userspace directly calls
> event_notifier_set(rfd->resample_event);
> 
> This is not KVM anymore that "kicks it". Or maybe I miss something. So
> my comment was, why is it handled in the QEMU KVM layer?

It's my fault to be unclear on using "KVM" above.  I should really say
it as kvm-all.c, say, the QEMU layer for the kernel KVM module.

Indeed this problem is complicated... let me try to summarize.

Firstly KVM split irqchip and resamplefd is not really going to work
in the kernel (I think we just overlooked that when introducing the
2nd feature, no matter which one comes first), because the resample
operation should be part of IOAPIC EOI, nevertheless when using split
irqchip IOAPIC is in userspace.

After we noticed this, Alex somewhere proposed to disable that in KVM,
which is actually the 1st kernel patch (654f1f13ea56).

We should (at the same time) propose patch 1 too in this series but I
guess everybody just forgot this afterwards (Paolo actually proposed
mostly the whole solution but I guess it got forgotten too)...

About the fast path speedup: the main logic should be to mimic the
same resamplefd feature as provided by KVM_IRQFD but this time only in
the userspace.  However now we're implementing the same logic only
within userspace kvm-all.c, and the kernel KVM should be totally not
aware of this.  Doing that benefits us in that the KVM interface in
QEMU does not need to be changed (majorly kvm_irqchip_assign_irqfd()).
What we need to do is just to wire up the userspace IOAPIC with these
resamplefds.  And the idea is actually the same too - someone (VFIO)
wants to have one fd (which is the resamplefd) kicked when EOI comes
when requesting for a KVM irqfd, no matter who's going to kick it
(kernel KVM or userspace).  That's all.

> .
> > 
> >>
> >>
> >>> +             */
> >>> +            kvm_resample_fd_insert(virq, resample);
> >>> +        } else {
> >>> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> >>> +            irqfd.resamplefd = rfd;
> >>> +        }
> >>> +    } else if (!assign) {
> >>> +        if (kvm_irqchip_is_split()) {
> >>> +            kvm_resample_fd_remove(virq);
> >>> +        }
> >>>      }
> >>>  
> >>>      if (!kvm_irqfds_enabled()) {
> >>> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> >>> index 4fb6e59d19..a68eb66534 100644
> >>> --- a/accel/kvm/trace-events
> >>> +++ b/accel/kvm/trace-events
> >>> @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t 
> >>> val, bool assign, uint32_
> >>>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, 
> >>> uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d 
> >>> match: %d"
> >>>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t 
> >>> guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) 
> >>> "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " 
> >>> ret=%d"
> >>>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) 
> >>> "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
> >>> +kvm_resample_fd_notify(int gsi) "gsi %d"
> >>>  
> >>> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> >>> index 15747fe2c2..8c75465c62 100644
> >>> --- a/hw/intc/ioapic.c
> >>> +++ b/hw/intc/ioapic.c
> >>> @@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
> >>>          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
> >>>              entry = s->ioredtbl[n];
> >>>  
> >>> -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> >>> -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != 
> >>> IOAPIC_TRIGGER_LEVEL) {
> >>> +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> >>> +                continue;
> >>> +            }
> >>> +
> >>> +            /* Kick resamplefd if KVM is bypassed */
> >>> +            kvm_resample_fd_notify(n);
> >> KVM is bypassed on the deactivation path but still we call
> >> kvm_resample_fd_notify().
> > 
> > Yes I wanted to say that the kernel won't be able to kick the
> > resamplefd.  How about:
> > 
> >   When IOAPIC is in the userspace (while APIC is still in the kernel),
> >   we need to kick the resamplefd to deactivate the IRQ for KVM.
> This fd "just" aims at unmasking the IRQ at physical level (UNMASK VFIO
> event)? Does it perform anything related to the virtual interrupt
> controller?

It should not.

Thanks,

-- 
Peter Xu




reply via email to

[Prev in Thread] Current Thread [Next in Thread]