Re: [PATCH 1/2] virtio: use blk_io_plug_call() in virtio_irqfd_notify()


From: Stefan Hajnoczi
Subject: Re: [PATCH 1/2] virtio: use blk_io_plug_call() in virtio_irqfd_notify()
Date: Wed, 16 Aug 2023 11:30:10 -0400

On Wed, Aug 16, 2023 at 03:36:32PM +0200, Ilya Maximets wrote:
> On 8/15/23 14:08, Stefan Hajnoczi wrote:
> > virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
> > Buffer Notifications from an IOThread. This involves an eventfd
> > write(2) syscall. Calling this repeatedly when completing multiple I/O
> > requests in a row is wasteful.
> 
> Hi, Stefan.  This is an interesting change!
> 
> There is more or less exactly the same problem with fast network backends
> and I was playing around with similar ideas in this area while working on
> af-xdp network backend recently.  Primarily, implementation of the Rx BH
> for virtio-net device and locking the RCU before passing packets from the
> backend to the device one by one.
> 
> > 
> > Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
> > calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
> > completion processing. Do not modify the thread pool (aio=threads) to
> > avoid introducing a dependency from util/ onto the block layer.
> 
> But you're introducing a dependency from generic virtio code onto the
> block layer in this patch.  This seems to break the module abstraction.
> 
> It looks like there are 2 options to resolve the semantics issue here:

Yes, it's a layering violation.

> 
> 1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
>    Block layer is the only user, so that may be justified, but it
>    doesn't seem like a particularly good solution.  (I'm actually not
>    sure why block devices are the only ones using this function...)

Yes, this is the easiest way to avoid the layering violation for now.

The virtio_notify_irqfd() API is necessary when running in an IOThread
because the normal QEMU irq API must run under the Big QEMU Lock. Block
devices are the only ones that raise interrupts from IOThreads at the
moment.
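
A simplified sketch of the difference (illustrative only, not the exact
hw/virtio/virtio.c code):

/* BQL path: raising the interrupt goes through the PCI/MSI-X emulation,
 * which must run under the Big QEMU Lock. */
void virtio_notify(VirtIODevice *vdev, VirtQueue *vq);

/* IOThread path: the guest notifier is an eventfd that KVM's irqfd
 * mechanism turns into an interrupt injection, so signalling it is just
 * a write(2) on the eventfd and needs no BQL.  ISR/EVENT_IDX handling
 * is elided here. */
void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
{
    event_notifier_set(&vq->guest_notifier);
}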

> 
> 2. Move and rename the block/plug library somewhere generic.  The plug
>    library seems to not have any dependencies on a block layer, other
>    than a name, so it should not be hard to generalize it (well, the
>    naming might be hard).

Yes, it should be possible to make it generic quite easily. I will give
this a try in the next version of the patch.

> In general, while looking at the plug library, it seems to do what is
> typically implemented via RCU frameworks - the delayed function call.
> The only difference is that RCU doesn't check for duplicates and the
> callbacks are global.  Should not be hard to add some new functionality
> to RCU framework in order to address these, e.g. rcu_call_local() for
> calls that should be executed once the current thread exits its own
> critical section.

This rcu_call_local() idea is unrelated to Read Copy Update, so I don't
think it should be part of the RCU API.

Another deferred function call mechanism is QEMUBH. It already supports
coalescing. However, BHs are invoked once per AioContext event loop
iteration and there is no way to invoke the BH earlier. Also, the BH pointer
needs to be passed to every function that wishes to schedule a deferred
call, which can be tedious (e.g. block/linux-aio.c should defer the
io_submit(2) syscall until the end of virtio-blk request processing -
there are a lot of layers of code between those two components).
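
For comparison, a sketch of that QEMUBH pattern (the DeferredNotify
struct and its functions are made up for illustration; aio_bh_new(),
qemu_bh_schedule() and event_notifier_set() are the real APIs):

#include "qemu/osdep.h"
#include "block/aio.h"
#include "qemu/event_notifier.h"

typedef struct {
    QEMUBH *notify_bh;
    EventNotifier *notifier;
} DeferredNotify;

static void notify_bh_cb(void *opaque)
{
    DeferredNotify *d = opaque;
    event_notifier_set(d->notifier);
}

static void deferred_notify_init(AioContext *ctx, DeferredNotify *d,
                                 EventNotifier *notifier)
{
    d->notifier = notifier;
    d->notify_bh = aio_bh_new(ctx, notify_bh_cb, d);
}

/* Scheduling an already-scheduled BH is a no-op, so N completions
 * coalesce into one callback.  But the callback only runs on the next
 * event loop iteration, and every layer that completes requests needs
 * access to d->notify_bh. */
static void request_completed(DeferredNotify *d)
{
    qemu_bh_schedule(d->notify_bh);
}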

> 
> Using RCU for non-RCU-protected things might be considered as an abuse.
> However, we might solve two issues in one shot if instead of entering
> blk_io_plug/unplug section we will enter an RCU critical section and
> call callbacks at the exit.  The first issue is the notification batching
> that this patch is trying to fix, the second is an excessive number of
> thread fences on RCU exits every time virtio_notify_irqfd() and other
> virtio functions are invoked.  The second issue can be avoided by using
> RCU_READ_LOCK_GUARD() in completion functions.  Not sure if that will
> improve performance, but it definitely removes a lot of noise from the
> perf top for network backends.  This makes the code a bit less explicit
> though, the lock guard will definitely need a comment.  Though, the reason
> for the blk_io_plug() calls is not fully clear from the module code alone either.

util/aio-posix.c:run_poll_handlers() has a top-level
RCU_READ_LOCK_GUARD() for this reason. Maybe we should do the same
around aio_bh_poll() + aio_dispatch_ready_handlers() in
util/aio-posix.c:aio_poll()? The downside is that latency-sensitive
call_rcu() callbacks perform worse.
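
Roughly (the shape of the change only, not the actual util/aio-posix.c
code):

/* Hypothetical: mirror the guard that run_poll_handlers() already takes.
 * One read-side critical section around the whole dispatch phase turns
 * the rcu_read_lock()/rcu_read_unlock() pairs inside the handlers into
 * cheap nested enter/exit, but call_rcu() callbacks queued during
 * dispatch only run after the entire batch finishes. */
WITH_RCU_READ_LOCK_GUARD() {
    progress |= aio_bh_poll(ctx);
    progress |= aio_dispatch_ready_handlers(ctx, &ready_list);
}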

> 
> I'm not sure what is the best way forward.  I'm trying to figure out
> that myself, since a similar solution will in the end be needed for
> network backends and so it might be better to have something more generic.
> What do you think?

I suggest moving blk_io_plug() to util/ and renaming it to
defer_begin()/defer_end()/defer_call(). The interface and implementation
can also be improved as needed for other users.
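
To make that concrete, a self-contained sketch of what the generic API
could look like (the defer_* names are the ones suggested above; the
implementation is illustrative, not the actual block/plug.c code):

#include <stddef.h>

typedef void (*DeferFn)(void *opaque);

typedef struct {
    DeferFn fn;
    void *opaque;
} DeferredCall;

static __thread unsigned defer_depth;       /* nesting of defer_begin() */
static __thread DeferredCall deferred[64];  /* a real version would grow */
static __thread size_t num_deferred;

void defer_begin(void)
{
    defer_depth++;
}

void defer_call(DeferFn fn, void *opaque)
{
    if (defer_depth == 0) {
        fn(opaque);                     /* outside a section: call now */
        return;
    }
    for (size_t i = 0; i < num_deferred; i++) {
        if (deferred[i].fn == fn && deferred[i].opaque == opaque) {
            return;                     /* duplicate: coalesce */
        }
    }
    deferred[num_deferred++] = (DeferredCall){ fn, opaque };
}

void defer_end(void)
{
    if (--defer_depth == 0) {
        for (size_t i = 0; i < num_deferred; i++) {
            deferred[i].fn(deferred[i].opaque);
        }
        num_deferred = 0;
    }
}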

> 
> > 
> > Behavior is unchanged for emulated devices that do not use blk_io_plug()
> > since blk_io_plug_call() immediately invokes the callback when called
> > outside a blk_io_plug()/blk_io_unplug() region.
> > 
> > fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
> > single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
> > be noise. Detailed performance data and configuration specifics are
> > available here:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
> > 
> > This duplicates the BH that virtio-blk uses for batching. The next
> > commit will remove it.
> 
> I'm likely missing something, but could you explain why it is safe to
> batch unconditionally here?  The current BH code, as you mentioned in
> the second patch, is only batching if EVENT_IDX is not set.
> Maybe worth adding a few words in the commit message for people like
> me, who are a bit rusty on QEMU/virtio internals. :)

The BH is conditional on EVENT_IDX not for correctness/safety, but for
performance. Commit 12c1c7d7cefb4dbffeb5712e75a33e4692f0a76b
("virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is
present") argued that VIRTIO's EVENT_IDX feature already provides
batching so the BH is not needed in that case.

Actually I'm not sure exactly why deferring virtio_notify_irqfd() helps,
but IOPS improves on my machine. One theory:

Linux drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_split() (or the
end of the irq handler function) updates EVENT_IDX so that QEMU will
inject an interrupt again when the next request is completed. If the
vCPU and QEMU are running in parallel, then N requests could result in N
interrupts when the vCPU updates EVENT_IDX before QEMU completes the
next request. Maybe that's why I see better performance when deferring
virtio_notify_irqfd().
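
The notification decision is the standard VIRTIO event-index check (the
same formula as vring_need_event() in the Linux and QEMU sources).  A
standalone illustration of the theory above, with made-up index values:

#include <stdint.h>
#include <stdio.h>

/* Notify if the new used index has passed the guest's event index. */
static int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                            uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

int main(void)
{
    /* If QEMU completes 4 requests one at a time while the vCPU keeps
     * updating EVENT_IDX in step, every completion crosses the event
     * index: 4 requests, 4 eventfd writes. */
    uint16_t old_idx = 0;
    for (uint16_t new_idx = 1; new_idx <= 4; new_idx++) {
        uint16_t event_idx = new_idx - 1;
        printf("one-by-one: used=%u notify=%d\n", new_idx,
               vring_need_event(event_idx, new_idx, old_idx));
        old_idx = new_idx;
    }

    /* If all 4 completions are marked used before one deferred
     * notification, the check runs once for the whole batch. */
    printf("batched: notify=%d (one eventfd write for 4 requests)\n",
           vring_need_event(0, 4, 0));
    return 0;
}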

(I'm thinking of experimenting with rate limiting irq injection so that
it's possible to batch even more efficiently, but I worry that it will
be hard to avoid latency vs throughput regressions.)

Stefan

> 
> Best regards, Ilya Maximets.
> 
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  block/io_uring.c   |  6 ++++++
> >  block/linux-aio.c  |  4 ++++
> >  hw/virtio/virtio.c | 10 +++++++++-
> >  3 files changed, 19 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/io_uring.c b/block/io_uring.c
> > index 69d9820928..749cf83934 100644
> > --- a/block/io_uring.c
> > +++ b/block/io_uring.c
> > @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
> >  {
> >      struct io_uring_cqe *cqes;
> >      int total_bytes;
> > +
> > +    blk_io_plug();
> > +
> >      /*
> >       * Request completion callbacks can run the nested event loop.
> >       * Schedule ourselves so the nested event loop will "see" remaining
> > @@ -216,7 +219,10 @@ end:
> >              aio_co_wake(luringcb->co);
> >          }
> >      }
> > +
> >      qemu_bh_cancel(s->completion_bh);
> > +
> > +    blk_io_unplug();
> >  }
> >  
> >  static int ioq_submit(LuringState *s)
> > diff --git a/block/linux-aio.c b/block/linux-aio.c
> > index 561c71a9ae..cef3d6b1c7 100644
> > --- a/block/linux-aio.c
> > +++ b/block/linux-aio.c
> > @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >  {
> >      struct io_event *events;
> >  
> > +    blk_io_plug();
> > +
> >      /* Reschedule so nested event loops see currently pending completions */
> >      qemu_bh_schedule(s->completion_bh);
> >  
> > @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >       * own `for` loop.  If we are the last all counters droped to zero. */
> >      s->event_max = 0;
> >      s->event_idx = 0;
> > +
> > +    blk_io_unplug();
> >  }
> >  
> >  static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
> > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > index 309038fd46..a691e8526b 100644
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -28,6 +28,7 @@
> >  #include "hw/virtio/virtio-bus.h"
> >  #include "hw/qdev-properties.h"
> >  #include "hw/virtio/virtio-access.h"
> > +#include "sysemu/block-backend.h"
> >  #include "sysemu/dma.h"
> >  #include "sysemu/runstate.h"
> >  #include "virtio-qmp.h"
> > @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
> >      }
> >  }
> >  
> > +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
> > +static void virtio_notify_irqfd_unplug_fn(void *opaque)
> > +{
> > +    EventNotifier *notifier = opaque;
> > +    event_notifier_set(notifier);
> > +}
> > +
> >  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >  {
> >      WITH_RCU_READ_LOCK_GUARD() {
> > @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >       * to an atomic operation.
> >       */
> >      virtio_set_isr(vq->vdev, 0x1);
> > -    event_notifier_set(&vq->guest_notifier);
> > +    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
> >  }
> >  
> >  static void virtio_irq(VirtQueue *vq)
> 
