
Re: [Qemu-devel] Linux kernel polling for QEMU


From: Christian Borntraeger
Subject: Re: [Qemu-devel] Linux kernel polling for QEMU
Date: Tue, 29 Nov 2016 09:19:22 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

On 11/24/2016 04:12 PM, Stefan Hajnoczi wrote:
> I looked through the socket SO_BUSY_POLL and blk_mq poll support in
> recent Linux kernels with an eye towards integrating the ongoing QEMU
> polling work.  The main missing feature is eventfd polling support which
> I describe below.
> 
> Background
> ----------
> We're experimenting with polling in QEMU so I wondered if there are
> advantages to having the kernel do polling instead of userspace.
> 
> One such advantage has been pointed out by Christian Borntraeger and
> Paolo Bonzini: a userspace thread spins blindly without knowing when it
> is hogging a CPU that other tasks need.  The kernel knows when other
> tasks need to run and can skip polling in that case.
> 
> Power management might also benefit if the kernel was aware of polling
> activity on the system.  That way polling can be controlled by the
> system administrator in a single place.  Perhaps smarter power saving
> choices can also be made by the kernel.
> 
> Another advantage is that the kernel can poll hardware rings (e.g. NIC
> rx rings) whereas QEMU can only poll its own virtual memory (including
> guest RAM).  That means the kernel can bypass interrupts for devices
> that are using kernel drivers.
> 
> State of polling in Linux
> -------------------------
> SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
> calls to spin awaiting new receive packets.  From what I can tell epoll
> is not supported so that system call will sleep without polling.
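> 
> (For reference, the per-socket knob is set roughly like this; the 50 is
> just an example budget in microseconds, setting it may need CAP_NET_ADMIN,
> and for poll(2)/select(2) the net.core.busy_poll sysctl also has to be
> non-zero as far as I can tell:)
> 
>   unsigned int usecs = 50;   /* example busy-poll budget in microseconds */
>   if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
>       perror("setsockopt(SO_BUSY_POLL)");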
> 
> blk_mq poll is mainly supported by NVMe.  It is only available with
> synchronous direct I/O.  select(2), poll(2), epoll, and Linux AIO are
> therefore not integrated.  It would be nice to extend the code so a
> process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
> or epoll will poll.
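> 
> (The path that can reach the poll code today is a synchronous O_DIRECT
> read or write, roughly as below; the device name and 4k size are
> placeholders, and the queue's io_poll sysfs attribute has to be enabled:)
> 
>   void *buf;
>   int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* placeholder device */
> 
>   if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
>       abort();
>   /* The submitting task may spin in the block layer's poll loop instead
>    * of sleeping until the completion interrupt. */
>   if (pread(fd, buf, 4096, 0) != 4096)
>       perror("pread");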
> 
> QEMU and KVM-specific polling
> -----------------------------
> There are a few QEMU/KVM-specific items that require polling support:
> 
> QEMU's event loop aio_notify() mechanism wakes up the event loop from a
> blocking poll(2) or epoll call.  It is used when another thread adds or
> changes an event loop resource (such as scheduling a BH).  There is a
> userspace memory location (ctx->notified) that is written by
> aio_notify() as well as an eventfd that can be signalled.
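> 
> (The pattern in miniature, not QEMU's actual code: the waker sets a flag
> in shared memory and then kicks the eventfd; 'notified' below stands in
> for ctx->notified:)
> 
>   static uint32_t notified;            /* stand-in for ctx->notified */
> 
>   static void notify(int event_fd)
>   {
>       uint64_t one = 1;
> 
>       __atomic_store_n(&notified, 1, __ATOMIC_SEQ_CST);
>       if (write(event_fd, &one, sizeof(one)) < 0)  /* wake a blocked poll/epoll */
>           perror("write(eventfd)");
>   }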
> 
> kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses.  Virtio
> devices use ioeventfd as a doorbell after new requests have been placed
> in a virtqueue, which is a descriptor ring in userspace memory.
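> 
> (For reference, the registration side looks roughly like this; vm_fd is
> the descriptor from KVM_CREATE_VM, doorbell_eventfd comes from eventfd(2),
> and the address/length are placeholders for a virtqueue notify register:)
> 
>   struct kvm_ioeventfd ioev = {
>       .addr  = 0xfe003000,            /* placeholder doorbell address */
>       .len   = 2,                     /* example: 16-bit notify write */
>       .fd    = doorbell_eventfd,      /* signalled on guest writes */
>       .flags = 0,                     /* MMIO, no datamatch */
>   };
>   if (ioctl(vm_fd, KVM_IOEVENTFD, &ioev) < 0)
>       perror("KVM_IOEVENTFD");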
> 
> Eventfd polling support could look like this:
> 
>   struct eventfd_poll_info poll_info = {
>       .addr = ...memory location...,
>       .size = sizeof(uint32_t),
>       .op   = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
>       .val  = ...last value...,
>   };
>   ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);
> 
> In the kernel, eventfd stashes this information and eventfd_poll()
> evaluates the operation (e.g. not equal, bitwise and, etc) to detect
> progress.
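> 
> For illustration, the check could look something like this (everything
> here, including the field and constant names, is part of the proposal and
> does not exist yet; 32-bit case only, error handling omitted):
> 
>   static bool eventfd_poll_progress(struct eventfd_ctx *ctx)
>   {
>       u32 cur;
> 
>       if (get_user(cur, (u32 __user *)ctx->poll_info.addr))
>           return false;
> 
>       switch (ctx->poll_info.op) {
>       case EVENTFD_POLL_OP_NOT_EQUAL:
>           return cur != ctx->poll_info.val;
>       case EVENTFD_POLL_OP_AND:
>           return cur & ctx->poll_info.val;
>       default:
>           return false;
>       }
>   }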
> 
> Note that this eventfd polling mechanism doesn't actually poll the
> eventfd counter value.  It's useful for situations where the eventfd is
> a doorbell/notification that some object in userspace memory has been
> updated.  So it polls that userspace memory location directly.
> 
> This new eventfd feature also provides a poor man's Linux AIO polling
> support: set the Linux AIO shared ring index as the eventfd polling
> memory location.  This is not as good as true Linux AIO polling support
> where the kernel polls the NVMe, virtio_blk, etc ring since we'd still
> rely on an interrupt to complete I/O requests.
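> 
> Concretely, that could mean pointing the proposed ioctl at the ring's tail
> index, along these lines (the aio_ring header below mirrors fs/aio.c and
> is not a stable ABI, and aio_eventfd stands for the completion eventfd
> registered with the io_context, so treat it as illustration only):
> 
>   struct aio_ring {                   /* mapped by io_setup() */
>       unsigned id, nr, head, tail;
>       unsigned magic, compat_features, incompat_features, header_length;
>       /* struct io_event entries follow */
>   };
> 
>   struct aio_ring *ring = (struct aio_ring *)io_ctx;   /* io_context_t */
>   struct eventfd_poll_info poll_info = {
>       .addr = &ring->tail,            /* advances on completion */
>       .size = sizeof(ring->tail),
>       .op   = EVENTFD_POLL_OP_NOT_EQUAL,
>       .val  = ring->tail,             /* last observed tail */
>   };
>   ioctl(aio_eventfd, EVENTFD_SET_POLL, &poll_info);     /* proposed ioctl */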
> 
> Thoughts?

Would be an interesting exercise, but we should really try to avoid making
the iothreads more costly. When I look at some of our measurements, I/O-wise
we are slightly behind z/VM; we can be tuned to a similar level, but we use
more host CPUs on s390 for the same throughput.

So I have two concerns, and both are related to overhead.
a: I am able to get higher bandwidth and lower host CPU utilization when
running fio for multiple disks if I pin the iothreads to a subset of the
host CPUs (there is a sweet spot). Is the polling maybe just nudging the
scheduler into the same behaviour, by keeping the iothread from doing
sleep/wakeup all the time?
b: What about contention with other guests on the host? What worries me a
bit is that most performance measurements and tunings are done for
workloads without such contention. We (including myself) do our
microbenchmarks (or fio runs) with just one guest and are happy if we see
an improvement. But does that reflect real usage? For example, have you
ever measured the AIO polling with 10 guests or so?
My gut feeling (and obviously I have not done proper measurements myself) is
that we want to stop polling as soon as there is contention.

As you outlined, we already have something in place in the kernel to stop
polling.

Interestingly enough, for SO_BUSY_POLL the network code seems to consider
    !need_resched() && !signal_pending(current)
for stopping the poll, which allows the poll to consume your whole time
slice. KVM instead uses single_task_running() for halt polling
(halt_poll_ns). This means that KVM yields much more aggressively, which is
probably the right thing to do for opportunistic spinning.
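
To make the contrast concrete, roughly (simplified, not the literal kernel
loops; the *_expired() and poll_*() helpers are just stand-ins):

    /* sk_busy_loop()-style stop condition: keep polling until the busy-poll
     * budget expires, we need to reschedule, or a signal is pending -- so a
     * full time slice can be burned. */
    while (!busy_poll_budget_expired() &&
           !need_resched() && !signal_pending(current))
        poll_nic_rx_ring();             /* stand-in for the driver poll hook */

    /* KVM halt polling: additionally give up as soon as any other task is
     * runnable on this CPU. */
    while (!halt_poll_ns_expired() && single_task_running())
        poll_for_guest_wakeup();        /* stand-in for the wakeup check */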

Another thing to consider: in the kernel we already have other opportunistic
spinners, and we are in the process of making things less aggressive because
they caused real issues. For example, search for the vcpu_is_preempted patch
set, which by the way showed another issue: when running nested you do not
only want to consider your own load, but also the load of the hypervisor.

Christian



