From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Wed, 6 Aug 2014 10:48:55 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 06.08.2014 at 07:33, Ming Lei wrote:
> Hi Kevin,
> 
> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <address@hidden> wrote:
> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
> >> I have been wondering how to prove that the root cause is the ucontext
> >> coroutine mechanism (stack switching).  Here is an idea:
> >>
> >> Hack your "bypass" code path to run the request inside a coroutine.
> >> That way you can compare "bypass without coroutine" against "bypass with
> >> coroutine".
> >>
> >> Right now I think there are doubts because the bypass code path is
> >> indeed a different (and not 100% correct) code path.  So this approach
> >> might prove that the coroutines are adding the overhead and not
> >> something that you bypassed.
> >
> > My doubts aren't only that the overhead might not come from the
> > coroutines, but also whether any coroutine-related overhead is really
> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> > just that instead of introducing additional code paths.
> 
> OK, thank you for taking a look at the problem; I hope we can
> figure out the root cause. :-)
> 
> >
> > Another thought I had was this: If the performance difference is indeed
> > caused only by coroutines, then that is completely inside the block layer
> > and we don't actually need a VM to test it. We could instead have
> > something like a simple qemu-img based benchmark and should observe the
> > same effect.
> 
> Indeed, it is simpler to run a coroutine-only benchmark, so I just
> wrote a rough one, and it looks like coroutines do decrease performance
> a lot. Please see the attached patch, and thanks for your template,
> which helped me add the 'co_bench' command to qemu-img.

Yes, we can look at coroutine microbenchmarks in isolation. I actually
did that yesterday with the yield test from tests/test-coroutine.c.
And in fact profiling immediately showed something to optimise:
pthread_getspecific() was quite high in the profile; replacing it with
__thread on systems where it works is more efficient and helped the
numbers a bit. Also, a lot of time seems to be spent in
pthread_mutex_lock/unlock (even in qemu-img bench), so maybe something
can be done there as well.

However, I just wasn't sure whether a change on this level would be
relevant in a realistic environment. This is the reason why I wanted to
get a benchmark involving the block layer and some I/O.

> From the profiling data at the link below:
> 
>     http://pastebin.com/YwH2uwbq
> 
> With coroutines, the running time for the same workload increases by
> ~50% (1.325s vs. 0.903s), dcache load events increase by ~35%
> (693M vs. 512M), and instructions per cycle drop from 1.63 to 1.35,
> compared with bypassing coroutines (-b parameter).
> 
> The bypass code in the benchmark is very similar to the approach
> used in the bypass patch, since linux-aio with O_DIRECT seldom
> blocks in the kernel I/O path.
> 
> Maybe the benchmark is a bit extreme, but modern storage devices
> may reach millions of IOPS, so it is very easy for coroutines to
> slow down the I/O.

I think in order to optimise coroutines, such benchmarks are fair game.
It's just not guaranteed that the effects are exactly the same on real
workloads, so we should take the results with a grain of salt.

Anyhow, the coroutine version of your benchmark is buggy: it leaks all
coroutines instead of exiting them, so it can't make any use of the
coroutine pool. On my laptop, I get this (where "fixed coro" is a
version that simply removes the yield at the end):

                | bypass        | fixed coro    | buggy coro
----------------+---------------+---------------+--------------
time            | 1.09s         | 1.10s         | 1.62s
L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
insns per cycle | 2.39          | 2.39          | 1.90

This begs the question whether you see a similar effect on a real qemu
because the coroutine pool is still not big enough. With correct use of
coroutines, the difference seems to be barely measurable even without
any I/O involved.
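To make the leak concrete, here is a sketch of the two coroutine bodies
(illustrative names, not your actual patch; it assumes the coroutine API
where the opaque pointer is passed to qemu_coroutine_enter()):

    #include "block/coroutine.h"   /* header location depends on the tree */

    /* Buggy variant: the final yield leaves the coroutine suspended
     * forever.  Nobody re-enters it, so it never terminates and is never
     * returned to the coroutine pool. */
    static void coroutine_fn co_entry_buggy(void *opaque)
    {
        unsigned long *done = opaque;

        (*done)++;
        qemu_coroutine_yield();   /* leaked: never entered again */
    }

    /* Fixed variant: returning from the entry function terminates the
     * coroutine, which lets it be recycled through the pool on the next
     * qemu_coroutine_create() call. */
    static void coroutine_fn co_entry_fixed(void *opaque)
    {
        unsigned long *done = opaque;

        (*done)++;
    }

    /* Schematic benchmark loop. */
    static void run_bench(unsigned long iterations)
    {
        unsigned long done = 0;

        while (done < iterations) {
            Coroutine *co = qemu_coroutine_create(co_entry_fixed);
            qemu_coroutine_enter(co, &done);
        }
    }

The buggy variant ends up allocating a brand new coroutine (and stack)
on every iteration, which is exactly the cost the pool is meant to
avoid.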

> > I played a bit with the following; I hope it's not too naive. I couldn't
> > see a difference with your patches, but at least one reason for this is
> > probably that my laptop SSD isn't fast enough to make the CPU the
> > bottleneck. I haven't tried a ramdisk yet; that would probably be the
> > next thing. (I actually wrote the patch up just for some profiling on my
> > own, not for comparing throughput, but it should be usable for that as
> > well.)
> 
> This might not be good for the test, since it is basically a sequential
> read test, which can be optimized a lot by the kernel. I always use a
> randread benchmark.

Yes, I briefly pondered whether I should implement random offsets
instead. But then I realised that a quicker kernel operation would only
help the benchmark, because we want it to test the CPU consumption in
userspace. So the faster the kernel gets, the better for us, because it
should make the impact of coroutines bigger.
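For the record, if random offsets were wanted in such a benchmark, it
would be a small change along these lines (just a sketch, not part of
my patch):

    #include <stdint.h>
    #include <stdlib.h>

    /* Pick a random, block-aligned offset inside the image.  Keeping the
     * offset aligned to the block size matters if the file is opened
     * with O_DIRECT. */
    static int64_t random_offset(int64_t image_size, int block_size)
    {
        int64_t num_blocks = image_size / block_size;

        return (rand() % num_blocks) * block_size;
    }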

Kevin


