Re: [Qemu-devel] [PATCH v6 73/73] cputlb: queue async flush jobs without
From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH v6 73/73] cputlb: queue async flush jobs without the BQL
Date: Wed, 20 Feb 2019 09:18:33 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

On 1/29/19 4:48 PM, Emilio G. Cota wrote:
> This yields sizable scalability improvements, as the below results show.
>
> Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)
>
> Workload: Ubuntu 18.04 ppc64 compiling the linux kernel with
> "make -j N", where N is the number of cores in the guest.
>
> Speedup vs a single thread (higher is better):
>
> [Plot: speedup vs. a single thread (y axis, 0 to 14) against guest vCPUs
> (x axis, 1 to 28); series: +cputlb-no-bql (A), +per-cpu-lock (D),
> baseline (B).]
> png: https://imgur.com/zZRvS7q
>
> Some notes:
> - baseline corresponds to the commit before this series
>
> - per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.
>
> - cputlb-no-bql is this commit.
>
> - I'm using taskset to assign cores to threads, favouring locality whenever
>   possible but not using SMT. When N=1, I'm using a single host core, which
>   leads to superlinear speedups (since with more cores the I/O thread can
>   execute while vCPU threads sleep). In the future I might use N+1 host cores
>   for N guest cores to avoid this, or perhaps pin guest threads to cores
>   one-by-one.
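
The pinning scheme described in the note above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and it assumes the host numbers its physical cores 0..27 with socket 0 holding cores 0..13 and SMT siblings numbered outside that range (real numbering varies; check lscpu). The extra_io_core flag models the "N+1 host cores for N guest cores" idea.

```python
def pick_host_cores(n_guest, cores_per_socket=14, sockets=2, extra_io_core=False):
    """Return host core IDs for n_guest vCPU threads.

    Hypothetical helper: fills socket 0 before socket 1 (locality first) and
    never hands out SMT siblings, assuming physical cores are numbered
    0 .. cores_per_socket * sockets - 1.
    """
    if n_guest < 1:
        raise ValueError("need at least one guest vCPU")
    total = cores_per_socket * sockets
    # Optionally reserve one more host core so the I/O thread does not
    # compete with the vCPU threads.
    wanted = n_guest + (1 if extra_io_core else 0)
    if wanted > total:
        raise ValueError("not enough physical host cores")
    return list(range(wanted))


def taskset_arg(cores):
    """Format a core list as the comma-separated list taken by 'taskset -c'."""
    return ",".join(str(c) for c in cores)


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"N={n}: taskset -c {taskset_arg(pick_host_cores(n))}")
```

With extra_io_core=True and N=1 this yields two host cores rather than one, which is the case the note says would avoid the superlinear speedups.
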
>
> Single-threaded performance is affected very lightly. Results
> below for debian aarch64 bootup+test for the entire series
> on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:
>
> - Before:
>
> Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
>
>       7269.033478      task-clock (msec)   #    0.998 CPUs utilized      ( +-  0.06% )
>    30,659,870,302      cycles              #    4.218 GHz                ( +-  0.06% )
>    54,790,540,051      instructions        #    1.79  insns per cycle    ( +-  0.05% )
>     9,796,441,380      branches            # 1347.695 M/sec              ( +-  0.05% )
>       165,132,201      branch-misses       #    1.69% of all branches    ( +-  0.12% )
>
>       7.287011656 seconds time elapsed                                   ( +-  0.10% )
>
> - After:
>
>       7375.924053      task-clock (msec)   #    0.998 CPUs utilized      ( +-  0.13% )
>    31,107,548,846      cycles              #    4.217 GHz                ( +-  0.12% )
>    55,355,668,947      instructions        #    1.78  insns per cycle    ( +-  0.05% )
>     9,929,917,664      branches            # 1346.261 M/sec              ( +-  0.04% )
>       166,547,442      branch-misses       #    1.68% of all branches    ( +-  0.09% )
>
>       7.389068145 seconds time elapsed                                   ( +-  0.13% )
>
> That is, a 1.37% slowdown.
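
For reference, the relative change between the two perf runs quoted above can be recomputed with a few lines of Python. This is a minimal sketch: the dictionary and function names are illustrative, the values are copied from the perf output, and since the message does not say which counter the quoted percentage is derived from, the sketch reports several of them.

```python
# Counter values copied from the "Before" and "After" perf stat output.
before = {
    "task-clock (msec)": 7269.033478,
    "cycles": 30_659_870_302,
    "instructions": 54_790_540_051,
    "seconds elapsed": 7.287011656,
}
after = {
    "task-clock (msec)": 7375.924053,
    "cycles": 31_107_548_846,
    "instructions": 55_355_668_947,
    "seconds elapsed": 7.389068145,
}


def relative_change_pct(old, new):
    """Percentage change of new relative to old (positive means an increase)."""
    return (new - old) / old * 100.0


# One entry per counter, e.g. changes["cycles"].
changes = {name: relative_change_pct(before[name], after[name]) for name in before}

if __name__ == "__main__":
    for name, pct in changes.items():
        print(f"{name:>20}: {pct:+.2f}%")
```

All four counters grow by somewhere between 1% and 1.5%, which is in line with the small single-threaded slowdown reported.
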
>
> Signed-off-by: Emilio G. Cota <address@hidden>
> ---
> accel/tcg/cputlb.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
Reviewed-by: Richard Henderson <address@hidden>
r~