Re: Suggestions for TCG performance improvements


From: Vasilev Oleg
Subject: Re: Suggestions for TCG performance improvements
Date: Fri, 3 Dec 2021 16:21:08 +0000

On 12/2/2021 7:02 PM, Alex Bennée wrote:

> Vasilev Oleg <vasilev.oleg@huawei.com> writes:
>
>> I've discovered some MMU-related suggestions in a 2018 email[2], and
>> those still seem to be unimplemented (the flush still uses memset[3]).
>> Do you think we should go forward with implementing those?
> I doubt you can do better than memset, which should be the most optimised
> memory clear for the platform. We could consider a separate thread to
> proactively allocate and clear new TLBs so we don't have to do it at
> flush time. However we wouldn't have complete information about what
> size we want the new table to be.
>
> When a TLB flush is performed it could be that the majority of the old
> table is still perfectly valid. 

In that case, do you think it would be possible, instead of flushing the TLB,
to store it somewhere and bring it back when the address space switches back?

> However we would need a reliable mechanism to work out which entries in the 
> table could be kept. 

We could invalidate entries in those stored TLBs the same way we invalidate
the active TLB. And if we are going to have a new thread to manage TLB
allocation, the invalidation work could be offloaded to it as well.
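
Roughly what I have in mind, as a sketch only (all names are invented, and
this ignores locking and the real CPUTLB layout):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TLB_SIZE  256
#define PAGE_MASK (~0xfffULL)

typedef struct TLBEntry {
    uint64_t vaddr_tag;          /* page-aligned tag, (uint64_t)-1 if empty */
    uint64_t paddr;
} TLBEntry;

typedef struct ParkedTLB {
    uint64_t asid;               /* guest address-space identifier */
    TLBEntry table[TLB_SIZE];
    struct ParkedTLB *next;
} ParkedTLB;

static ParkedTLB *parked;        /* would need a lock, or a worker thread */

/* On a context switch, stash the active table instead of memset()ing it. */
static void tlb_park(uint64_t asid, const TLBEntry *active)
{
    ParkedTLB *p = malloc(sizeof(*p));
    p->asid = asid;
    memcpy(p->table, active, sizeof(p->table));
    p->next = parked;
    parked = p;
}

/* On switch-back, restore the parked table; returns 0 if none was found. */
static int tlb_unpark(uint64_t asid, TLBEntry *active)
{
    for (ParkedTLB **pp = &parked; *pp; pp = &(*pp)->next) {
        if ((*pp)->asid == asid) {
            ParkedTLB *p = *pp;
            memcpy(active, p->table, sizeof(p->table));
            *pp = p->next;
            free(p);
            return 1;
        }
    }
    return 0;
}

/* Page invalidation also has to walk the parked copies; this is the part
 * a worker thread could take off the vCPU's critical path. */
static void tlb_invalidate_page_parked(uint64_t vaddr)
{
    for (ParkedTLB *p = parked; p; p = p->next) {
        for (int i = 0; i < TLB_SIZE; i++) {
            if (p->table[i].vaddr_tag == (vaddr & PAGE_MASK)) {
                p->table[i].vaddr_tag = (uint64_t)-1;
            }
        }
    }
}

The open question is whether walking the parked copies on every invalidation
stays cheaper than simply refilling a freshly flushed table.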

> I did ponder a debug mode which would keep the last N tables dropped by
> tlb_mmu_resize_locked and then measure the differences in the entries
> before submitting the free to an RCU task.
>> The mentioned paper[4] also describes other possible improvements.
>> Some of those are already implemented (such as victim TLB and dynamic
>> size for TLB), but others are not (e.g. TLB lookup uninlining and a
>> set-associative TLB layer). Do you think those improvements are
>> worth trying?
> Anything is worth trying but you would need hard numbers. Also it's all
> too easy to target microbenchmarks which might not show much difference
> in real-world use.

The mentioned paper presents some benchmarks, e.g. Linux kernel compilation
among other workloads. Do you think those numbers shouldn't be trusted?
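
For concreteness, the set-associative layer the paper describes could sit in
front of the existing direct-mapped table, roughly like this (a hypothetical
sketch with invented names, not actual QEMU code):

#include <stdint.h>
#include <stddef.h>

#define TLB_WAYS  4
#define PAGE_MASK (~0xfffULL)

typedef struct AssocEntry {
    uint64_t  vaddr_tag;         /* page-aligned guest vaddr, -1 if empty */
    uintptr_t addend;            /* host address = guest vaddr + addend */
} AssocEntry;

typedef struct AssocSet {
    AssocEntry way[TLB_WAYS];
    unsigned   next_victim;      /* simple round-robin replacement */
} AssocSet;

/* Probe every way of one set; a miss falls through to the slow path. */
static inline AssocEntry *assoc_lookup(AssocSet *set, uint64_t vaddr)
{
    uint64_t tag = vaddr & PAGE_MASK;
    for (int i = 0; i < TLB_WAYS; i++) {
        if (set->way[i].vaddr_tag == tag) {
            return &set->way[i];
        }
    }
    return NULL;
}

/* On refill, evict the round-robin victim of the indexed set. */
static inline void assoc_insert(AssocSet *set, uint64_t vaddr, uintptr_t addend)
{
    AssocEntry *e = &set->way[set->next_victim];
    set->next_victim = (set->next_victim + 1) % TLB_WAYS;
    e->vaddr_tag = vaddr & PAGE_MASK;
    e->addend = addend;
}

Whether the extra compares per lookup pay for the reduced conflict misses is
exactly what the benchmarks would need to show.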

> The best thing you can do at the moment is give the
> guest plenty of RAM so page updates are limited because the guest OS
> doesn't have to swap RAM around.
>
> Another optimisation would be looking at bigger page sizes. For example
> the kernel (in a Linux setup) usually has a contiguous flat map for
> kernel space. If we could represent that at a larger granularity then
> not only could we make the page lookup tighter for kernel mode we could
> also achieve things like cross-page TB chaining for kernel functions.

Do I understand correctly that softmmu currently doesn't treat hugepages
specially, and you are suggesting we add such support so that a large
contiguous region occupies fewer TLB entries? That would probably make the
TLB lookup quite a bit more complex.
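
If I read it right, the per-entry cost could be as small as one extra mask
load; here is a sketch of what I mean (invented names, and I realise the real
fast path is generated TCG code, not C):

#include <stdint.h>

typedef struct TLBEntry {
    uint64_t  vaddr_tag;   /* virtual address with the in-page bits cleared */
    uint64_t  addr_mask;   /* ~0xfffULL for 4K pages, ~0x1fffffULL for 2M */
    uintptr_t addend;      /* host address = guest address + addend */
} TLBEntry;

/* One extra load and AND compared to a fixed-page-size comparison. */
static inline int tlb_hit_large(const TLBEntry *e, uint64_t vaddr)
{
    return (vaddr & e->addr_mask) == e->vaddr_tag;
}

The harder part is presumably indexing: with mixed page sizes, a 2M mapping
can't live in a single 4K-indexed slot, so either every 4K slot it covers
gets a copy or the lookup needs a second probe.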

>> Another idea for decreasing the occurrence of TLB refills is to make the TB
>> key in the htable independent of the physical address. I assume the physical
>> address is only needed to distinguish different processes whose VAs can be
>> the same. Is that assumption correct?

What do you think about this one? Could we replace the physical address in the
TB htable key with some sort of address-space identifier?
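
Something along these lines, purely hypothetical (invented names; my
understanding is that today's physical key is also what lets writes to a
physical page find and invalidate its TBs, so an ASID-keyed table would need
another answer for that):

#include <stdint.h>

typedef struct TBKey {
    uint64_t pc;        /* guest virtual PC */
    uint64_t asid;      /* address-space identifier, e.g. a TTBR/CR3 tag */
    uint32_t flags;     /* CPU state flags that affect code generation */
} TBKey;

static inline uint32_t tb_key_hash(const TBKey *k)
{
    /* any good mixing function would do; this one is murmur-style */
    uint64_t h = k->pc ^ (k->asid * 0x9e3779b97f4a7c15ULL) ^ k->flags;
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return (uint32_t)h;
}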

>> Do you have any other ideas which parts of TCG could require our
>> attention w.r.t the flamegraph I attached?
> It's been done before, though not via upstream patches, but improving code
> generation for hot loops would be a potential performance win.

I am not sure optimizing the code generation itself would help much, at least
in our case. The flamegraph I attached to the previous email shows that QEMU
spends only about 10% of its time in generated code. The rest goes to helpers,
searching for the next block, TLB-related work, and so on.

> That would require some changes to the translation model to allow for
> multiple exit points and probably introducing a new code generator
> (gccjit or llvm) to generate highly optimised code.

This, however, could bring a substantial performance gain: translation blocks
would become bigger, and we would spend less time searching for the next
block.
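
For reference, the libgccjit surface is quite small; its tutorial hello-world
JIT-compiles and calls a square() function like this (nothing QEMU-specific
here, just the library API):

#include <libgccjit.h>
#include <stdio.h>

int main(void)
{
    gcc_jit_context *ctxt = gcc_jit_context_acquire();
    gcc_jit_context_set_int_option(ctxt,
        GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL, 3);

    gcc_jit_type *int_type = gcc_jit_context_get_type(ctxt, GCC_JIT_TYPE_INT);
    gcc_jit_param *x = gcc_jit_context_new_param(ctxt, NULL, int_type, "x");
    gcc_jit_function *fn = gcc_jit_context_new_function(
        ctxt, NULL, GCC_JIT_FUNCTION_EXPORTED, int_type, "square", 1, &x, 0);

    /* square(x) { return x * x; } */
    gcc_jit_block *block = gcc_jit_function_new_block(fn, NULL);
    gcc_jit_rvalue *xx = gcc_jit_context_new_binary_op(
        ctxt, NULL, GCC_JIT_BINARY_OP_MULT, int_type,
        gcc_jit_param_as_rvalue(x), gcc_jit_param_as_rvalue(x));
    gcc_jit_block_end_with_return(block, NULL, xx);

    gcc_jit_result *result = gcc_jit_context_compile(ctxt);
    int (*square)(int) =
        (int (*)(int))gcc_jit_result_get_code(result, "square");
    printf("square(5) = %d\n", square(5));

    gcc_jit_result_release(result);
    gcc_jit_context_release(ctxt);
    return 0;
}

The real work would be lowering TCG ops onto such an API while keeping
compilation latency acceptable for a JIT.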

>> I am also CCing my teammates. We are eager to improve the QEMU TCG
>> performance for our needs and to contribute our patches to upstream.
> Do you have any particular goal in mind or just "better"? The current
> MTTCG scaling tends to drop off as we go above 10-12 vCPUs due to the
> cost of synchronous flushing across all those vCPUs.

We have some internal ways to measure performance, but we are looking for an
alternative metric that we could share and that you could reproduce.
Sysbench[1] in threads mode is the closest we have found so far by comparing
flamegraphs, but we are still evaluating more benchmarking software.

>> [1]: https://github.com/akopytov/sysbench
>> [2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg562103.html
>> [3]: 
>> https://github.com/qemu/qemu/blob/14d02cfbe4adaeebe7cb833a8cc71191352cf03b/accel/tcg/cputlb.c#L239
>> [4]: https://dl.acm.org/doi/pdf/10.1145/2686034
>>
>> [2. flamegraph.svg --- image/svg+xml; flamegraph.svg]...
>>
>> [3. callgraph.svg --- image/svg+xml; callgraph.svg]...
>>
Thanks,
Oleg



