Re: [Qemu-devel] ideas for improving TLB performance (help with TCG backend wanted)
Thu, 20 Sep 2018 01:19:51 +0100
mu4e 1.1.0; emacs 26.1.50
Emilio G. Cota <address@hidden> writes:
> I've been thinking about ways to increase softmmu performance
> by speeding up TLB accesses.
> Last year, Pranith proposed to increase the size of the TLBs:
> The problem with that approach is that it slows down flushes
> significantly, since they have to memset(-1) large amounts
> of memory. And flushes can be very frequent, e.g. during
> This paper quantifies this issue (with SPEC06 but also a "kernel
> boot" workload), and proposes a way to avoid it:
> "Optimizing Memory Translation Emulation in Full System Emulators"
> Xin Tong, Toshihiko Koju, and Motohiro Kawahito
> The ACM version is behind a paywall, this other one is not:
> The idea is to allocate a new TLB on a flush, thereby
> removing the need for memset at flush time (the paper assumes
> that the allocation+memset has previously been done, possibly in
> another thread).
> I like the idea of allocating a new TLB, since:
> - This will work with MTTCG; we'd reclaim the old array with RCU,
> which is OK because CPUs always execute under an RCU critical section.
> - The lookup "fast path" would take a hit due to executing an extra
> instruction, but as the paper shows the corresponding impact is
> very small compared to the benefits of having a larger TLB.
If we are going to have an indirection then we can also drop the
requirement to size every TLB identically across all the MMU indexes we
have to support. That is fairly wasteful when a bunch of them are almost
never used by most guest workloads.
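
For illustration, here is a rough sketch of what the indirection could
look like. This is not QEMU code: TLBEntry, TLBDesc and every function
name below are hypothetical, and it assumes the spare table was prepared
ahead of time and that the old one is reclaimed later (for example via
RCU, as discussed above).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))

/* Hypothetical entry layout; not QEMU's actual TLB entry. */
typedef struct {
    uint64_t addr_tag;  /* page-aligned guest address, all-ones when invalid */
    uintptr_t addend;   /* host offset for this page */
} TLBEntry;

/* The table pointer is the indirection. The per-descriptor mask means
 * each MMU index could pick its own size instead of a shared one. */
typedef struct {
    TLBEntry *table;
    uint32_t mask;      /* n_entries - 1, with n_entries a power of two */
} TLBDesc;

/* Allocate a table with every entry invalid (the memset(-1) step).
 * In the scheme from the paper this happens before the flush. */
static TLBEntry *tlb_table_new(uint32_t n_entries)
{
    TLBEntry *t = malloc(n_entries * sizeof(*t));
    assert(t != NULL);
    memset(t, 0xff, n_entries * sizeof(*t));
    return t;
}

static TLBEntry *tlb_slot(const TLBDesc *d, uint64_t vaddr)
{
    return &d->table[(vaddr >> PAGE_BITS) & d->mask];
}

static void tlb_fill(TLBDesc *d, uint64_t vaddr, uintptr_t addend)
{
    TLBEntry *e = tlb_slot(d, vaddr);
    e->addr_tag = vaddr & PAGE_MASK;
    e->addend = addend;
}

/* An all-ones tag is never page-aligned, so invalid entries never match. */
static int tlb_hit(const TLBDesc *d, uint64_t vaddr)
{
    return tlb_slot(d, vaddr)->addr_tag == (vaddr & PAGE_MASK);
}

/* Flush by swapping in a prepared table: no memset at flush time.
 * The old table is returned so the caller can reclaim it later. */
static TLBEntry *tlb_flush_swap(TLBDesc *d, TLBEntry *fresh)
{
    TLBEntry *old = d->table;
    d->table = fresh;
    return old;
}
```

The flush itself is a single pointer store; the cost moves to whoever
prepares the spare table and frees the old one.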
> An additional improvement that I have thought of is to get rid
> of memset(-1) altogether. Instead, we'd store addresses in the TLB
> as $real_address+1, so that 0xff..ff is stored as 0x00..00. That way,
> instead of malloc+memset we'd just calloc a new TLB, which
> should be much faster since we'd most likely get zeroed pages
> from mmap. The cost would be an additional instruction in the fast
> path to subtract 1 from the address in the TLB, but this extra
> instruction would be essentially free in modern CPUs.
Or just test for 0: I'm guessing pretty much any access to the null page
could always take the slow path, as it's likely to fault anyway.
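
To make the two options concrete, here is a small sketch of both
encodings on top of a calloc'ed table. Everything here (Entry, the
function names, the 256-entry size) is made up for illustration and is
not how the TCG backends generate the lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))
#define TLB_SIZE  256u  /* hypothetical size, power of two */

typedef struct {
    uint64_t tag;       /* encoding depends on the variant below */
} Entry;

/* A zero-filled table is a fully invalid TLB in both variants. */
static Entry *tlb_alloc_zeroed(void)
{
    return calloc(TLB_SIZE, sizeof(Entry));
}

static Entry *slot(Entry *t, uint64_t vaddr)
{
    return &t[(vaddr >> PAGE_BITS) & (TLB_SIZE - 1)];
}

/* Variant 1: store page address + 1, so the invalid value -1 becomes 0. */
static void set_plus_one(Entry *t, uint64_t vaddr)
{
    slot(t, vaddr)->tag = (vaddr & PAGE_MASK) + 1;
}

static int hit_plus_one(Entry *t, uint64_t vaddr)
{
    /* The extra subtract: an empty entry yields all-ones, which is
     * never page-aligned and so never matches. */
    return slot(t, vaddr)->tag - 1 == (vaddr & PAGE_MASK);
}

/* Variant 2: store the page address as-is and treat 0 as invalid.
 * Page 0 can then never hit and always goes to the slow path. */
static void set_zero_invalid(Entry *t, uint64_t vaddr)
{
    slot(t, vaddr)->tag = vaddr & PAGE_MASK;
}

static int hit_zero_invalid(Entry *t, uint64_t vaddr)
{
    uint64_t tag = slot(t, vaddr)->tag;
    return tag != 0 && tag == (vaddr & PAGE_MASK);
}
```

Variant 1 keeps page 0 cacheable at the cost of the subtract; variant 2
trades that for a compare against zero and a permanently slow null page.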
> I have looked into implementing this approach but it would take me
> a long time to get proficient enough to generate the code I want from
> the i386 TCG backend.
I think implementing the out-of-line lookup would be a good first step.
> If someone could help with that, I could take care of the rest, i.e.
> changes to C code and measuring the perf impact. If we got good
> results, we could then look into implementing this for all TCG
> BTW the paper also has other interesting ideas, for example
> "uninlining" TLB lookups, which they claim increases performance
> by 6%. I also looked into this but I fail to see how this could
> ever be maintainable, since we'd have to generate many
> subroutines, one for each combination of generation-time
> parameters that tcg_out_tlb_load takes.