
Re: [Qemu-devel] outlined TLB lookup on x86


From: Xin Tong
Subject: Re: [Qemu-devel] outlined TLB lookup on x86
Date: Wed, 27 Nov 2013 19:56:05 -0800




On Wed, Nov 27, 2013 at 6:12 PM, Richard Henderson <address@hidden> wrote:
On 11/27/2013 08:41 PM, Xin Tong wrote:
> I am trying to implement an out-of-line TLB lookup for QEMU softmmu-x86-64 on
> an x86-64 machine, potentially for better instruction cache performance. I have
> a few questions.
>
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path, the slow path refills the TLB, then loads/stores and
> jumps to the next emulated instruction. I am wondering whether it is easy to
> outline the code for the slow path.

Hard.  There's quite a bit of code on that slow path that's unique to the
surrounding code context -- which registers contain inputs and outputs, and where
to continue after the slow path.

The amount of code that's in the TB slow path now is approximately minimal, as
far as I can see.  If you've got an idea for improvement, please share.  ;-)
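
A minimal, self-contained toy model of the structure being discussed may help here. It is plain C, every name is invented, and it is not QEMU code: an inline direct-mapped TLB check whose miss case calls out to a refill path. In real TCG output that slow path is emitted per load/store site, which is exactly the context dependence described above.

#include <stdint.h>
#include <string.h>

#define TOY_PAGE_BITS 12
#define TOY_PAGE_MASK (~(((uint64_t)1 << TOY_PAGE_BITS) - 1))
#define TOY_TLB_BITS  8                     /* 1 << 8 entries, as in the thread */
#define TOY_TLB_SIZE  (1 << TOY_TLB_BITS)

typedef struct {
    uint64_t tag;       /* guest page address, all-ones when invalid      */
    intptr_t addend;    /* guest-to-host delta for RAM-backed pages       */
} ToyTLBEntry;

static ToyTLBEntry toy_tlb[TOY_TLB_SIZE];
static uint8_t     toy_ram[1 << 20];        /* pretend guest RAM          */

/* Stand-in for the slow path.  In TCG this stub is generated per access
 * site because it must know the operand size, which registers hold the
 * address and result, and where to resume afterwards.                    */
static void toy_tlb_fill(uint64_t addr)
{
    unsigned idx = (addr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1);
    toy_tlb[idx].tag    = addr & TOY_PAGE_MASK;
    toy_tlb[idx].addend = (intptr_t)toy_ram;   /* identity-map into toy_ram */
}

/* Roughly what the ~10 inline x86 instructions of the fast path compute.  */
static uint8_t toy_load_byte(uint64_t addr)
{
    unsigned idx = (addr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1);
    ToyTLBEntry *e = &toy_tlb[idx];

    if ((addr & TOY_PAGE_MASK) != e->tag) { /* compare the page tag         */
        toy_tlb_fill(addr);                 /* miss: call the out-of-line path */
    }
    return *(uint8_t *)(uintptr_t)(addr + e->addend);
}

int main(void)
{
    memset(toy_tlb, 0xff, sizeof(toy_tlb)); /* mark every entry invalid     */
    toy_ram[0x1234] = 42;
    return toy_load_byte(0x1234) == 42 ? 0 : 1;
}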


> I am thinking that when a TLB misses, the outlined TLB
> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
> ~MO_SIGN] and rewalk the TLB after it is refilled? This code is off the critical
> path, so it is not as important as the code for when the TLB hits.

That would work for true TLB misses to RAM, but does not work for memory mapped
I/O.
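
To make that point concrete, here is a sketch extending the toy model above (again, every name is invented): a generic "refill and retry" does not cover memory-mapped I/O, because an I/O page has no host addend to add; the access has to be routed to a device callback. QEMU handles this by flagging I/O pages so that they never satisfy the inline compare.

#define TOY_TLB_MMIO ((uint64_t)1)          /* flag kept in the free low tag bits */

static uint8_t toy_mmio_read(uint64_t addr) /* stand-in device read callback      */
{
    return (uint8_t)(addr & 0xff);
}

static uint8_t toy_slow_load_byte(uint64_t addr)
{
    unsigned idx = (addr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1);
    ToyTLBEntry *e = &toy_tlb[idx];

    if ((addr & TOY_PAGE_MASK) != (e->tag & TOY_PAGE_MASK)) {
        toy_tlb_fill(addr);                 /* true miss: refill, then retry      */
    }
    if (e->tag & TOY_TLB_MMIO) {
        /* The flag makes the exact compare in the fast path fail, so I/O
         * pages always end up here and are dispatched to the device.       */
        return toy_mmio_read(addr);
    }
    return *(uint8_t *)(uintptr_t)(addr + e->addend);
}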

> 2. Why not use a TLB of bigger size? Currently the TLB has 1<<8 entries. The
> TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions (I
> measured this using Intel PIN), so even if the miss rate is low (say 3%), the
> overall time spent in cpu_x86_handle_mmu_fault is still significant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance.  But I suspect that the lack of TLB contexts means that
we wind up flushing the TLB more often than real hardware does, and therefore a
larger TLB merely takes longer to flush.
 
Hardware TLBs are limited in size primarily because increasing their size also increases their access latency, but a software TLB does not suffer from that problem. So I think the size of the soft TLB should not be tied to the size of a hardware TLB.

The cost of flushing the TLB is minimal unless we have a really large TLB, e.g. one with 1M entries. I vaguely remember seeing ~8% of the time spent in cpu_x86_handle_mmu_fault in one of the SPEC CPU2006 workloads a while ago, so if we increase the size of the TLB significantly and thereby get rid of most of the TLB misses, we can recover most of that 8%. (There are still compulsory misses and a few conflict misses, but I think compulsory misses are not the major player here.)

But be aware that we can't simply make the change universally.  E.g. ARM can
use an immediate 8-bit operand during the TLB lookup, but would have to use
several insns to perform a 9-bit mask.
 
This can be handled with #ifdefs; most of the TLB code common to all CPUs need not be changed.
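
A sketch of what such a conditional could look like, assuming the existing CPU_TLB_BITS / CPU_TLB_SIZE macros; the host test and the value 12 are illustrative only, not a tested proposal:

/* Illustrative only: pick the soft-TLB size per host, since the constraint
 * mentioned above is about what the TCG backend can encode as an immediate
 * mask in the inline fast path.                                            */
#if defined(__arm__)
/* 32-bit ARM immediates are 8-bit rotated values, so keep the current size
 * to avoid extra instructions when masking the TLB index.                  */
# define CPU_TLB_BITS 8
#else
/* Hosts such as x86-64 can mask with an arbitrary 32-bit immediate, so a
 * larger direct-mapped table costs nothing extra on the inline fast path.  */
# define CPU_TLB_BITS 12
#endif
#define CPU_TLB_SIZE (1 << CPU_TLB_BITS)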

>  I am
> thinking the TLB may need to be organized in a set-associative fashion to
> reduce conflict misses, e.g. 2-way set associative to reduce the miss rate, or
> have a victim TLB that is 4-way associative and use x86 SIMD instructions to do
> the lookup once the direct-mapped TLB misses. Has anybody done any work on this
> front?

Even with SIMD, I don't believe you could make the fast-path of a set
associative lookup fast.  This is the sort of thing for which you really need
the dedicated hardware of the real TLB.  Feel free to prove me wrong with code,
of course.

I am thinking the primary TLB should remain what it is, i.e. direct mapped, but we can add a victim TLB with higher associativity. The victim TLB can be walked either sequentially or in parallel with SIMD instructions. It is going to be slower than hitting in the direct-mapped TLB, but (much) better than having to rewalk the page table.
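
A sketch of that victim-TLB idea, reusing the toy definitions from the earlier sketch (all names invented): a small fully associative buffer scanned on a primary miss, with the hit entry swapped back into the direct-mapped table so the next access takes the fast path again.

#define TOY_VTLB_SIZE 8                     /* small, fully associative       */

static ToyTLBEntry toy_vtlb[TOY_VTLB_SIZE];

/* Called on a primary-TLB miss, before falling back to the page-table walk.
 * The linear scan could also be done as one SIMD compare over the tags.     */
static int toy_victim_lookup(uint64_t addr)
{
    uint64_t page = addr & TOY_PAGE_MASK;
    unsigned idx  = (addr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1);

    for (int i = 0; i < TOY_VTLB_SIZE; i++) {
        if ((toy_vtlb[i].tag & TOY_PAGE_MASK) == page) {
            /* Promote: swap victim and primary entries so the next access
             * to this page hits the direct-mapped fast path.                */
            ToyTLBEntry tmp = toy_tlb[idx];
            toy_tlb[idx] = toy_vtlb[i];
            toy_vtlb[i]  = tmp;
            return 1;                       /* hit: no page-table walk        */
        }
    }
    return 0;                               /* miss: do the full walk         */
}

On a primary refill the evicted primary entry would be pushed into the victim buffer (round-robin or LRU), which is what makes it a victim TLB; that part is omitted from the sketch.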

For guest OSes with ASIDs, we can also have a shared TLB tagged with the ASID, and this can potentially get rid of some compulsory misses for us: e.g. multiple threads in a process share the same ASID, so on a miss we can check the shared TLB before walking the page table.
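
A sketch of that shared, ASID-tagged idea (purely illustrative; the structure, names and sizes are invented, and the page/addend fields reuse the toy definitions above): on a miss in the per-CPU tables, probe a larger table keyed by (ASID, page) before doing the full page-table walk.

#define TOY_STLB_BITS 14                    /* larger than the per-CPU TLB    */
#define TOY_STLB_SIZE (1 << TOY_STLB_BITS)

typedef struct {
    uint64_t page;      /* guest page address                                */
    uint32_t asid;      /* address-space id the mapping belongs to           */
    intptr_t addend;    /* guest-to-host delta, as in the primary TLB        */
} ToySharedEntry;

static ToySharedEntry toy_stlb[TOY_STLB_SIZE];

static int toy_shared_lookup(uint32_t asid, uint64_t addr, intptr_t *addend)
{
    uint64_t page = addr & TOY_PAGE_MASK;
    unsigned idx  = (unsigned)((page >> TOY_PAGE_BITS) ^ asid) & (TOY_STLB_SIZE - 1);
    ToySharedEntry *e = &toy_stlb[idx];

    if (e->page == page && e->asid == asid) {
        *addend = e->addend;                /* reuse the mapping: skip the walk */
        return 1;
    }
    return 0;                               /* miss: walk the page table        */
}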


r~

