
Re: [Qemu-devel] [RFC 00/20] Do away with TB retranslation


From: Aurelien Jarno
Subject: Re: [Qemu-devel] [RFC 00/20] Do away with TB retranslation
Date: Sun, 13 Sep 2015 23:00:53 +0200
User-agent: Mutt/1.5.23 (2014-03-12)

On 2015-09-10 19:48, Aurelien Jarno wrote:
> On 2015-09-01 22:51, Richard Henderson wrote:
> > I've been looking at this problem off and on for the last week or so,
> > prompted by the sparc performance work.  Although I haven't been able
> > to get a proper sparc64 guest install working, I see the exact same
> > problem with a mips guest.
> > 
> > On alpha or x86, which seem to perform well, perf numbers for the
> > executable have about 30% of the execution time spent in cpu_exec.
> > For mips, on the other hand, we spend about 30% of the time in
> > routines related to tcg (re-)translation.
> 
> Indeed the problem happens on CPUs which implement the MMU as a
> "software assisted TLB" (or any other marketing name), as opposed to a
> hardware page-walk MMU. They can hold a limited number of TLB entries
> at a given time, and require the OS to do the page walk to refill the
> TLB. For that an exception is generated, and the faulting address has
> to be determined. That's where the TB retranslation takes place, and
> that's why it happens a lot more on these CPUs.
> 
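To illustrate the mechanism described above, here is a minimal C sketch of a software-refilled TLB. All names here (soft_tlb, page_walk, tlb_translate) are made up for the example and are not taken from QEMU or any real kernel; the page walk is reduced to an identity mapping. The point is the miss path: it is the "refill exception", where the faulting guest address has to be recovered, and that recovery is what forces the TB retranslation.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

typedef struct { uint64_t vpn; uint64_t pfn; int valid; } TLBEntry;

static TLBEntry soft_tlb[TLB_ENTRIES];
static int tlb_victim;    /* round-robin replacement index */
static int refill_count;  /* counts simulated "TLB refill exceptions" */

/* Stand-in for the page walk the guest OS performs in its refill
 * handler; here simply an identity mapping. */
static uint64_t page_walk(uint64_t vpn) { return vpn; }

/* Translate a virtual address.  On a miss we emulate the refill
 * exception: the OS walks the page table and writes a new entry.
 * In QEMU, this miss path is where the faulting guest PC must be
 * recovered -- the operation that used to require retranslating
 * the TB. */
static uint64_t tlb_translate(uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint64_t off = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (soft_tlb[i].valid && soft_tlb[i].vpn == vpn) {
            return (soft_tlb[i].pfn << PAGE_SHIFT) | off;  /* hit */
        }
    }

    /* Miss: take the refill exception and install a fresh entry. */
    refill_count++;
    uint64_t pfn = page_walk(vpn);
    soft_tlb[tlb_victim] = (TLBEntry){ .vpn = vpn, .pfn = pfn, .valid = 1 };
    tlb_victim = (tlb_victim + 1) % TLB_ENTRIES;
    return (pfn << PAGE_SHIFT) | off;
}
```

With only 16 entries and round-robin eviction, any workload touching more pages than that keeps taking refill exceptions, which is why these CPUs hit the retranslation path so much harder than hardware-walked MMUs.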
> A few years ago, I measured about 45% of the TB translation actually
> being retranslation for mips and 60% for SH4 on a standard workload.
> For comparison, these values are around 1% on i386 and around 5% on ARM.
> 
> That's why each time we add an optimization to the optimizer, we get
> faster code, but we might lose overall because translation takes longer.
> 
> > Aurelien has a patch in his own branches that attempts to mitigate this
> > on mips by shadow-caching more tlb entries.  While this does improve
> > performance a bit, it employs a linear search through a large buffer,
> > with the effect that perf attributes around 30% of the time to
> > r4k_map_address.  (One could probably improve things by hashing the
> > data in that array, rather than doing a linear search, but...)
> 
> Yes, that is just a workaround and probably highly workload dependent,
> that's why I never submitted it.
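The hashing idea mentioned above could look roughly like this. This is a hypothetical sketch, not the actual patch: a direct-mapped table indexed by a hash of the virtual page number, which turns the O(n) scan through the shadow buffer into a constant-time probe, at the cost of collisions silently evicting entries (one reason it would remain workload dependent).

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE 1024  /* must be a power of two */

typedef struct { uint64_t vpn; uint64_t pfn; int valid; } ShadowEntry;

static ShadowEntry shadow[CACHE_SIZE];

/* Trivial hash: low bits of the virtual page number. */
static unsigned hash_vpn(uint64_t vpn) { return vpn & (CACHE_SIZE - 1); }

/* Constant-time probe instead of a linear search over the buffer. */
static int shadow_lookup(uint64_t vpn, uint64_t *pfn)
{
    ShadowEntry *e = &shadow[hash_vpn(vpn)];
    if (e->valid && e->vpn == vpn) {
        *pfn = e->pfn;
        return 1;
    }
    return 0;
}

/* Insert overwrites whatever hashed to the same slot. */
static void shadow_insert(uint64_t vpn, uint64_t pfn)
{
    shadow[hash_vpn(vpn)] = (ShadowEntry){ .vpn = vpn, .pfn = pfn, .valid = 1 };
}
```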
> 
> > In the past we've talked about getting rid of retranslation entirely.
> > It's clever, but it certainly has its share of problems.  I gave it
> > a go this weekend.
> 
> Really great that you have been able to implement that.
> 
> > The following isn't quite right.  It fails to boot on sparc even with
> > our tiny test kernel.  It also triggers an abort on mips, eventually.
> > But it's able to get all the way through to a prompt, and in the 
> > process I can see that perf results are quite different -- much more
> > like results I see for alpha.
> > 
> > Thoughts on the approach?
> 
> It looks like the approach we discussed with Paolo back in June:
> 
> http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg04885.html
> 
> To me it looks like the right way to proceed; we just have to take
> care that the information to store does not take too much space
> compared to the actual translated code.
> 
> I'll take a look and run some tests as soon as possible.
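The space concern can be illustrated with a sketch of such a per-TB side table. This is a hypothetical layout, not the encoding the branch actually uses: one small delta pair per guest instruction (bytes of host code emitted, bytes of guest code consumed) lets a cheap linear walk map a faulting host-code offset back to the corresponding guest offset, without ever retranslating the TB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per translated guest instruction.  Using byte-sized
 * deltas keeps the table small relative to the translated code;
 * a real encoding would need an escape for large host deltas. */
typedef struct {
    uint8_t host_delta;   /* bytes of host code for this guest insn */
    uint8_t guest_delta;  /* bytes of guest code (e.g. 4 on mips) */
} InsnMap;

/* Accumulate deltas until the faulting host offset falls inside the
 * current instruction's host-code range; return the guest offset of
 * that instruction. */
static size_t host_to_guest_off(const InsnMap *map, size_t n, size_t host_off)
{
    size_t h = 0, g = 0;
    for (size_t i = 0; i < n; i++) {
        if (host_off < h + map[i].host_delta) {
            return g;
        }
        h += map[i].host_delta;
        g += map[i].guest_delta;
    }
    return g;
}
```

At two bytes per guest instruction, the table stays well below the size of the host code it describes, which is the trade-off worth checking against the real encoding in the branch.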

I haven't really reviewed the code yet, but I have been able to test
your tcg-search-2 branch.

First of all, I have tested half of the targets (alpha, arm, cris, i386,
mips, ppc, s390x, sh4 and sparc), and I haven't noticed any regression.
They now have more than 50 hours of uptime, and some of them have been
building stuff most of the time, so they are quite stable. That said,
I have only tested your branch on an x86-64 host, and it might be a
good idea to test it on one or two other host architectures (I have put
that on my todo list, but no promises there).

On the performance side, I have done real measurements only on i386 and
mips. On i386, I haven't seen any measurable difference. On mips, the
boot time is unchanged, but some workloads are noticeably faster. The
best improvement I have measured is on perl code, with a 2.4x speedup,
while on an average workload the gain is around 1.5x.

With all that said, you can get:

  Tested-by: Aurelien Jarno <address@hidden>

I hope to give you the corresponding Reviewed-by in the coming days.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
address@hidden                 http://www.aurel32.net


