qemu-devel

Re: Question about direct block chaining


From: Alex Bennée
Subject: Re: Question about direct block chaining
Date: Tue, 19 Apr 2022 11:24:22 +0100
User-agent: mu4e 1.7.13; emacs 28.1.50

Taylor Simpson <tsimpson@quicinc.com> writes:

>> -----Original Message-----
>> From: Richard Henderson <richard.henderson@linaro.org>
>> Sent: Monday, April 18, 2022 10:38 AM
>> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
>> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
>> Subject: Re: Question about direct block chaining
>> 
>> On 4/18/22 07:54, Taylor Simpson wrote:
>> > I implemented both approaches for inner loops and didn't see a speedup
>> > in my benchmark.  So, I have a couple of questions:
>> > 1) What are the pros and cons of the two approaches
>> >    (lookup_and_goto_ptr and goto_tb + exit_tb)?
>> 
>> goto_tb can only be used within a single page (plus other restrictions, see
>> translator_use_goto_tb).  In addition, as documented, the change in cpu
>> state must be constant, beginning with a direct jump.
>> 
>> lookup_and_goto_ptr can handle any change in cpu state, including indirect
>> jumps.
>> 
>> 
>> > 2) How can I verify that direct block chaining is working properly?
>> >    With -d exec, I see lines like the following with goto_tb + exit_tb
>> >    but NOT with lookup_and_goto_ptr:
>> >    Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40 [0050ac6c]
>> 
>> Well, that's one way.  I would have also suggested simply looking at -d op
>> output, for the various branchy cases you're considering, to see that all
>> of the exits are as expected.
>
> Thanks!!
>
> I created a synthetic benchmark with a loop with a very small body and a
> very high number of iterations.  I can see differences in execution time.
>
> Here are my observations:
> - goto_tb + exit_tb gives the fastest execution time because it will
> patch the native jump address

As we would expect.

> - lookup_and_goto_ptr is an improvement over tcg_gen_exit_tb(NULL, 0)

Yes, mainly by saving the cost of the prologue and of coming out of
generated code to the main loop. However, once we get to tb_lookup and
miss in the tb_jump_cache, it's going to take some time to find the
block via QHT.

The tb_jump_cache is pretty simple in its implementation, but I don't
know if we've ever properly characterised its hit rate or whether it
could be improved. I think we already have slightly different hashing
functions for user-mode vs softmmu.

(Aside: I suspect the trace_vcpu_dstate check can now be removed, which
should save a bit of time in the hash function.)

-- 
Alex Bennée


