Re: Question about direct block chaining
From: Alex Bennée
Subject: Re: Question about direct block chaining
Date: Tue, 19 Apr 2022 11:24:22 +0100
User-agent: mu4e 1.7.13; emacs 28.1.50
Taylor Simpson <tsimpson@quicinc.com> writes:
>> -----Original Message-----
>> From: Richard Henderson <richard.henderson@linaro.org>
>> Sent: Monday, April 18, 2022 10:38 AM
>> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
>> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
>> Subject: Re: Question about direct block chaining
>>
>> On 4/18/22 07:54, Taylor Simpson wrote:
>> > I implemented both approaches for inner loops and didn't see speedup
>> > in my benchmark. So, I have a couple of questions
>> > 1) What are the pros and cons of the two approaches
>> > (lookup_and_goto_ptr and goto_tb + exit_tb)?
>>
>> goto_tb can only be used within a single page (plus other restrictions, see
>> translator_use_goto_tb). In addition, as documented, the change in cpu
>> state must be constant, beginning with a direct jump.
>>
>> lookup_and_goto_ptr can handle any change in cpu state, including indirect
>> jumps.
>>
>>
>> > 2) How can I verify that direct block chaining is working properly?
>> > With -d exec, I see lines like the following with goto_tb + exit_tb
>> > but NOT lookup_and_goto_ptr
>> > Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40 [0050ac6c]
>>
>> Well, that's one way. I would have also suggested simply looking at -d op
>> output, for the various branchy cases you're considering, to see that all
>> of the exits are as expected.
>
> Thanks!!
>
> I created a synthetic benchmark with a loop with a very small body and a very
> high number of iterations. I can see differences in execution time.
>
> Here are my observations:
> - goto_tb + exit_tb gives the fastest execution time because it will
> patch the native jump address
As we would expect.
> - lookup_and_goto_ptr is an improvement over tcg_gen_exit_tb(NULL, 0)
Yes - mainly saving the cost of the prologue and of coming out of
generated code to the main loop. However, once we get to tb_lookup and
miss in the tb_jump_cache, it's going to take some time to get a block
via the QHT.

The tb_jump_cache is pretty simple in its implementation, but I don't
know if we've ever decently characterised the hit rate and whether it
could be improved. I think we already have slightly different hashing
functions for user-mode vs softmmu.
(As an aside, I suspect the trace_vcpu_dstate check can now be removed,
which should save a bit of time on the hash function.)
--
Alex Bennée