From: Filip Navara
Subject: Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
Date: Mon, 29 Jun 2009 01:19:49 +0200

On Sun, Jun 28, 2009 at 11:24 PM, Laurent Desnogues <address@hidden> wrote:
> On Sun, Jun 28, 2009 at 8:19 PM, Filip Navara<address@hidden> wrote:
>> Doing a profiling run on several ARM demo programs showed that most of
>> the generated code was doing load/store operations to the machine
>> registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
>> counts):
>>
>> movi_i32 1603
>> ld_i32 1305
>> st_i32 1174
>> add_i32 530
>> ...
>>
>> If there could be done something that would allow the guest registers
>> to be stored in host registers, even if for a temporary amount of time
>> it would certainly help the guests that I'm dealing with.
>
> TCG does a good job for register allocation.
>
> The problem you have here is that the ARM translator
> isn't using tcg_global_mem_new_i32 for ARM registers.
Interesting, thanks for the tip. I have been trying to achieve the
same effect using tcg_global_reg_new_i32; no wonder it felt so hard.
:)
> Here's an example of number of ops I see when using
> tcg_global_mem_new_i32:
>
> exit_tb 4991
> add_i32 7945
> st_i32 8257
> movi_i32 26812
> mov_i32 38369
>
> And with the trunk:
>
> exit_tb 4957
> add_i32 8165
> st_i32 20281
> ld_i32 21926
> movi_i32 25083
>
>
> Laurent
>
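The drop in ld_i32/st_i32 counts quoted above can be illustrated with a
toy model. The sketch below is not QEMU code and does not use the TCG
API: GuestAdd, OpCounts, count_naive and count_cached are hypothetical
names made up for this example. It assumes a block consisting only of
guest instructions of the form "rd = rn + rm" and compares two schemes:
reloading and storing guest registers around every instruction, versus
keeping them in host temporaries for the duration of a block and
writing back once at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: one guest instruction "rd = rn + rm". */
typedef struct { int rd, rn, rm; } GuestAdd;

/* Number of emitted operations of each kind. */
typedef struct { int ld, st, add; } OpCounts;

#define NUM_GUEST_REGS 16

/* Naive scheme: load both operands from the CPU state, add,
 * and store the result back, for every single instruction. */
OpCounts count_naive(const GuestAdd *block, size_t n)
{
    OpCounts c = {0, 0, 0};
    (void)block; /* the cost does not depend on which registers are used */
    for (size_t i = 0; i < n; i++) {
        c.ld += 2;
        c.add += 1;
        c.st += 1;
    }
    return c;
}

/* Cached scheme: a guest register is loaded at most once per block,
 * results stay in host temporaries, and each modified register is
 * written back once when the block ends. */
OpCounts count_cached(const GuestAdd *block, size_t n)
{
    OpCounts c = {0, 0, 0};
    int in_host[NUM_GUEST_REGS] = {0}; /* value currently held in a temp */
    int dirty[NUM_GUEST_REGS] = {0};   /* temp differs from CPU state */

    for (size_t i = 0; i < n; i++) {
        int srcs[2] = { block[i].rn, block[i].rm };
        for (int k = 0; k < 2; k++) {
            assert(srcs[k] >= 0 && srcs[k] < NUM_GUEST_REGS);
            if (!in_host[srcs[k]]) {
                c.ld++;
                in_host[srcs[k]] = 1;
            }
        }
        assert(block[i].rd >= 0 && block[i].rd < NUM_GUEST_REGS);
        c.add++;
        in_host[block[i].rd] = 1;
        dirty[block[i].rd] = 1;
    }
    for (int r = 0; r < NUM_GUEST_REGS; r++) {
        if (dirty[r]) {
            c.st++;
        }
    }
    return c;
}
```

For the three-instruction block r0 = r0 + r1; r0 = r0 + r2; r3 = r0 + r0,
the naive scheme counts 6 loads and 3 stores, while the cached scheme
counts 3 loads and 2 stores. A real register allocator also has to deal
with a limited number of host registers, helper calls and exceptions,
none of which this sketch models.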
Attached is a proof-of-concept ARM patch that uses
tcg_global_mem_new_i32. I haven't had much time to test it yet, but on
a synthetic benchmark it improved performance by 13 DMIPS (from 203 to
216 DMIPS), which is roughly a 6% improvement. On an x86 host the
register allocation still looks quite poor; I will post a follow-up
soon.

Best regards,
Filip Navara
0001-First-try-at-using-tcg_global_mem_new_i32.patch.txt
Description: Text document