[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] instruction optimization thoughts

From: Elefterios Stamatogiannakis
Subject: Re: [Qemu-devel] instruction optimization thoughts
Date: Tue, 24 Aug 2004 14:45:40 +0300
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a3) Gecko/20040727

I don't think the database solution would work. First of all the database would have to be very big in order to be effective. That means that in order to lookup the streams of instructions in the database you would thrash the cache (lots and lots of memory reads to various places). The Magnus's idea is very interesting because it looks like a very simple peephole optimizer. That is why it would be effective for blocks that get executed more than once. For a little more work qemu would have more efficient code to execute. Nevertheless i don't now how much of Magnus idea qemu implements right now. Maybe "Condition code optimisations" + "CPU state optimisations" from ' http://fabrice.bellard.free.fr/qemu/qemu-tech.html ' come very close to what Magnus suggests.

Fabrice or anyone else with qemu internals knowledge would be more qualified to answer.


ps All these code optimizing ideas pale in effectiveness with what a mmu optimization work would produce. There is a reason why there is a qemu-fast and a qemu-soft. If somehow these too could be consolidated then the performance gain would be considerable.... I think.

address@hidden wrote:
I have been thinking along these lines and believe that it could be taken even further. Let's assume that your pre-translation table optimizer could be made to work. As a second step after eliminating redundancy and removing effective NOPs, entries in the intermediate instruction stream could be considered keys into a database of translations. Depending on the amount of memory we want to spend, these keys could be small or even quite large - I imagine that we could go through a process of using something along the lines of a genetic algorithm to build highly optimized translation blocks for the most commonly occurring streams of instructions. The rudimentary building blocks that the genetic algorithm would work with would be all the various translations that a particular instruction stream could result in (e.g. different ordering of order independent instructions, etc..). The tables would, of course, be built off-line (a datafile could be placed in CVS, allowing us to contribute a large amount of upfront compute time to build highly optimized tables for a variety of different platforms).

Thinking along these lines is important. This project not only presents a fantastic emulator, but the "possibility" of eventually reaching par or better performance (if we are smart about it).


On Aug 22, 2004, at 5:01 AM, Magnus Damm wrote:

Hi all,

Today I've been playing around with qemu trying to understand how the
emulation works. I've tried some debug flags and looked at log files.

This is how I believe the translation between x86 opcodes and micro
operations is performed today, please correct me if I am wrong:

gen_intermediate_code_internal() in target-i386/translate.c is used to
build intermediate code. The function disas_insn() is used to convert
each opcode into several micro operations. When the block is finished,
the function optimize_flags() is used to optimize away some flag related
micro operations.

After looking at some log files I wonder if it would be possible to
reduce the number of micro operations (especially the ones involved in
flag handling) by analyzing resources used and set by each x86
instruction and then feed that information into the code that converts
x86 opcodes into micro operations.

Have a look at the following example:

0x300a8b99:  pop    ebx
0x300a8b9a:  add    ebx
0x300a8ba0:  mov    DWORD PTR [ebp-684],eax
0x300a8ba6:  xor    edx,edx
0x300a8ba8:  lea    eax,[ebp-528]
0x300a8bae:  mov    esi,esi
0x300a8bb0:  inc    edx
0x300a8bb1:  mov    DWORD PTR [eax]
0x300a8bb7:  add    eax
0x300a8bba:  cmp    edx
0x300a8bbd:  jbe    0x300a8bb0

If we analyze the x86 instructions and keep track of resources first,
instead of generating the micro operations directly, we would come up
with a table containing resource information related to each x86
instruction. This table contains data about required resources and
resources that will be set by each instruction.

The table could also quite easily be extended to contain flags that mark
if resources are constant or not which leads to further optimization
possibilities later.

instruction        | resources required | resources set

pop ebx            | ESP                |    EBX
add ebx,0x11927    | EBX                |    EBX OF SF ZF AF PF CF
mov ..ebp-684],eax | EBP EAX            | IO
xor edx,edx        | EDX                |    EDX OF SF ZF AF PF CF
lea eax,[ebp-528]  | EBP                |    EAX
mov esi,esi        | ESI                |    ESI

inc edx            | EDX                |    EDX OF SF ZF AF PF
mov ..[eax], 0     | EAX                | IO
add eax, 4         | EAX                |    EAX OF SF ZF AF PF CF
cmp edx, 0x4a      | EDX                |        OF SF ZF AF PF CF
jbe ..             | EIP CF ZF          |    EIP

Then we perform a optimization step. This step removes resources marked
as set that are redundant. Maybe the code for this step could be shared
by many target processors, think of it as some kind of generic resource

After optimization:

instruction        | resources required | resources set

pop ebx            | ESP                |    EBX
add ebx,0x11927    | EBX                |    EBX
mov ..ebp-684],eax | EBP EAX            | IO
xor edx,edx        | EDX                |    EDX
lea eax,[ebp-528]  | EBP                |    EAX
mov esi,esi        | ESI                |    ESI

inc edx            | EDX                |    EDX
mov ..[eax], 0     | EAX                | IO
add eax, 4         | EAX                |    EAX
cmp edx, 0x4a      | EDX                |        OF SF ZF AF PF CF
jbe ..             | EIP CF ZF          |    EIP

Several flag-related resources have been removed above. No other
registers have been removed, but that would also be possible. The
information left in the table is fed into the code that translates the
x86 opcodes into micro operations and it is up to that code to generate
as few micro operations as possible.

I guess what I am trying to say is that it would be cool to add a
generic optimization step before the opcode to micro operations
translation. But would it be useful? Or just slow?

Any thoughts? Maybe the flag handling code is fast enough today?

/ magnus

Qemu-devel mailing list

Qemu-devel mailing list

reply via email to

[Prev in Thread] Current Thread [Next in Thread]