Re: [Qemu-devel] i386 emulation: improved flag handing

From:	Magnus Damm
Subject:	Re: [Qemu-devel] i386 emulation: improved flag handing
Date:	Sun, 29 Aug 2004 16:16:47 +0200

Hello again,

Yes, your idea sounds like a simple and efficient solution for the
inc/dec problem! I will play around a bit with my idea and see how the
code evolves... I thought that qemu always calculated all flags for each
conditional branch, but it seems that that code is fast today so there
is probably no need to optimize that part. And another thing might be
that what is efficient for me on PowerPC might not be the best thing for
the poor bastards stuck with x86 hardware. =)

Anyway, I also believe that today's CC_OP solution, with its delayed
micro code insertion, makes it very complicated to add an optimization
layer on top of it (pre- or post-translation), and I have a feeling that
adding usage and constant tracing for registers and flags would allow us
to reach even better performance.

But kernel support will probably boost the performance even more!
Good luck with that, please let us know how things are going.

Thanks!

/ magnus

On Sun, 2004-08-29 at 14:58, Fabrice Bellard wrote:
> Hi,
> 
> The current QEMU eflags handling is not efficient for inc/dec as it must 
> recompute the C flag which is not modified by inc/dec. I think this is 
> the most important slowdown due to eflags handling. A simple solution 
> would just be to save CC_OP/CC_SRC/CC_DST instead of computing 'CF'. A 
> test is still needed if an inc/dec is followed by inc/dec to avoid 
> saving CC_OP/CC_SRC/CC_DST again.
> 
> So the eflags state would be:
> CC_OP
> CC_SRC
> CC_DST
> CC_OP_C
> CC_DST_C
> 
> if CC_OP == CC_OP_INC/DEC then all eflags except C are computed from 
> CC_SRC. 'CF' is computed from CC_OP_C, CC_DST_C and CC_SRC (CC_OP_C must 
> never be CC_OP_INC/DEC).
> 
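A minimal C sketch of one way to read the proposal above. Assumptions
that are not in the mail: the inc/dec result is stored in CC_DST, CC_SRC
keeps the source operand of the last non-inc/dec operation, and the
struct, the helper names and the ADD/SUB carry formulas are illustrative
only. Only ZF and CF are modelled.

```c
#include <assert.h>
#include <stdint.h>

enum cc_op { CC_OP_ADD, CC_OP_SUB, CC_OP_INC, CC_OP_DEC };

struct cc_state {
    enum cc_op op;      /* CC_OP */
    uint32_t src, dst;  /* CC_SRC, CC_DST */
    enum cc_op op_c;    /* CC_OP_C: last op that defined CF */
    uint32_t dst_c;     /* CC_DST_C: its result */
};

static int is_incdec(enum cc_op op)
{
    return op == CC_OP_INC || op == CC_OP_DEC;
}

/* add: a full flag update, only the three classic variables are written */
static uint32_t do_add(struct cc_state *s, uint32_t a, uint32_t b)
{
    s->op = CC_OP_ADD;
    s->src = b;
    s->dst = a + b;
    return s->dst;
}

/* inc: save the CF-defining state instead of computing CF (assumed layout) */
static uint32_t do_inc(struct cc_state *s, uint32_t a)
{
    if (!is_incdec(s->op)) {   /* inc after inc/dec must not save again */
        s->op_c = s->op;
        s->dst_c = s->dst;
    }
    s->op = CC_OP_INC;
    s->dst = a + 1;            /* CC_SRC is deliberately left untouched */
    return s->dst;
}

/* carry of a non-inc/dec operation from its source and result */
static int carry_of(enum cc_op op, uint32_t src, uint32_t dst)
{
    switch (op) {
    case CC_OP_ADD: return dst < src;
    case CC_OP_SUB: return (uint32_t)(dst + src) < src;
    default:        return 0;
    }
}

static int get_cf(const struct cc_state *s)
{
    if (is_incdec(s->op)) {
        assert(!is_incdec(s->op_c));  /* CC_OP_C is never INC/DEC */
        return carry_of(s->op_c, s->src, s->dst_c);
    }
    return carry_of(s->op, s->src, s->dst);
}

static int get_zf(const struct cc_state *s)
{
    return s->dst == 0;
}
```

With this layout, an add of 0xffffffff and 1 followed by any number of
inc operations still reports CF set, and no flag is computed until
get_cf() or get_zf() is actually called.
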
> Your solution seems a little too complicated for the expected gain. Try 
> to compare it with my proposal.
> 
> Just for your information, my next developments will consist of 
> improving QEMU performance in the x86 on x86 case to match (or exceed 
> :-)) the VMware or VirtualPC level of performance. The downside is that 
> some kernel support will be needed. The kernel support will of course 
> remain optional. This mode of operation will replace 'qemu-fast'.
> 
> For the x86 on PowerPC case, better usage of the host registers would 
> give a performance boost. In particular, CC_SRC and CC_DST should be 
> saved in host registers too.
> 
> Fabrice.
> 
> Magnus Damm wrote:
> > Hi all,
> > 
> > Here is something that I've been thinking about for the last week. I
> > hope it can lead to improved performance.
> > 
> > / magnus
> > 
> > 
> > The flag emulation code today:
> > -------------------------------
> > 
> > The implementation today is rather straightforward and simple:
> > 
> > 1. Each emulated instruction that modifies any flag will update up to
> > three variables containing instruction type (CC_OP), source value
> > (CC_SRC) and destination value (CC_DST). If the instruction does not
> > modify all flags, the flags it leaves untouched are calculated first
> > from the previous state - hopefully only the carry flag.
> > 
> > 2. When an instruction depends on a flag, all flags (or just the carry
> > flag) are calculated from the stored information.
> > 
> > 3. During the opcode to micro operations translation, the last type of
> > flag instruction (CC_OP) is kept track of and only written if necessary.
> > 
> > 4. After the i386 opcodes have been translated into micro operations,
> > an optimization step replaces redundant micro operations with NOPs.
> > 
> > 
> > Improved flag handling - a more fine grained approach:
> > ------------------------------------------------------
> > 
> > By looking at the "status flag summary" in my 486 book I understand that
> > there are basically three groups of x86 instructions that modify flags.
> > Note that this does not include rare single-flag modifying instructions.
> > 
> >    OF SF ZF AF PF CF
> > A  x  x  x  x  x  x   
> > B  x  x  x  x  x      
> > C  x              x   
> > 
> > Say hello to group A, group B and group C. Group A contains the most
> > common flag operations, group B is basically INC and DEC while group C
> > contains various shift instructions.
> > 
> > Each group is kept track of with two variables, CC_SRC_<group> and
> > CC_DST_<group>. The current value of the EFLAGS register is stored in a
> > variable called CC_EFLAGS. A 32 bit variable, CC_CACHE, is used to store
> > the state of each flag. Six tables, one for each flag (cc_table_<flag>),
> > are used to look up flag calculating functions.
> > 
> > 
> > CC_CACHE format:
> > 
> > 12 bits flag state      18 bits group info
> > 
> > OF SF ZF AF PF CF       A      B      C 
> >                         
> > NN NN NN NN NN NN       NNNNNN NNNNNN NNNNNN
> > 
> > Each flag has a two bit field indicating the state:
> > 
> > 0 -> flag is up to date, no need to flush cache.
> > 1 -> flag was last modified by group A
> > 2 -> flag was last modified by group B
> > 3 -> flag was last modified by group C
> > 
> > 
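A small C sketch of the CC_CACHE packing described above. The bit
positions are an assumption (the diagram is read most significant field
first, so group info C sits in bits 0-5 and the OF state in bits 28-29),
and the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Flag order matches the diagram read right to left (CF lowest). */
enum cc_flag  { CC_F_CF, CC_F_PF, CC_F_AF, CC_F_ZF, CC_F_SF, CC_F_OF };
/* Values match the two bit flag states: 0 means "up to date". */
enum cc_group { CC_G_NONE, CC_G_A, CC_G_B, CC_G_C };

#define CC_INFO_BITS   6u     /* one 6 bit info field per group */
#define CC_INFO_MASK   0x3fu
#define CC_STATE_BASE  18u    /* flag states start above 3 * 6 info bits */

static unsigned cc_state_shift(enum cc_flag f)
{
    return CC_STATE_BASE + 2u * (unsigned)f;
}

static unsigned cc_info_shift(enum cc_group g)
{
    assert(g != CC_G_NONE);   /* only groups A, B and C have an info field */
    return CC_INFO_BITS * (3u - (unsigned)g);   /* A: 12, B: 6, C: 0 */
}

static enum cc_group cc_get_state(uint32_t cache, enum cc_flag f)
{
    return (enum cc_group)((cache >> cc_state_shift(f)) & 3u);
}

static uint32_t cc_set_state(uint32_t cache, enum cc_flag f, enum cc_group g)
{
    cache &= ~(3u << cc_state_shift(f));
    return cache | ((uint32_t)g << cc_state_shift(f));
}

static unsigned cc_get_info(uint32_t cache, enum cc_group g)
{
    return (cache >> cc_info_shift(g)) & CC_INFO_MASK;
}

static uint32_t cc_set_info(uint32_t cache, enum cc_group g, unsigned insn)
{
    cache &= ~(CC_INFO_MASK << cc_info_shift(g));
    return cache | ((insn & CC_INFO_MASK) << cc_info_shift(g));
}
```

The 12 state bits and 18 info bits use 30 of the 32 bits, so the top two
bits stay free in this layout.
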
> > When an instruction that belongs to group A is translated into micro
> > operations, the last micro operation will perform up to three variable
> > writes:
> > 
> > 1. CC_CACHE is written with all flag states set to 1 (indicating that
> > the flag belongs to group A) and the group info A field set to the
> > instruction number (compare with CC_OP today). This is a single 32 bit
> > write.
> > 
> > 2. CC_DST_A is set in the same way as CC_DST today.
> > 
> > 3. If required, CC_SRC_A is set too.
> > 
> > When a group B or C instruction is translated, the last micro operation
> > will perform:
> > 
> > 1. CC_CACHE is modified (read-modify-write) to update the flags and
> > group info field B or C. For group B, all flags except CF are set to 2
> > (indicating group B). For group C, the OF and CF fields are set to 3
> > indicating group C.
> > 
> > 2. CC_DST_B is written for group B, CC_DST_C for group C.
> > 
> > 3. If required, CC_SRC_B or CC_SRC_C is written.
> > 
> > Because group A instructions are the most common ones, the group A
> > implementation is faster (no read-modify-write) than group B and C.
> > 
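The difference between the plain store for group A and the
read-modify-write for groups B and C could look roughly like this in C.
This is only a sketch: the bit layout (info fields in bits 0-17, flag
states in bits 18-29), the struct and all macro and function names are
assumptions, not existing QEMU code.

```c
#include <assert.h>
#include <stdint.h>

enum { F_CF, F_PF, F_AF, F_ZF, F_SF, F_OF };

/* two bit state field of one flag, holding group number grp (1..3) */
#define ST(flag, grp)   ((uint32_t)(grp) << (18 + 2 * (flag)))
#define ALL_FLAGS(grp)  (ST(F_CF, grp) | ST(F_PF, grp) | ST(F_AF, grp) | \
                         ST(F_ZF, grp) | ST(F_SF, grp) | ST(F_OF, grp))
#define MASK_B          (ALL_FLAGS(3) & ~ST(F_CF, 3))   /* all but CF */
#define MASK_C          (ST(F_OF, 3) | ST(F_CF, 3))     /* OF and CF  */
#define INFO_A(n)       ((uint32_t)(n) << 12)
#define INFO_B(n)       ((uint32_t)(n) << 6)
#define INFO_C(n)       ((uint32_t)(n))

struct cc {
    uint32_t cache;                 /* CC_CACHE */
    uint32_t src_a, dst_a;          /* CC_SRC_A, CC_DST_A */
    uint32_t src_b, dst_b;          /* CC_SRC_B, CC_DST_B */
    uint32_t src_c, dst_c;          /* CC_SRC_C, CC_DST_C */
};

/* Group A owns every flag afterwards, so one plain store is enough. */
static void cc_update_a(struct cc *s, unsigned insn, uint32_t src, uint32_t dst)
{
    assert(insn < 64);              /* must fit the 6 bit info field */
    s->cache = ALL_FLAGS(1) | INFO_A(insn);
    s->src_a = src;
    s->dst_a = dst;
}

/* Group B leaves CF alone, so the old CF state must be preserved. */
static void cc_update_b(struct cc *s, unsigned insn, uint32_t src, uint32_t dst)
{
    assert(insn < 64);
    uint32_t c = s->cache & ~(MASK_B | INFO_B(0x3f));
    s->cache = c | (ALL_FLAGS(2) & MASK_B) | INFO_B(insn);
    s->src_b = src;
    s->dst_b = dst;
}

/* Group C only takes over OF and CF. */
static void cc_update_c(struct cc *s, unsigned insn, uint32_t src, uint32_t dst)
{
    assert(insn < 64);
    uint32_t c = s->cache & ~(MASK_C | INFO_C(0x3f));
    s->cache = c | (ALL_FLAGS(3) & MASK_C) | INFO_C(insn);
    s->src_c = src;
    s->dst_c = dst;
}
```

After a group A update followed by a group B update, the CF state still
says "group A" and the group info A field is intact, which is what lets
CF be computed later from CC_SRC_A and CC_DST_A.
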
> > 
> > Question: What happens when an instruction needs to test one or more
> > flags? Answer: Before a flag can be used to calculate anything, a micro
> > operation that flushes the state of that flag must be performed, one
> > micro operation per flag. The post-translation optimization step could
> > probably replace more than N flag flush micro operations with a single
> > micro operation flushing all flags if that would be more efficient.
> > 
> > When the cache of one flag is flushed, the corresponding flag state
> > field in CC_CACHE is read out and used as an index into cc_group_<flag>
> > to point out the function used to flush the flag.
> > 
> > Every cc_group_<flag>[0] entry points to a function that just returns;
> > remember that a flag state of 0 means that the flag is up to date.
> > The other functions will calculate the flag based on CC_DST_<group> and
> > CC_SRC_<group>, store the result in CC_EFLAGS and then mark the flag
> > state in CC_CACHE as 0 to indicate that the flag is now up to date.
> > The actual implementation of the flag calculation code will of course
> > vary; for some flags the code could be shared between all instruction
> > types in one group. Example: ZF and PF are probably handled in the same
> > way for all group A instructions. Other flags will probably need a
> > second lookup based on the instruction type.
> > 
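For ZF, the table-driven flush described above might be sketched in C
like this. Assumptions: the ZF state field sits at bit 24 of CC_CACHE,
results are 32 bits wide, ZF depends only on CC_DST_<group>, and the
struct and function names are invented for the example. ZF being bit 6
of EFLAGS is the one architectural fact used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZF_MASK         0x40u   /* ZF is bit 6 of EFLAGS */
#define ZF_STATE_SHIFT  24      /* assumed position of the ZF state field */

struct cc {
    uint32_t cache;             /* CC_CACHE */
    uint32_t eflags;            /* CC_EFLAGS */
    uint32_t dst_a, dst_b;      /* CC_DST_A, CC_DST_B */
};

typedef void (*cc_flush_fn)(struct cc *);

/* Store the computed ZF and mark its state as 0 (up to date). */
static void zf_store(struct cc *s, uint32_t result)
{
    s->eflags = (s->eflags & ~ZF_MASK) | (result == 0 ? ZF_MASK : 0);
    s->cache &= ~(3u << ZF_STATE_SHIFT);
}

static void zf_uptodate(struct cc *s) { (void)s; }  /* state 0: just return */
static void zf_from_a(struct cc *s)   { zf_store(s, s->dst_a); }
static void zf_from_b(struct cc *s)   { zf_store(s, s->dst_b); }

/* Group C never modifies ZF, so state 3 should not occur for this flag;
 * the last entry is a harmless placeholder. */
static const cc_flush_fn cc_group_zf[4] = {
    zf_uptodate, zf_from_a, zf_from_b, zf_uptodate
};

/* The per-flag flush micro operation: index the table by the flag state. */
static void cc_flush_zf(struct cc *s)
{
    assert(s != NULL);
    cc_group_zf[(s->cache >> ZF_STATE_SHIFT) & 3u](s);
}
```

A second flush right after the first one hits entry 0 and returns
without touching CC_EFLAGS, which is the cheap path the scheme relies on.
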
> > 
> > So, what my improved flag handling scheme basically does is to divide
> > the load of calculating the flags into several small pieces. Only the
> > flags required by an instruction must be flushed. I hope that some
> > cycles could be saved by not calculating all flags. The downside is of
> > course that it will be less efficient to update all flags compared with
> > the implementation today, and that it is less efficient to modify group
> > B/C (read-modify-write) and store CC_DST_B/C + CC_SRC_B/C than to just
> > store CC_OP, CC_DST and CC_SRC like today.
> > 
> > A good thing though is that it is always possible to set any flag in the
> > EFLAGS register without recalculating any other flags. And, of course, I
> > feel that it would be easier to add more advanced optimization code
> > later on...
> > 
> > Should I start hacking on a patch? Or would it be a waste of time?
> > Please let me know what you think. Thanks!
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Qemu-devel mailing list
> > address@hidden
> > http://lists.nongnu.org/mailman/listinfo/qemu-devel
