[Qemu-devel] Possible ppc comparison optimisation

From: Torbjorn Granlund
Date: Wed, 08 May 2013 00:56:54 +0200

The current ppc gen_op_cmp generates a long sequence of instructions,
using a plain series of three disjoint compares.

It is possible to compute the 3 result bits more cleverly.  Below is a
possible replacement gen_op_cmp.  (It is tested by booting GNU/Linux
ppc64, but not much more than that.)

Surely this should be faster than the old code?  OK, it is less
readable, but cmp is pretty critical and should be made fast.

Should one truncate things using tcg_gen_trunc_tl_i32 and do the add,
xori, addi as i32 variants?  (Why?)

There could be a disadvantage of this compared to the old code, since
this has a chained algebraic dependency, while the old code's many
instructions might have been more independent.

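static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
{
    TCGv t0 = tcg_temp_new();
    TCGv t1 = tcg_temp_new();
    TCGv_i32 s0 = tcg_temp_new_i32();

    /* Start the CR field with the SO bit.  */
    tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);

    /* t0 = (arg0 <= arg1) + (arg0 < arg1): 2 if <, 1 if ==, 0 if >.  */
    tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1);
    tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1);
    tcg_gen_add_tl(t0, t0, t1);
    /* xor 1, then add 1: 4 if <, 1 if ==, 2 if >.  */
    tcg_gen_xori_tl(t0, t0, 1);
    tcg_gen_addi_tl(t0, t0, 1);
    /* Shift left by 1 to get LT = 8, EQ = 2, GT = 4, and merge with SO.  */
    tcg_gen_trunc_tl_i32(s0, t0);
    tcg_gen_shli_i32(s0, s0, 1);
    tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0);

    /* Release the temporaries allocated above.  */
    tcg_temp_free(t0);
    tcg_temp_free(t1);
    tcg_temp_free_i32(s0);
}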

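To make the arithmetic easier to check, here is a small host-side C model of the add/xor/add/shift sequence, compared against a plain three-compare reference.  This is only a sketch: the names cmp_ref, cmp_arith and check_all are made up for illustration and are not part of QEMU, and the model uses 64-bit host integers in place of TCG values.

```c
#include <assert.h>
#include <stdint.h>

/* CR field bit values: LT = 8, GT = 4, EQ = 2, SO = 1. */
enum { CRF_SO = 1, CRF_EQ = 2, CRF_GT = 4, CRF_LT = 8 };

/* Reference model (hypothetical helper): three disjoint compares. */
static unsigned cmp_ref(int64_t a, int64_t b, int is_signed, unsigned so)
{
    int lt = is_signed ? (a < b) : ((uint64_t)a < (uint64_t)b);
    int gt = is_signed ? (a > b) : ((uint64_t)a > (uint64_t)b);
    unsigned r = so & 1;

    if (lt) {
        r |= CRF_LT;
    } else if (gt) {
        r |= CRF_GT;
    } else {
        r |= CRF_EQ;
    }
    return r;
}

/* Arithmetic model (hypothetical helper) of the proposed sequence. */
static unsigned cmp_arith(int64_t a, int64_t b, int is_signed, unsigned so)
{
    unsigned le = is_signed ? (a <= b) : ((uint64_t)a <= (uint64_t)b);
    unsigned lt = is_signed ? (a < b) : ((uint64_t)a < (uint64_t)b);
    unsigned t = le + lt;        /* 2 if <, 1 if ==, 0 if > */

    t ^= 1;                      /* 3 if <, 0 if ==, 1 if > */
    t += 1;                      /* 4 if <, 1 if ==, 2 if > */
    return (so & 1) | (t << 1);  /* 8 if <, 2 if ==, 4 if >, plus SO */
}

/* Compare both models over a few boundary values, signed and unsigned. */
static void check_all(void)
{
    static const int64_t v[] = { INT64_MIN, -2, -1, 0, 1, 2, INT64_MAX };
    const int n = (int)(sizeof(v) / sizeof(v[0]));

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int s = 0; s < 2; s++) {
                for (unsigned so = 0; so < 2; so++) {
                    assert(cmp_arith(v[i], v[j], s, so) ==
                           cmp_ref(v[i], v[j], s, so));
                }
            }
        }
    }
}
```

Calling check_all() from a small main() exercises both signed and unsigned compares.  It says nothing about speed, only that the three result bits agree for these inputs.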


