Re: [Qemu-devel] [PATCH 3/3] tcg-alpha: New TCG target.


From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH 3/3] tcg-alpha: New TCG target.
Date: Tue, 23 Oct 2012 09:14:33 +1000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121016 Thunderbird/16.0.1

On 2012-10-23 07:39, Aurelien Jarno wrote:
>> +    alpha)
>> +           QEMU_CFLAGS="-msmall-data $QEMU_CFLAGS"
>> +           LDFLAGS="-Wl,--warn-multiple-gp $LDFLAGS"
>> +           host_guest_base="yes"
>> +           ;;
> 
> The host_guest_base part should now be removed, but that's only because
> I took so long to review the patch.

Noted.

>> +    case 'K':
>> +        /* Constraint 'K' means an immediate -255..255.  */
>> +        ct->ct |= TCG_CT_CONST_PN255;
> 
> I9 might be easier to read/remember.

I9 would (to me at least) imply -256 as well.

> Do these names correspond to something besides alphabetical order? They
> might be difficult to remember; something like 'Z' for the zero
> register, 'U' for an unsigned constant, and 'I' for a signed constant might
> be better.

They're the same letters as gcc uses for its constraints.  That makes it
easier to remember, I would think.
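
For illustration, here is roughly how such a constraint letter pairs with the
constant-match test in a TCG backend.  Only TCG_CT_CONST_PN255 comes from the
patch; the surrounding shape is a sketch of the usual backend pattern, not
quoted code:

    /* In the constraint parser:  */
    case 'K':
        ct->ct |= TCG_CT_CONST_PN255;   /* immediate in -255..255 */
        break;

    /* In the corresponding const-match check:  */
    if ((ct & TCG_CT_CONST_PN255) && val >= -255 && val <= 255) {
        return 1;
    }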

>> +        assert(ra != TMP_REG1);
>> +        rs = TMP_REG1;
>> +        break;
>> +    }
> 
> Are there cases where the scratch register can't be TMP_REG1 or
> TMP_REG2? This part of the code seems fragile, despite the assert().

Not so far.

I can't think of any better solution, except to have the generic tcg code
allow a callback to allocate a new register.  That would be generally
useful, allowing dedicated "temp registers" to be dropped from all backends.
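
Purely as a hypothetical illustration of that idea (no such hook exists in
TCG; the name and signature below are invented):

    /* Imagined interface: let a backend request a scratch register from
       the register allocator while emitting one op, instead of reserving
       fixed temporaries such as TMP_REG1/TMP_REG2 for the whole backend.  */
    TCGReg tcg_alloc_scratch_reg(TCGContext *s, TCGRegSet allowed);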

>> +    /* See if we can turn a large absolute address into an offset from $gp.
>> +       Note that we assert via -msmall-data and --warn-multiple-gp that
>> +       the $gp value is constant everywhere.  Which means that the translated
>> +       code shares the same value as we have loaded right now.  */
>> +    if (rb == TCG_REG_ZERO && orig != (int32_t)orig) {
>> +        register tcg_target_long gp __asm__("$29");
>> +        tcg_target_long gprel = orig - gp;
>> +
>> +        if (gprel == (int32_t)gprel) {
>> +            orig = val = gprel;
>> +            rb = TCG_REG_GP;
>> +        }
>> +    }
> 
> Is this something that really matches a lot of cases? AFAIU most of the
> loads and stores are going to be relative to the cpu_env variable. Also it
> means a 16-bit value that can be loaded through a 32-bit offset to gprel
> might be loaded in 2 instructions instead of one.

This code also handles tcg_gen_movi, via INSN_LDA as the opc.  It triggers
whenever we're loading an address inside the main executable, such as the
address of a helper function.

How do you imagine that two insns might be used instead of one?  This
code is placed late enough that it's 1-2 insns instead of 5.
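
As a made-up example of the saving (addresses invented for illustration):
suppose a helper sits at 0x1200f3456 and the constant $gp is 0x1200e0000,
so gprel = 0x13456.  Materializing the full 64-bit address takes the usual
movi sequence of up to 5 insns, while the $gp-relative form is:

        ldah    t0, 0x1(gp)             // t0 = $gp + 0x10000
        lda     t0, 0x3456(t0)          // t0 = $gp + 0x13456 = helper address

and a helper within +/-32KB of $gp would need only the single lda.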

>> +    if (orig == (int32_t)orig) {
>> +        if (l1 < 0 && orig >= 0) {
>> +            extra = 0x4000;
>> +            l1 = (int16_t)(val - 0x4000);
>> +        }
> 
> I don't get the use case of this. AFAIU a 32-bit signed value can be
> loaded with LDAH + LDA. Here it seems that it ends up being loaded with
> LDAH + LDAH + LDA.

Yes.  There is a set of 32-bit sign-extended values that cannot be
loaded with only two insns.  E.g. 0x7fffffff:

        ldah    t0, 0x8000(zero)        // t0 = 0xffffffff 80000000
        lda     t0, -1(t0)              // t0 = 0xffffffff 7fffffff     -- incorrect!

        ldah    t0, 0x4000(zero)        // t0 = 0x00000000 40000000
        ldah    t0, 0x4000(t0)          // t0 = 0x00000000 80000000
        lda     t0, -1(t0)              // t0 = 0x00000000 7fffffff     -- correct!
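
The arithmetic behind that, as a sketch (variable names are illustrative,
not the patch's; lda adds a sign-extended 16-bit displacement, ldah adds
one shifted left by 16):

    tcg_target_long val = 0x7fffffff;
    int16_t l1 = val;                     /* -1                  */
    int16_t l2 = (val - l1) >> 16;        /* 0x8000 -> -0x8000   */
    /* A single ldah with displacement l2 produces 0xffffffff80000000
       in a 64-bit register, which the following lda cannot repair.
       Splitting the high part into two positive 0x4000 ldah steps
       (the "extra = 0x4000" above) keeps every partial sum correct.  */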

>> +    switch (sizeop) {
>> +    case 0 | 4 | 8:
>> +    case 0 | 4:
>> +    case 1 | 4:
>> +    case 2:
>> +        tgen_extend(s, sizeop & 7, ra, ra);
>> +        break;
>> +
>> +    case 0:
>> +    case 0 | 8:
>> +    case 1:
>> +    case 2 | 4:
>> +    case 3:
>> +        break;
> 
> As far as I understand, bit 2 (i.e. the value 4) means sign extension.
> I find it strange to see "2" sign extended and "2 | 4" not sign extended.

Well, that's the instruction set.  LDBU and LDWU zero extend, and LDL sign
extends.  Thus the signed 8- and 16-bit loads, and the unsigned 32-bit load,
need extra help.
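
Spelling out the bit assignments as I read them (the encoding is the patch's;
the annotations are my interpretation):

    /* sizeop bits:
         bits 0-1 : log2 of the access size (0=8, 1=16, 2=32, 3=64 bits)
         bit 2 (4): sign-extend the result
         bit 3 (8): byte swap (a no-op for 8-bit accesses)
       LDBU/LDWU zero-extend and LDL sign-extends, so only
         0|4 and 0|4|8 (signed 8-bit), 1|4 (signed 16-bit), 2 (unsigned 32-bit)
       need the extra tgen_extend() fixup after the natural load.  */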

>> +static void tcg_out_st_sz(TCGContext *s, int sizeop, TCGReg ra, TCGReg rb,
>> +                          tcg_target_long disp)
>> +{
>> +    static const AlphaOpcode st_opc[4] = {
>> +        INSN_STB, INSN_STW, INSN_STL, INSN_STQ
>> +    };
>> +
>> +    tcg_out_mem_long(s, st_opc[sizeop & 3], ra, rb, disp);
>> +}
> 
> This is technically correct, but is there any reason why the bswap can be
> done in tcg_out_ld_sz() while tcg_out_st_sz() doesn't support it?

tcg_out_st_sz would need a temporary.  We only call it from qemu_st, where
it is obvious that A0 is free.

tcg_out_ld_sz wants to combine the bswap with the sign/zero-extension, and
does not need a temporary.

Probably better to rearrange things such that these functions are both folded
into their callers, so it doesn't seem so strange.

>> +    /* ??? Ideally we'd have access to Elf64_Sym.st_other, which
>> +       would tell us definitively whether the target function uses
>> +       the incoming PV value.  Make a simplifying assumption here
>> +       that all of the compiler-generated code that we're calling
>> +       either computes the GP from the PV in the first two insns
>> +       or it doesn't use the PV at all.  This assumption holds in
>> +       general for just about anything except some hand-written
>> +       assembly, which we're not calling into.  */
>> +
>> +    /* Note we access the insn stream as 16-bit units to avoid having
>> +       to mask out the offsets of the ldah and lda insns.  */
>> +    if (check[1] == 0x27bb && check[3] == 0x23bd) {
>> +        /* Skip the GP computation.  We can do this even if the
>> +           direct branch is out of range.  */
>> +        dest += 8;
>> +    }
>> +
>> +    disp = dest - ((tcg_target_long)s->code_ptr + 4);
>> +    if (disp >= -0x400000 && disp < 0x400000) {
>> +        tcg_out_fmt_br(s, INSN_BSR, TCG_REG_RA, disp >> 2);
>> +    } else {
>> +        tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_PV, dest);
>> +        tcg_out_fmt_jmp(s, INSN_JSR, TCG_REG_RA, TCG_REG_PV, dest);
>> +    }
> 
> I don't fully understand what you are trying to do. Is this optimization
> really worthwhile? Skipping two instructions in case of a call, which is
> something expensive, should not make any visible difference.

It's not about skipping two instructions, it's about turning an indirect
call into a direct call, and fixing up the $gp linkage.

Calling conventions say that, on entry to a function, $27 contains the
address of the called function.  If we merely turn the indirect call into
a direct call, we still have to load the function address.  That would
still be an improvement, as the branch would be predicted properly and
would not depend on the address load.  But if we do the same thing the
linker does with --relax, and skip the gp load at the beginning of the
function, then we don't need to load the address into $27 either.
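
For reference, the two magic constants compared against check[] decode as
the upper halves of exactly that two-insn GP prologue.  Alpha memory-format
insns are opcode:6 ra:5 rb:5 disp:16, so the upper 16 bits are
(opcode << 10) | (ra << 5) | rb; this reconstruction is mine, not part of
the patch:

    /* ldah $gp, hi($pv) : (0x09 << 10) | (29 << 5) | 27 == 0x27bb */
    /* lda  $gp, lo($gp) : (0x08 << 10) | (29 << 5) | 29 == 0x23bd */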

>> +    disp = val - ((tcg_target_long)s->code_ptr + 4);
>> +    if (disp >= -0x400000 && disp < 0x400000) {
>> +        tcg_out_fmt_br(s, INSN_BSR, TCG_REG_T9, disp >> 2);
>> +    } else {
>> +        tcg_out_movi(s, TCG_TYPE_PTR, TMP_REG1, val);
>> +        tcg_out_fmt_jmp(s, INSN_JSR, TCG_REG_T9, TMP_REG1, val);
>> +    }
>> +}
> 
> For host architectures which don't provide a direct way to do a division,
> TCG provides some helpers. This way these architectures do not have to
> re-invent the wheel.
> 
> For that, just define TCG_TARGET_HAS_div_i{32,64} to 0.

But I do have a division routine in libc with a special calling convention,
which does not clobber anything besides $at and $t9, my reserved temporaries.

If I use the tcg routines, I'll kill all of the call-clobbered registers.
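
For illustration, assuming the usual Alpha soft-divide convention (operands
in $24/$25, result in $27, return address in $23, i.e. t9 -- the exact
register set is my recollection, and __divqu is used only as an illustrative
name; the text above only states that $at and $t9 are clobbered), a division
compiles to roughly:

    /*  mov   dividend, $24
        mov   divisor,  $25
        bsr   $23, __divqu        # only $at and $t9 are clobbered
        mov   $27, result                                            */

    /* Every other live TCG value stays put.  A generic TCG helper is an
       ordinary C call, so the allocator must spill or discard all
       call-clobbered registers around it.  */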

>> +    case INDEX_op_not_i32:
>> +    case INDEX_op_not_i64:
>> +        if (const_args[1]) {
>> +            tcg_out_fmt_opi(s, INSN_ORNOT, TCG_REG_ZERO, arg1, arg0);
>> +        } else {
>> +            tcg_out_fmt_opr(s, INSN_ORNOT, TCG_REG_ZERO, arg1, arg0);
>> +        }
>> +        break;
> 
> Do we really want to handle the constant case? This seems a bit overkill
> as the constant propagation pass is able to handle it. Also it's better
> coded as a movi.

I think it was more a case of "everything else did".  Certainly we
should not expect to see it here anymore.

>> +static inline void flush_icache_range(unsigned long start, unsigned long stop)
>> +{
>> +    __asm__ __volatile__ ("call_pal 0x86");
>> +}
> 
> Is that the only way to do it, i.e. flushing the whole icache rather than
> part of it? In that case direct jumps might be more expensive than the
> default load-register + jump option.

Yes, it's the only way.
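
For context (not stated in the patch): PAL call 0x86 is the architected IMB
(instruction memory barrier) PALcode operation, which invalidates the entire
instruction cache, so the start/stop arguments cannot narrow the flush:

    static inline void flush_icache_range(unsigned long start, unsigned long stop)
    {
        /* "imb": flushes the whole I-cache; the range is ignored.  */
        __asm__ __volatile__ ("call_pal 0x86");
    }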

In the indirect case, we wind up spending 5 insns loading the absolute
address of the link, and 2 more on the branch.  It's not going to be quick.

I guess it all depends on how often the flushes occur.  It's been a long time
since I've been able to test this on hardware, as opposed to emulated within
qemu itself.


r~


