[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH 1/9] TCG "sync" op
From: |
Aurelien Jarno |
Subject: |
Re: [Qemu-devel] [PATCH 1/9] TCG "sync" op |
Date: |
Fri, 16 Oct 2009 19:29:08 +0200 |
User-agent: |
Mutt/1.5.18 (2008-05-17) |
On Fri, Oct 16, 2009 at 06:37:31PM +0200, Ulrich Hecht wrote:
> On Friday 16 October 2009, Aurelien Jarno wrote:
> > IMHO, the correct way to do it is to use the following code, assuming
> > you want to use 64-bit TCG regs to hold 32-bit values (that's
> > something that is not really clear in your next patch):
> >
> > - for register load:
> > | static TCGv load_reg(int reg)
> > | {
> > | TCGv r = tcg_temp_new_i64();
> > | tcg_gen_ext32u_i64(r, tcgregs[reg]);
> > | return r;
> > | }
> > |
> > | static void store_reg32(int reg, TCGv v)
> > | {
> > | tcg_gen_ext32u_i64(v, v); /* may be optional */
> > | tcg_gen_andi_i64(tcgregs[reg], tcgregs[reg],
> > | 0xffffffff00000000ULL); tcg_gen_or_i64(tcgregs[reg], tcgregs[reg],
> > | v);
> > | }
>
> This is _extremely_ detrimental to performance. The point of the sync op
> is that in most cases it's a nop because registers are usually used with
> the same bitness again and again. The sign extension/masking stuff is
I don't really understand how it can be simply be a nop, given it calls
temp_save. It means if a register is used twice in a BB, it is fetch
again from memory.
> done every time a register is accessed as 32 bits, which is the most
> common case. Compare the translation of the following sequence of
> instructions:
>
> IN: _dl_aux_init
> 0x0000000080044ff6: lhi %r4,0
> 0x0000000080044ffa: lhi %r5,0
> 0x0000000080044ffe: lhi %r0,0
This example is a bit biased, as registers are only saved, and never
reused. Let's comment on it though.
> with sync:
>
> OP:
> mov_i32 loc0,global_cc
> movi_i32 tmp1,$0x0
> sync_i64 R4
> mov_i32 r4,tmp1
> movi_i32 tmp1,$0x0
> sync_i64 R5
> mov_i32 r5,tmp1
> movi_i32 tmp1,$0x0
> sync_i64 R0
> mov_i32 r0,tmp1
> mov_i32 global_cc,loc0
> movi_i64 tmp2,$0x80045002
> st_i64 tmp2,env,$0x158
> exit_tb $0x0
>
> OUT: [size=61]
> 0x6019a030: mov 0x160(%r14),%ebp
> 0x6019a037: mov %rbp,%rbx
> 0x6019a03a: mov $0x80045002,%r12d
> 0x6019a040: mov %r12,0x158(%r14)
> 0x6019a047: mov %ebp,0xd1a0(%r14)
> 0x6019a04e: mov %ebx,0x160(%r14)
> 0x6019a055: xor %ebp,%ebp
> 0x6019a057: mov %ebp,(%r14)
> 0x6019a05a: xor %ebp,%ebp
> 0x6019a05c: mov %ebp,0x20(%r14)
> 0x6019a060: xor %ebp,%ebp
> 0x6019a062: mov %ebp,0x28(%r14)
> 0x6019a066: xor %eax,%eax
> 0x6019a068: jmpq 0x621dc8ce
>
>
> with sign extension:
>
> OP:
> mov_i32 loc0,global_cc
> movi_i32 tmp1,$0x0
> ext32u_i64 tmp1,tmp1
> movi_i64 tmp2,$0xffffffff00000000
> and_i64 R4,R4,tmp2
> or_i64 R4,R4,tmp1
> movi_i32 tmp1,$0x0
> ext32u_i64 tmp1,tmp1
> movi_i64 tmp2,$0xffffffff00000000
> and_i64 R5,R5,tmp2
> or_i64 R5,R5,tmp1
> movi_i32 tmp1,$0x0
> ext32u_i64 tmp1,tmp1
> movi_i64 tmp2,$0xffffffff00000000
> and_i64 R0,R0,tmp2
> or_i64 R0,R0,tmp1
> mov_i32 global_cc,loc0
> movi_i64 tmp2,$0x80045002
> st_i64 tmp2,env,$0x158
> exit_tb $0x0
>
> OUT: [size=126]
> 0x6019af10: mov 0x160(%r14),%ebp
> 0x6019af17: xor %ebx,%ebx
> 0x6019af19: mov %ebx,%ebx
> 0x6019af1b: mov 0x20(%r14),%r12
> 0x6019af1f: mov $0xffffffff00000000,%r13
> 0x6019af29: and %r13,%r12
> 0x6019af2c: or %rbx,%r12
> 0x6019af2f: xor %ebx,%ebx
> 0x6019af31: mov %ebx,%ebx
> 0x6019af33: mov 0x28(%r14),%r13
> 0x6019af37: mov $0xffffffff00000000,%r15
> 0x6019af41: and %r15,%r13
> 0x6019af44: or %rbx,%r13
> 0x6019af47: xor %ebx,%ebx
> 0x6019af49: mov %ebx,%ebx
> 0x6019af4b: mov (%r14),%r15
> 0x6019af4e: mov $0xffffffff00000000,%r10
> 0x6019af58: and %r10,%r15
> 0x6019af5b: or %rbx,%r15
> 0x6019af5e: mov %rbp,%rbx
> 0x6019af61: mov $0x80045002,%r10d
> 0x6019af67: mov %r10,0x158(%r14)
> 0x6019af6e: mov %ebp,0xd1a0(%r14)
> 0x6019af75: mov %ebx,0x160(%r14)
> 0x6019af7c: mov %r15,(%r14)
> 0x6019af7f: mov %r12,0x20(%r14)
> 0x6019af83: mov %r13,0x28(%r14)
> 0x6019af87: xor %eax,%eax
> 0x6019af89: jmpq 0x621dd78e
>
> Its more than twice the size and has ten memory accesses instead of
> seven.
>
There is clearly a huge impact for saving the globals, that I didn't
expect. I still believe a sync op as implemented in your patch is in
opposite direction to TCG's philosophy, probably does not work with
fixed registers, and I don't understand what is the gain compared to
the use of tcg_gen_ld/st. Maybe you can post a dump of such a version
so that we can see the benefit?
I think instead of a sync op, we should think of a way to use only the
low part of a global variable, maybe by adding new ops. That can also
help to improve the concat_i32_i64 op on 64-bit hosts. Does someone has
any idea?
Aurelien
--
Aurelien Jarno GPG: 1024D/F1BCDB73
address@hidden http://www.aurel32.net
Re: [Qemu-devel] [PATCH 1/9] TCG "sync" op, Aurelien Jarno, 2009/10/16
Re: [Qemu-devel] [PATCH 1/9] TCG "sync" op, Edgar E. Iglesias, 2009/10/17
Re: [Qemu-devel] [PATCH 1/9] TCG "sync" op, Ulrich Hecht, 2009/10/19