Re: [Qemu-devel] [PATCH for-2.4] tcg/i386: Implement trunc_shr_i32
From: Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH for-2.4] tcg/i386: Implement trunc_shr_i32
Date: Sun, 19 Jul 2015 13:26:45 +0200
User-agent: Mutt/1.5.23 (2014-03-12)
On 2015-07-18 23:18, Aurelien Jarno wrote:
> On 2015-07-18 08:58, Richard Henderson wrote:
> > Enforce the invariant that 32-bit quantities are zero extended
> > in the register. This avoids having to re-zero-extend at memory
> > accesses for 32-bit guests.
> >
> > Signed-off-by: Richard Henderson <address@hidden>
> > ---
> > Here's an alternative to the other things we've been considering.
> > We could even make this conditional on USER_ONLY if you like.
> >
> > This does in fact fix the mips test case. Consider the fact that
> > memory operations are probably more common than truncations, and
> > it would seem that we have a net size win by forcing the truncate
> > over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).
>
> I think we should go with your previous patch for 2.4, and think calmly
> about how to do that better for 2.5. It slightly increases the generated
> code, but only in bytes, not in number of instructions, so I don't think
> the performance impact is huge.
>
> > Indeed, for 2.5, we could look at dropping the existing zero-extend
> > from the softmmu path. Also for 2.5, split trunc_shr into two parts,
>
> From a quick look, we need to move the address to new registers anyway,
> so not zero-extending will mean adding the REXW prefix.
Well, looking in more detail, we can move one instruction from the
fast path to the slow path. Here is typical TLB code for a store:
fast-path:
mov %rbp,%rdi
mov %rbp,%rsi
shr $0x7,%rdi
and $0xfffffffffffff003,%rsi
and $0x1fe0,%edi
lea 0x4e68(%r14,%rdi,1),%rdi
cmp (%rdi),%rsi
mov %rbp,%rsi
jne 0x7f45b8bcc800
add 0x10(%rdi),%rsi
mov %ebx,(%rsi)
slow-path:
mov %r14,%rdi
mov %ebx,%edx
mov $0x22,%ecx
lea -0x156(%rip),%r8
push %r8
jmpq 0x7f45cb337010
If we know that %rbp is properly zero-extended when needed, we can change
the end of the fast path into:
cmp (%rdi),%rsi
jne 0x7f45b8bcc800
mov 0x10(%rdi),%rsi
mov %ebx,(%rsi,%rbp,1)
However, that means %rsi is no longer loaded with the address, so we
have to load it in the slow path. In the end this means moving one
instruction from the fast path to the slow path.
Now, I have no idea what would actually improve performance: a smaller
fast path, so there are fewer instructions to execute? Or smaller code in
general, so that the caches are better used?
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
address@hidden http://www.aurel32.net