


From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH 7/7] tcg-i386: Perform tail call to qemu_ret_ld*_mmu
Date: Thu, 29 Aug 2013 09:08:20 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130805 Thunderbird/17.0.8

On 08/29/2013 08:31 AM, Paolo Bonzini wrote:
> Il 27/08/2013 23:46, Richard Henderson ha scritto:
>> This does require the fast path always load to the function return
>> value register, but apparently the loaded value usually needs to be
>> spilled back to its memory slot anyway so the change in register
>> does not really change much.
> 
> Even for something like
> 
>     mov (%rdi), %rax
>     add (%r8), %rax
> 
> ?  Memory operands should avoid the need to spill anything.

No, not that kind of spilling.  The kind in which subsequent operations
in the TB do something that requires the register's backing store within
env to be up to date.  Thus a "spill" from the register to its memory slot.

I could have used better verbiage, I guess...

In theory, having the load go to a call-saved register instead of the
call-clobbered return-value register would mean that (even after updating the
memory backing store) a future read-only use of that register within the basic
block could reuse the value still held in the register.

It's just that I never saw that happen in the couple of MB of dumps that
I looked at.  Thus my conclusion that it's rare enough not to worry about.

> Is this change really an advantage considering the additional icache
> footprint of the new helpers?

We *are* getting fractionally smaller TBs, for the low low price of three
functions like

> 00000000005f2df0 <helper_ret_ldsw_mmu>:
>   5f2df0:       48 83 ec 18             sub    $0x18,%rsp
>   5f2df4:       64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
>   5f2dfb:       00 00 
>   5f2dfd:       48 89 44 24 08          mov    %rax,0x8(%rsp)
>   5f2e02:       31 c0                   xor    %eax,%eax
>   5f2e04:       e8 27 fe ff ff          callq  5f2c30 <helper_ret_lduw_mmu>
>   5f2e09:       48 8b 54 24 08          mov    0x8(%rsp),%rdx
>   5f2e0e:       64 48 33 14 25 28 00    xor    %fs:0x28,%rdx
>   5f2e15:       00 00 
>   5f2e17:       48 0f bf c0             movswq %ax,%rax
>   5f2e1b:       75 05                   jne    5f2e22 
> <helper_ret_ldsw_mmu+0x32>
>   5f2e1d:       48 83 c4 18             add    $0x18,%rsp
>   5f2e21:       c3                      retq   
>   5f2e22:       e8 79 a7 e1 ff          callq  40d5a0 <address@hidden>
>   5f2e27:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
>   5f2e2e:       00 00 

Spend 64 * 3 bytes on new helpers, and save 10 * zillion bytes on the TBs.  In
all likelihood we're saving icache rather than using additional.
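For reference, the signed helpers are thin wrappers that call the unsigned
load and sign-extend the result (that's the movswq in the listing above).  A
minimal C sketch of the 16-bit case, with a hypothetical stub standing in for
the real TLB-walking helper_ret_lduw_mmu and its arguments:

```c
#include <stdint.h>

/* Hypothetical stand-in for helper_ret_lduw_mmu: the real helper walks
 * the softmmu TLB and returns the loaded 16-bit value zero-extended;
 * here we just mask to model that zero-extended return value. */
static uint64_t helper_ret_lduw_mmu_stub(uint64_t raw)
{
    return raw & 0xffffULL;
}

/* The signed variant is just a call to the unsigned helper followed by
 * a 16->64 bit sign extension, exactly the two-instruction body shown
 * in the -fstack-protector disassembly below. */
static uint64_t helper_ret_ldsw_mmu_stub(uint64_t raw)
{
    return (uint64_t)(int64_t)(int16_t)helper_ret_lduw_mmu_stub(raw);
}
```

The sign extension happens in the wrapper's caller-visible return value, so
the fast path can treat signed and unsigned loads identically up to that
point.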

Where I do think there's cause for treading carefully is wrt Aurelien's
statement that "it's the slow path exception, the call-return stack doesn't
matter".  Alternatively, given that it *is* the slow path, who cares if the
return from the helper immediately hits a branch, rather than tail-calling
back into the fast path, if the benefit is that the call-return stack is still
valid above the code_gen_buffer after a simple tlb miss?


As an aside, why oh why do we default to -fstack-protector-all?  Do we
really need checks in every single function, as opposed to only those that
actually do something with arrays?  Switch to plain -fstack-protector so we have

> 00000000005a1fd0 <helper_ret_ldsw_mmu>:
>   5a1fd0:       48 83 ec 08             sub    $0x8,%rsp
>   5a1fd4:       e8 57 fe ff ff          callq  5a1e30 <helper_ret_lduw_mmu>
>   5a1fd9:       48 83 c4 08             add    $0x8,%rsp
>   5a1fdd:       48 0f bf c0             movswq %ax,%rax
>   5a1fe1:       c3                      retq   
>   5a1fe2:       66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw 
> %cs:0x0(%rax,%rax,1)
>   5a1fe9:       1f 84 00 00 00 00 00 

and then let's talk about icache savings...


r~


