Re: [Qemu-devel] [PATCH] tcg-i386: Use MOVBE if available


From: Richard Henderson
Subject: Re: [Qemu-devel] [PATCH] tcg-i386: Use MOVBE if available
Date: Sun, 22 Dec 2013 08:38:40 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

On 12/22/2013 04:24 AM, Aurelien Jarno wrote:
> On Sat, Dec 21, 2013 at 03:08:21PM +0100, Paolo Bonzini wrote:
>> Il 21/12/2013 00:00, Richard Henderson ha scritto:
>>> +        if (real_bswap && have_movbe) {
>>> +            tcg_out_modrm_offset(s, OPC_MOVBE_GyMy + P_DATA16 + seg,
>>> +                                 datalo, base, ofs);
>>> +            tcg_out_ext16u(s, datalo, datalo);
>>
>> Do partial register stalls still exist on Atom and Haswell?  I don't
>> remember exactly what you had to do to prevent them, but IIRC you first
>> moved zero to the register and then overwrote the low 16 bits.
> 
> Note that for unsigned 16-bit load you can do either movzw + bswap or 
> movbe + movzw.

From the July 2013 Intel Optimization Reference Manual:

"Delay of partial register stall is small in ... Intel Core and NetBurst
microarchitectures".  And for Atom "partial register access does not cause
additional delay".

While I agree with Paolo that xor + movbe is probably technically the best, one
has to check for output register overlap and have a fallback.  Thus I think we
can just discard that idea.
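
To illustrate the overlap problem, here is a minimal sketch in C. It is not
code from the patch: the enum, the function name and the register numbering
are hypothetical. The point is only that zeroing the destination first would
destroy the address when the destination register is also used to form it, so
a fallback sequence is needed in that case.

```c
#include <assert.h>

/* Hypothetical names, for illustration only; not part of tcg/i386. */
enum be16_load_strategy {
    BE16_XOR_THEN_MOVBE,   /* xor dst,dst ; movbe (addr),dst16 */
    BE16_MOVBE_THEN_MOVZW  /* movbe (addr),dst16 ; movzwl dst16,dst */
};

/*
 * Pick a sequence for an unsigned 16-bit big-endian load.
 * 'index' is -1 when the address has no index register.
 * If dst also holds part of the address, xor-ing it first would
 * clobber the address before the load, so fall back.
 */
static enum be16_load_strategy
pick_be16_load_strategy(int dst, int base, int index)
{
    if (dst == base || (index >= 0 && dst == index)) {
        return BE16_MOVBE_THEN_MOVZW;
    }
    return BE16_XOR_THEN_MOVBE;
}

/* Small self-check of the two cases described above. */
static void check_strategy(void)
{
    assert(pick_be16_load_strategy(0, 3, -1) == BE16_XOR_THEN_MOVBE);
    assert(pick_be16_load_strategy(3, 3, -1) == BE16_MOVBE_THEN_MOVZW);
}
```

Always emitting movbe + movzw avoids the extra check entirely, which is the
simplification argued for here.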

As for movzw + bswap, that forces a partial register stall on subsequent 32-bit
access to the value, while movbe + movzw does not.  In the latter case we refer
to the unmerged portion of the register in the movzw.

But the optimization note suggests that it shouldn't matter much either way.
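
For reference, the two sequences compared above can be modelled in plain C as
follows. This is only a sketch of the register-level effect, assuming that the
16-bit "bswap" is done as a 16-bit rotate by 8 (which writes only the low half
of the register); the function names are made up for the example. Both
sequences produce the same zero-extended value. They differ in which
instruction does the partial write and which does the full 32-bit write last.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a little-endian 16-bit memory read, as x86 performs it. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* movzwl (mem),reg ; rolw $8,reg16
 * The last instruction writes only the low 16 bits of the register. */
static uint32_t load_be16_movzw_rol(const uint8_t *p)
{
    uint32_t reg = read_le16(p);                    /* full 32-bit write */
    uint16_t lo = (uint16_t)reg;
    lo = (uint16_t)((lo << 8) | (lo >> 8));         /* swap the two bytes */
    reg = (reg & 0xffff0000u) | lo;                 /* partial write */
    return reg;
}

/* movbe (mem),reg16 ; movzwl reg16,reg
 * The partial write comes first and leaves stale upper bits behind;
 * the final movzwl reads only the low 16 bits and rewrites all 32. */
static uint32_t load_be16_movbe_movzw(const uint8_t *p, uint32_t stale)
{
    uint16_t raw = read_le16(p);
    uint16_t swapped = (uint16_t)((raw << 8) | (raw >> 8));
    uint32_t reg = (stale & 0xffff0000u) | swapped; /* partial write */
    reg = (uint16_t)reg;                            /* full 32-bit write */
    return reg;
}

/* Both sequences must agree regardless of the old register contents. */
static void check_sequences_agree(const uint8_t *p, uint32_t stale)
{
    assert(load_be16_movzw_rol(p) == load_be16_movbe_movzw(p, stale));
}
```

In the second function the stale upper bits never reach a consumer, which is
why a later 32-bit read does not depend on a merged partial register.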


r~
