Re: [Lightning] [Feature request]: xchgr

In what follows I am using AT&T syntax for the AMD64 ISA:

In order to exchange the contents of, say, %rax and %rdx, you can do

xorq %rax, %rdx

xorq %rdx, %rax

xorq %rax, %rdx

It is described in detail on Wikipedia: https://en.wikipedia.org/wiki/XOR_swap_algorithm. It relies on the fact that the xor operation is associative and commutative and that xoring with a value is its own inverse (i.e. xoring with the same value twice is just the identity operation).

By Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), each xor takes 0.25 clock cycles on Intel's Skylake architecture. However, the xors form a dependency chain, so I would guess one has to add 2 clock cycles for the dependencies. This gives an estimate of 2.75 clock cycles.

On the other hand, a single xchg instruction just takes 1 clock cycle.

Using a temporary register (say, %rsi), we would code:

movq %rax, %rsi

movq %rdx, %rax

movq %rsi, %rdx

Each move accounts for at most 0.25 clock cycles, dependencies do not add more than 0-1 clock cycle per dependency. Thus, this piece of code is faster than the xor swap algorithm. However, we have to resort to the xor trick when register pressure is so high that there is no free temporary register.

When GNU lightning supports an xchgr instruction and the target processor has no xchg primitive, GNU lightning can emit code corresponding to three moves or three xors depending on whether it can find an unused register or not.

Marc

Paul Cercueil <address@hidden> schrieb am Mi., 4. Okt. 2017 um 13:30 Uhr:

Hi,

I'm curious about that XOR trick, could you describe it?
Is it faster than just using a temporary register for the swap?

And I agree that it would be a useful addition.

Thanks,
-Paul
Le 3 oct. 2017 3:16 PM, Marc Nieper-Wißkirchen <address@hidden> a écrit :
>
> During register allocation when compiling a high level language to a low-level intermediate representation (say, GNU lightning's instruction set), it is often needed to swap the contents of two or more registers. When there is no scratch register available (for example, due to high register pressure), this can be done using the xor trick.
>
> The x64 ISA, however, also provides the xchg instruction which performs better than three xors in a row. Other ISAs may provide similar instructions. Thus I would like to suggest to add an xchgr instruction to GNU lightning's instruction set that exchanges the contents of two registers. By default, it can be implemented with the xor trick. On specific architectures, like the x64, it can be compiled to the native xchg instruction.
>
> Marc

From:	Marc Nieper-Wißkirchen
Subject:	Re: [Lightning] [Feature request]: xchgr
Date:	Wed, 04 Oct 2017 11:57:17 +0000