In what follows I use AT&T syntax for the AMD64 ISA.
To exchange the contents of, say, %rax and %rdx, you can do:
xorq %rax, %rdx
xorq %rdx, %rax
xorq %rax, %rdx
This trick is described in detail on Wikipedia: https://en.wikipedia.org/wiki/XOR_swap_algorithm. It relies on the fact that xor is associative and commutative, and that xoring with a value is its own inverse (i.e. xoring with the same value twice is the identity operation).
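To see why the three steps work, here is a minimal C sketch of the same sequence. The function name xor_swap and the a0/b0 notation for the initial values are mine, purely for illustration; the comments map each statement to the corresponding instruction above. The aliasing check is an extra precaution that the register version does not need, since two distinct registers can never be the same location.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the three-xor sequence.
   a0 and b0 denote the initial values of *a and *b. */
void xor_swap(uint64_t *a, uint64_t *b)
{
    if (a == b)
        return;   /* x ^ x == 0, so swapping a location with itself would zero it */

    *b ^= *a;     /* b = a0 ^ b0                  (xorq %rax, %rdx) */
    *a ^= *b;     /* a = a0 ^ (a0 ^ b0) = b0      (xorq %rdx, %rax) */
    *b ^= *a;     /* b = (a0 ^ b0) ^ b0 = a0      (xorq %rax, %rdx) */
}
```

For example, calling xor_swap(&x, &y) with x = 1 and y = 2 leaves x == 2 and y == 1.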
According to Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), each xor has a reciprocal throughput of 0.25 clock cycles on Intel's Skylake architecture. That figure only applies to independent instructions, though. Here the three xors form a dependency chain: each one needs the result of the previous one, and a register-to-register xor has a latency of 1 clock cycle. So I would estimate about 3 clock cycles for the whole swap.
On the other hand, a single xchg instruction between two registers takes just 1 clock cycle.