[Tinycc-devel] Generating better i386 code

From: Jason Hood
Subject: [Tinycc-devel] Generating better i386 code
Date: Thu, 26 Sep 2013 15:39:45 +1000
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130307 Thunderbird/17.0.4


It's rather funny timing that a couple of topics have come up about
optimization and exe size, as I've just spent the past couple of weeks
improving the generated i386 code (most of which would also apply to
x86-64, but I've not done that).  Not sure what the protocol regarding
patches is, so for now you'll find it on pastebin, based on the 0.9.26
release (as one big diff, I'm afraid).


BTW, it looks like the original source was tab-free, but some tabs have
snuck in, so you may want to (de)tabify the whole lot.  I've also made a
couple of spelling corrections.

First off, here are the results of building my tcc.exe (I'm on Windows,
so I'll also be using Intel syntax) with:

original tcc:                  225792 bytes
my tcc, without optimizations: 218624 bytes (3% reduction)
my tcc, with optimizations:    169472 bytes (25% reduction)

Build times are basically the same (using gcc, it was about 0.01s slower
to build with optimizations; using tcc, the optimized build of tcc
actually compiled the optimized version about 0.01s quicker than the
original).

The non-optimized version is smaller, as I've made some changes
independent of the optimizations:

* 4- & 8-byte structs copy as int/long long (all targets);
* passing structs <= 8 bytes will be treated as int/long long;
* returning structs <= 8 bytes is done via (edx)eax (PE only);
* added ebx to the register list (increasing prolog by one, to save it);
* use xor r,r instead of mov r,0;
* use the eax-specific form of instructions;
* use movzx after setxx instead of mov r,0 before;
* use movsx for char & short casts, instead of shl+sar;
* use the byte form of sub esp (via enhanced gadd_sp() function);
* gcall_or_jmp() uses symbols and locals directly (like call [ebp-4]);
* use test r,r instead of cmp r,0;
* use inc/dec r instead of add/sub r,-1;
* use movzx r,br/bw instead of and r,0xff/0xffff;
* or r,-1 (should it occur) replaces its mov r,whatever;
* multiply by 0 (should it occur) becomes xor r,r (replacing its mov);
* multiply by -1 becomes neg r;
* make use of imul r,const;
* simplify the float (not) equal test (remove cmp/xor, use jpo/jpe);
* fix add in the assembler, to use the byte form when appropriate.
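
To make the size argument behind a few of these substitutions concrete,
here is a minimal sketch of the encodings involved.  The function names
are hypothetical and are not taken from the patch; only the opcode bytes
are standard i386 encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers as they appear in i386 encodings. */
enum { EAX = 0, ECX = 1, EDX = 2, EBX = 3 };

/* mov r,imm32: opcode B8+r, then a little-endian imm32 (5 bytes). */
size_t emit_mov_imm32(uint8_t *buf, int r, uint32_t imm)
{
    assert(r >= 0 && r < 8);
    buf[0] = (uint8_t)(0xB8 + r);
    buf[1] = (uint8_t)imm;
    buf[2] = (uint8_t)(imm >> 8);
    buf[3] = (uint8_t)(imm >> 16);
    buf[4] = (uint8_t)(imm >> 24);
    return 5;
}

/* xor r,r: opcode 31, ModRM 11 r r (2 bytes). */
size_t emit_xor_self(uint8_t *buf, int r)
{
    assert(r >= 0 && r < 8);
    buf[0] = 0x31;
    buf[1] = (uint8_t)(0xC0 | (r << 3) | r);
    return 2;
}

/* cmp r,0 in the sign-extended imm8 form: 83 /7 ib (3 bytes). */
size_t emit_cmp_zero(uint8_t *buf, int r)
{
    assert(r >= 0 && r < 8);
    buf[0] = 0x83;
    buf[1] = (uint8_t)(0xF8 | r);
    buf[2] = 0x00;
    return 3;
}

/* test r,r: opcode 85, ModRM 11 r r (2 bytes). */
size_t emit_test_self(uint8_t *buf, int r)
{
    assert(r >= 0 && r < 8);
    buf[0] = 0x85;
    buf[1] = (uint8_t)(0xC0 | (r << 3) | r);
    return 2;
}

/* inc r: opcode 40+r (1 byte; this short form exists in 32-bit mode only). */
size_t emit_inc(uint8_t *buf, int r)
{
    assert(r >= 0 && r < 8);
    buf[0] = (uint8_t)(0x40 + r);
    return 1;
}

/* Hypothetical selector: load a constant, preferring the short form. */
size_t emit_load_const(uint8_t *buf, int r, uint32_t imm)
{
    return imm == 0 ? emit_xor_self(buf, r) : emit_mov_imm32(buf, r, imm);
}
```

Note that xor r,r modifies the flags while mov r,0 does not, so a real
code generator can only make this swap where the flags are dead.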

To support the optimizations, o() must only be used to start an
instruction.  I've added O<N> macros to combine <N> bytes into a single
int and an og() function to combine o() and g().
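
As an illustration only, the helpers might look something like the
following.  The o() and g() bodies follow my understanding of how tcc's
emitters behave (o() emits the low byte first and stops when the
remaining value is zero); the O<N> definitions and the og() signature
are assumptions, since the patch itself is not quoted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t code[64];   /* stand-in for the current text section */
static size_t ind;         /* output index, named after tcc's `ind` */

/* Emit a single byte. */
void g(int c)
{
    assert(ind < sizeof code);
    code[ind++] = (uint8_t)c;
}

/* Emit the bytes packed in c, low byte first, until the rest is zero. */
void o(unsigned int c)
{
    while (c) {
        g((int)(c & 0xff));
        c >>= 8;
    }
}

/* Hypothetical O<N> packers: first argument is the first byte emitted. */
#define O2(a, b)       ((unsigned)(a) | ((unsigned)(b) << 8))
#define O3(a, b, c)    (O2(a, b) | ((unsigned)(c) << 16))
#define O4(a, b, c, d) (O3(a, b, c) | ((unsigned)(d) << 24))

/* Hypothetical og(): instruction bytes via o(), then one byte via g(). */
void og(unsigned int c, int b)
{
    o(c);
    g(b);
}
```

With this shape, og(O2(0x83, 0xec), 0x10) would emit sub esp,16 as a
single instruction start.  The trailing g() also matters for zero
operands: o() alone cannot emit a final 00 byte, whereas
og(O2(0x83, 0xf8), 0) can.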

Optimizations are enabled by using -O, but I neglected to add them to
the help:

    -Of - functions
    -Oj - jumps
    -Om - multiplications and pointer division
    -Or - registers
    -O -O2 -Ox - all optimizations
    -O1 - all but -Oj (i.e. -Ofmr)
    -Os - all but -Om (i.e. -Ofjr; also removes PE function alignment)
    -O0 - no optimizations (default)
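
The table above can be read as a small flag parser.  This is a sketch
under assumptions, not the patch's code: the enum names, the OPT_SIZE
bit and the parse_opt() function are all hypothetical, and it assumes
the letters may be combined freely (as "-Ofmr" suggests).

```c
#include <assert.h>
#include <string.h>

enum {
    OPT_FUNC = 1 << 0,   /* -Of */
    OPT_JUMP = 1 << 1,   /* -Oj */
    OPT_MUL  = 1 << 2,   /* -Om */
    OPT_REG  = 1 << 3,   /* -Or */
    OPT_SIZE = 1 << 4,   /* hypothetical: -Os also drops PE alignment */
    OPT_ALL  = OPT_FUNC | OPT_JUMP | OPT_MUL | OPT_REG
};

/* Parse the text following "-O".  Returns a flag mask, or -1 if the
   text is not recognised. */
int parse_opt(const char *s)
{
    int flags = 0;

    assert(s != NULL);
    if (*s == '\0' || !strcmp(s, "2") || !strcmp(s, "x"))
        return OPT_ALL;
    if (!strcmp(s, "0"))
        return 0;
    if (!strcmp(s, "1"))
        return OPT_ALL & ~OPT_JUMP;
    if (!strcmp(s, "s"))
        return (OPT_ALL & ~OPT_MUL) | OPT_SIZE;

    for (; *s; s++) {
        switch (*s) {
        case 'f': flags |= OPT_FUNC; break;
        case 'j': flags |= OPT_JUMP; break;
        case 'm': flags |= OPT_MUL;  break;
        case 'r': flags |= OPT_REG;  break;
        default:  return -1;
        }
    }
    return flags;
}
```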

-Of will minimize the prolog and epilog.  The full prolog is jumped over
as usual; then, when the function is finished, only the prolog that is
actually needed is written, the body is moved back to follow it
(adjusting relocations to suit) and the needed epilog is written.  As
suggested above, I've also aligned PE functions to
16 bytes - this always happens, unless -Os is used (maybe it's not needed,
but I'm so used to seeing it in disassembly listings, it just looks wrong
without it :)).
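
The move-back step can be sketched roughly as follows.  Everything here
is hypothetical (the FULL_PROLOG size, the function names and the flat
array of relocation offsets); it only illustrates the idea of writing a
shorter prolog, sliding the body down and fixing up relocation offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FULL_PROLOG 9   /* hypothetical space reserved for the full prolog */

/* Round a section offset up to the next multiple of 16. */
size_t align16(size_t pos)
{
    return (pos + 15) & ~(size_t)15;
}

/*
 * code    : section buffer
 * func    : offset of the function start (FULL_PROLOG bytes reserved)
 * end     : offset just past the function body
 * prolog  : the n bytes of prolog actually needed (n <= FULL_PROLOG)
 * relocs  : offsets of relocation sites; those inside the body move too
 * Returns the new end offset, where the epilog would then be written.
 */
size_t shrink_prolog(uint8_t *code, size_t func, size_t end,
                     const uint8_t *prolog, size_t n,
                     size_t *relocs, size_t nrelocs)
{
    size_t body = func + FULL_PROLOG;
    size_t delta = FULL_PROLOG - n;
    size_t i;

    assert(n <= FULL_PROLOG && body <= end);
    memcpy(code + func, prolog, n);                     /* needed prolog */
    memmove(code + func + n, code + body, end - body);  /* slide body back */
    for (i = 0; i < nrelocs; i++)
        if (relocs[i] >= body && relocs[i] < end)
            relocs[i] -= delta;                         /* follow the move */
    return end - delta;
}
```

Relative jumps within the body need no adjustment in this model, since
the jump and its target move by the same amount.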

-Oj will optimize various usages of jump.  Jumps to jmp will be replaced
with the destination of the jmp; resulting skipped jmps will be removed.
Common code before a jmp and its destination (up to eight instructions,
the reason for the o() restriction) will result in removal of the code
before the jmp, changing the jmp destination.  Casting to boolean will
use setxx/movzx or stc/sbb/inc when appropriate.  Conditional jumps
over a jmp will invert the condition and change the destination,
removing the jmp.  Jumps to the epilog will be replaced with the epilog
itself (with -Os, only if it's one or two bytes).  Appropriate near jumps
will be converted to short.
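
Three of these transformations can be shown on a toy instruction list.
This is only a sketch: the struct, the kinds and the function names are
hypothetical, and the patch works on emitted machine code rather than
on an array like this.  The one hard fact used is that x86 condition
codes come in pairs that differ only in bit 0, so inverting a condition
is cc ^ 1.

```c
#include <assert.h>

enum kind { K_OTHER, K_JMP, K_JCC, K_NOP };

struct insn {
    enum kind kind;
    int cc;       /* x86 condition code 0..15, for K_JCC only */
    int target;   /* index of the destination instruction */
};

/* Follow a chain of unconditional jumps to its final destination.
   The hop limit guards against a jmp that loops back on itself. */
int thread_target(const struct insn *code, int n, int target)
{
    int hops = 0;

    while (target >= 0 && target < n &&
           code[target].kind == K_JMP && hops++ < n)
        target = code[target].target;
    return target;
}

/* Invert a condition code (e.g. 4 = e/z becomes 5 = ne/nz). */
int invert_cc(int cc)
{
    assert(cc >= 0 && cc < 16);
    return cc ^ 1;
}

/* Rewrite "jcc L1; jmp L2; L1:" at index i as "jncc L2".
   Returns 1 if the rewrite was applied; the jmp slot becomes K_NOP. */
int invert_over_jmp(struct insn *code, int n, int i)
{
    if (i + 1 < n && code[i].kind == K_JCC &&
        code[i + 1].kind == K_JMP && code[i].target == i + 2) {
        code[i].cc = invert_cc(code[i].cc);
        code[i].target = code[i + 1].target;
        code[i + 1].kind = K_NOP;
        return 1;
    }
    return 0;
}

/* A near jump can become short if its displacement fits in a signed byte. */
int fits_short(long disp)
{
    return disp >= -128 && disp <= 127;
}
```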

-Om will use lea (possibly followed by add, shl or another lea) to do
appropriate constant multiplication.  Pointer division is done by
reciprocal multiplication (which should probably also be used for normal
division; I don't know why I didn't).
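
For pointer differences the division is always exact, which allows one
particularly simple form of reciprocal multiplication: shift out the
power of two, then multiply by the inverse of the odd part modulo 2^32.
The sketch below shows that variant; whether the patch uses it or the
rounded-reciprocal form is not stated here, and the function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of an odd d modulo 2^32, by Newton iteration.
   x = d is already correct to 3 bits; each step doubles that. */
uint32_t inverse32(uint32_t d)
{
    uint32_t x = d;
    int i;

    assert(d & 1);
    for (i = 0; i < 4; i++)     /* 3 -> 6 -> 12 -> 24 -> 48 bits */
        x *= 2 - d * x;
    return x;
}

/* Divide a byte difference by the element size, assuming the difference
   is an exact multiple of size (true for pointers into the same array). */
int32_t ptr_diff_elems(int32_t diff, uint32_t size)
{
    int shift = 0;

    assert(size != 0);
    while (!(size & 1)) {       /* split size into odd * 2^shift */
        size >>= 1;
        shift++;
    }
    assert(shift < 31);
    diff /= (int32_t)1 << shift;    /* exact, so no rounding issue */
    /* Wraparound multiply; the final cast assumes two's complement. */
    return (int32_t)((uint32_t)diff * inverse32(size));
}
```

For a 12-byte element this amounts to a shift by 2 followed by a
multiply by 0xAAAAAAAB, instead of an idiv.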

-Or improves register usage.  Previous values are remembered (this would
ideally be done as part of tccgen).  Appropriate function arguments are
pushed directly.  A load const/store pair stores the const directly.
Suitable adds are turned into a displacement (greatly improving struct
and long long access).
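
Remembering previous values could be modelled along these lines.  This
is a guess at the general shape, not the patch's data structure: the
struct, the function names and the rule for when to forget values are
all assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define NB_REGS 4   /* eax, ecx, edx, ebx */

struct reg_cache {
    int known[NB_REGS];        /* nonzero if value[] is valid */
    uint32_t value[NB_REGS];   /* constant last loaded into the register */
};

/* Forget everything, e.g. at a label or after a call. */
void cache_reset(struct reg_cache *c)
{
    int r;

    for (r = 0; r < NB_REGS; r++)
        c->known[r] = 0;
}

/* Record that r now holds something the cache cannot describe. */
void cache_clobber(struct reg_cache *c, int r)
{
    assert(r >= 0 && r < NB_REGS);
    c->known[r] = 0;
}

/* Returns 1 if a load of imm into r must be emitted, or 0 if r is
   already known to hold imm and the load can be skipped. */
int need_load_const(struct reg_cache *c, int r, uint32_t imm)
{
    assert(r >= 0 && r < NB_REGS);
    if (c->known[r] && c->value[r] == imm)
        return 0;
    c->known[r] = 1;
    c->value[r] = imm;
    return 1;
}
```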

A couple of things I didn't do were combining arithmetic operators (even
though register displacement combines adds) and removing unused locals
(remembering register values means a temporary that is written to
probably won't be read from).  I also haven't done any of this for
x86-64 (in particular, returning small structs should be done, as that's
expected by Windows).

In addition, I've tweaked the Win32 build.  build-tcc.bat will determine
the target based on gcc itself (although it will need modification if
you still want to support command.com).  I've separated lib/chkstk.S into
lib/seh.S (assuming only 32-bit) and lib/sjlj.S (assuming only 64-bit);
however, I didn't update the configure process, only build-tcc.bat.

