RE: [avr-gcc-list] optimizer

From: Bernard Fouché
Subject: RE: [avr-gcc-list] optimizer
Date: Wed, 24 Nov 2004 15:08:25 +0100

Hi Björn.

Thanks for your answer. I started looking at the generated code mainly
because of the cost of 32-bit variables. For instance, if you apply bitmasks
to 32-bit variables, the generated code can be very large:

#include <stdint.h>

/* Reverse the byte order of a 32-bit value with masks and shifts. */
uint32_t bswap32(uint32_t x)
{
  return (  (((x) & 0xff000000) >> 24)
          | (((x) & 0x00ff0000) >>  8)
          | (((x) & 0x0000ff00) <<  8)
          | (((x) & 0x000000ff) << 24));
}

uint32_t bswap32(uint32_t x)
  ca:   ef 92           push    r14
  cc:   ff 92           push    r15
  ce:   0f 93           push    r16
  d0:   1f 93           push    r17
  d2:   7b 01           movw    r14, r22
  d4:   8c 01           movw    r16, r24
  return (  (((x) & 0xff000000) >> 24)
  d6:   89 2f           mov     r24, r25
  d8:   99 27           eor     r25, r25
  da:   aa 27           eor     r26, r26
  dc:   bb 27           eor     r27, r27
  de:   a8 01           movw    r20, r16
  e0:   97 01           movw    r18, r14
  e2:   20 70           andi    r18, 0x00       ; 0
  e4:   30 70           andi    r19, 0x00       ; 0
  e6:   50 70           andi    r21, 0x00       ; 0
  e8:   23 2f           mov     r18, r19
  ea:   34 2f           mov     r19, r20
  ec:   45 2f           mov     r20, r21
  ee:   55 27           eor     r21, r21
  f0:   82 2b           or      r24, r18
  f2:   93 2b           or      r25, r19
  f4:   a4 2b           or      r26, r20
  f6:   b5 2b           or      r27, r21
  f8:   a8 01           movw    r20, r16
  fa:   97 01           movw    r18, r14
  fc:   20 70           andi    r18, 0x00       ; 0
  fe:   40 70           andi    r20, 0x00       ; 0
 100:   50 70           andi    r21, 0x00       ; 0
 102:   54 2f           mov     r21, r20
 104:   43 2f           mov     r20, r19
 106:   32 2f           mov     r19, r18
 108:   22 27           eor     r18, r18
 10a:   82 2b           or      r24, r18
 10c:   93 2b           or      r25, r19
 10e:   a4 2b           or      r26, r20
 110:   b5 2b           or      r27, r21
 112:   5e 2d           mov     r21, r14
 114:   44 27           eor     r20, r20
 116:   33 27           eor     r19, r19
 118:   22 27           eor     r18, r18
 11a:   82 2b           or      r24, r18
 11c:   93 2b           or      r25, r19
 11e:   a4 2b           or      r26, r20
 120:   b5 2b           or      r27, r21
          | (((x) & 0x00ff0000) >>  8)
          | (((x) & 0x0000ff00) <<  8)
          | (((x) & 0x000000ff) << 24));
 122:   bc 01           movw    r22, r24
 124:   cd 01           movw    r24, r26
 126:   1f 91           pop     r17
 128:   0f 91           pop     r16
 12a:   ff 90           pop     r15
 12c:   ef 90           pop     r14
 12e:   08 95           ret

Of course, I ended up using the 32-bit swap shown as an example of
assembly language in the avr-libc documentation instead :-)
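
For comparison, a byte swap can also be written in plain C without any masks
or 32-bit shifts, by copying the four bytes through a union. This is only a
minimal sketch: the names u32_bytes_t and bswap32_bytes are hypothetical (they
are not from avr-libc), and whether it produces smaller code than the masked
version depends on the compiler version and options.

```c
#include <stdint.h>

/* Hypothetical helper type: view a 32-bit value as four separate bytes. */
typedef union {
    uint32_t u32;
    uint8_t  b[4];
} u32_bytes_t;

/* Reverse the byte order by copying bytes; no masks or 32-bit shifts.
   Works regardless of endianness, since it simply mirrors the byte layout. */
uint32_t bswap32_bytes(uint32_t x)
{
    u32_bytes_t in, out;

    in.u32 = x;
    out.b[0] = in.b[3];
    out.b[1] = in.b[2];
    out.b[2] = in.b[1];
    out.b[3] = in.b[0];
    return out.u32;
}
```

Each assignment moves a single byte, which matches the 8-bit registers of the
AVR more closely than shifting a 32-bit quantity does.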

I know nothing about the compiler internals. I just see that 32-bit variables
can be really expensive and should be avoided as much as possible, but
sometimes you don't have any choice. Judging from my own C code and the
resulting assembly, I didn't see many effective optimizations for 32-bit
variables, but rather situations where the cost of using them was very high.

I reached the point where it is more effective (for saving space) to write a
function that performs 32-bit additions in a single place than to let the
compiler generate the code for a 32-bit addition each time.
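
A minimal sketch of that idea follows. The name add32 is hypothetical, and the
noinline attribute is an assumption added here so that gcc keeps a single
copy of the addition instead of expanding it at every call site.

```c
#include <stdint.h>

/* Hypothetical helper: the one place where 32-bit addition code is emitted.
   noinline (a GCC attribute) asks the compiler not to expand it at call sites. */
#if defined(__GNUC__)
__attribute__((noinline))
#endif
void add32(uint32_t *dst, uint32_t inc)
{
    *dst += inc;   /* unsigned arithmetic wraps around modulo 2^32 */
}
```

A call site then reads add32(&counter, 5) and costs a pointer load, a constant
load and a call. Whether that is actually smaller than the inline addition has
to be checked in the generated listing.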

Or, instead of checking all 32 bits (I want to know whether the value has
changed by one), I use an 8-bit pointer to the lowest byte.
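
The low-byte trick could look like the sketch below. The function names are
hypothetical, and two assumptions are built in: the target is little-endian
(as AVR is), so the lowest byte is at the lowest address, and the value moves
by fewer than 256 between two checks, so the low byte alone is enough.

```c
#include <stdint.h>

/* Read the least significant byte through an 8-bit pointer.
   Assumes a little-endian target such as AVR. */
uint8_t low_byte(const uint32_t *p)
{
    return *(const uint8_t *)p;
}

/* Non-zero if *cur is exactly one past *prev, judged by the low byte only.
   Assumes the value changed by fewer than 256 since *prev was recorded. */
uint8_t advanced_by_one(const uint32_t *prev, const uint32_t *cur)
{
    /* The cast makes the 8-bit difference wrap, so 0xff -> 0x00 counts as +1. */
    return (uint8_t)(low_byte(cur) - low_byte(prev)) == 1;
}
```

Only two bytes are loaded and compared here, instead of all eight bytes of the
two 32-bit values.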

That leads to C code that is difficult to read and designed just for gcc. In
a few months someone else will read it, think I had too much beer when I
wrote this kind of thing, and rewrite it, only to see that the object size
explodes otherwise. [depressive mode off]

Another optimization I saw in the ICC compiler (I think) was that the
compiler, when asked to optimize for space, reused the end of another
function where possible if the code was the same. For instance, many
functions end with a series of 'pop' instructions, and since the register
allocation order seems to be designed for this purpose, it was possible to
branch to the end of another function to perform the same pops. (The same
goes for stack manipulation: once the new stack value is calculated, the code
can branch to somewhere that already updates SPL/SPH/SREG.)

Finally, I again ran into a situation where I have no .data segment but the
linker brings in the code to initialize this segment anyway.


-----Original Message-----
From: Haase Bjoern (PT-BEU/MKP5) * [mailto:address@hidden
Sent: Wednesday, 24 November 2004 13:56
To: address@hidden; Bernard Fouché; address@hidden
Subject: AW: [avr-gcc-list] optimizer


I have observed similar situations where the optimized generated code could
have been realized with far fewer registers, mainly when dealing with global
variables wider than 8 bits.

I have also been thinking about improving the compiler. I came to the
conclusion that this problem is probably difficult to solve. The core
problem seems to be that the compiler internally considers r24:r25:r26:r27
to be one single logical 32-bit register, r24. It seems that this logical
32-bit register is broken down into 4x8-bit objects only at the very last
step, i.e. when issuing assembler instructions.

In order to implement your suggested optimizations, it would probably be
necessary to convert all the 32-bit objects to 8-bit objects at an earlier
stage of the compilation, i.e. at the RTL level. This, however, would
probably make it almost impossible to generate object code that could be
used in a debugger. It might also prevent a lot of other useful optimization
steps that require the variables to be treated as monolithic 32-bit
quantities.

I have also come to the conclusion that the possible benefit of an early
32-bit -> 4x8-bit splitting mainly affects code that uses global variables
and does not help much in the more common case where variables are held in
registers. Your code could possibly be improved if you try to avoid global
variables.

IMHO the possible benefit of a 32 -> 4x8 splitting at the RTL level does not
really justify the amount of changes required in the compiler.


-----Original Message-----
From: address@hidden [mailto:address@hidden
On Behalf Of Bernard Fouché
Sent: Wednesday, 24 November 2004 7:18 PM
To: address@hidden
Subject: [avr-gcc-list] optimizer


I'm compiling with -Os for atmega64 with avr-gcc 3.4.2. When I have

uint32_t var;


the generated code is, for instance:

 var=(uint32_t)eeprom_read_byte((uint8_t *)EEPROM_PARM);
ldi     r24, 0x36       ; 54
ldi     r25, 0x00       ; 0
call    0xf9c0
eor     r25, r25
eor     r26, r26
eor     r27, r27
sts     0x046B, r24
sts     0x046C, r25
sts     0x046D, r26
sts     0x046E, r27

Could it instead be:
ldi     r24, 0x36       ; 54
ldi     r25, 0x00       ; 0
call    0xf9c0
sts     0x046B, r24
sts     0x046C, r1
sts     0x046D, r1
sts     0x046E, r1

That would save 6 bytes (the three 2-byte eor instructions)...
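
At the C level, one way to express the same intent is to store the four bytes
individually, so that the upper three are written as plain zero bytes. This is
only a sketch under assumptions: var32_t, var_bytes and store_byte_as_u32 are
hypothetical names, the layout assumes a little-endian target such as AVR, and
it has not been verified here that avr-gcc then stores r1 directly.

```c
#include <stdint.h>

/* Hypothetical byte-wise view of the 32-bit global from the example above. */
typedef union {
    uint32_t u32;
    uint8_t  b[4];
} var32_t;

var32_t var_bytes;

/* Store an 8-bit value zero-extended to 32 bits, one byte at a time.
   Assumes little-endian layout: b[0] is the least significant byte. */
void store_byte_as_u32(uint8_t value)
{
    var_bytes.b[0] = value;
    var_bytes.b[1] = 0;
    var_bytes.b[2] = 0;
    var_bytes.b[3] = 0;
}
```

A call such as store_byte_as_u32(eeprom_read_byte((uint8_t *)EEPROM_PARM))
would replace the assignment with the cast; the listing has to be inspected to
see whether the eor instructions really disappear.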


avr-gcc-list mailing list
address@hidden http://www.avr1.org/mailman/listinfo/avr-gcc-list