[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [avr-gcc-list] Re: C vs. assembly performance

From: Georg-Johann Lay
Subject: Re: [avr-gcc-list] Re: C vs. assembly performance
Date: Sat, 28 Feb 2009 17:00:17 +0100
User-agent: Mozilla Thunderbird 1.0.7 (Windows/20050923)

Nicholas Vinen schrieb:
Georg-Johann Lay wrote:

If you are sure it is really some performance issue/regression and not
due to some language standard implication, you can add a report to

so that the subject won't be forgotten. Also mind

And of course, you can ask questions here. In that case it is helpful
if you can manage to simplify the source to a small piece of code that
triggers the problem and allows others to reproduce the problem. (i.e.
no #include in the code, no ... (except for varargs), a.s.o).

Snippets of .s may point to the problem when you add -dp -fverbose-asm

And there are lots of places where avr-gcc produces suboptimal or even
bad code, so feedback is welcome.

But note that just a few guys are working on the AVR part of gcc.
I would do more if I had the time (and the support of some gurus to
ask questions on internals then and when...)

OK, I only spent a few minutes looking at old code and I found some
obviously sub-optimal results. It distills down to this:

#include <avr/io.h>

int main(void) {
  unsigned long packet = 0;

  while(1) {
    if( !(PINC & _BV(PC2)) ) {
      packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
    PORTB = packet;

avr-gcc is: avr-gcc (Gentoo 4.3.3 p1.0, pie-10.1.5) 4.3.3

The avr/io stuff is just so that it won't optimise the code away to nothing.

Please avoid the #include stuff. You can use source like this:

#define PINC  (*((unsigned char volatile*) 0x20))
#define PORTB (*((unsigned char volatile*) 0x21))

void foo ()
    unsigned long packet = 0;

        if (!(PINC & (1 << 2)))
            packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
        PORTB = packet;

I tried compiling it with both -Os and -O2:

avr-gcc -g -dp -fverbose-asm -Os -S -mmcu=atmega48 -o test test.c

The result includes this:

        lsl r18  ;  packet       ;  50  *ashlsi3_const/2        [length = 4]
        rol r19  ;  packet
        rol r20  ;  packet
        rol r21  ;  packet
        in r24,38-0x20   ;  D.1214,      ;  16  *movqi/4        [length = 1]
        lsr r24  ;  D.1214       ;  17  lshrqi3/3       [length = 1]
        ldi r25,lo8(0)   ; ,     ;  48  *movqi/2        [length = 1]
        ldi r26,lo8(0)   ; ,     ;  46  *movhi/4        [length = 2]
        ldi r27,hi8(0)   ; ,
        andi r24,lo8(1)  ;  tmp52,       ;  19  andsi3/2        [length = 4]
        andi r25,hi8(1)  ;  tmp52,
        andi r26,hlo8(1)         ;  tmp52,
        andi r27,hhi8(1)         ;  tmp52,
        or r18,r24       ;  packet, tmp52        ;  20  iorsi3/1        [length 
= 4]
        or r19,r25       ;  packet, tmp52
        or r20,r26       ;  packet, tmp52
        or r21,r27       ;  packet, tmp52

The problem, it seems, is that the compiler doesn't realize that the
right hand side of the expression can only have any non-zero values in
the bottom 8 bits, since it's an unsigned char which is being implicitly
expanded to 32 bits for the or operation. In fact, it's only the bottom
bit that's ever non-zero. As a result it's spending a number of cycles
and registers doing useless things. I'll copy a report to the locations
you mention in your e-mail.

Note that this may partially be covered by report 145284 (which I cannot find, maybe Eric has closed/removed it)

I already filed a patch for that in
that covers your issue to some extent or maybe almost complete: The new pattern "*ior<MODE>2_<MODE>bit0" would match some parts of your code.

There are probably ways to work around this, such as making "packet" a
union of an unsigned char and a long, then shifting the long and only
ORing in the unsigned char. I'll note that there's also an optimization
to be had with the right hand side of the expression. I would write the
assembly something like this:

>    lsl r18
>    rol r19
>    rol r20
>    rol r21
>    in r24,38-0x20
>    bst r24, 1
>    bld r18, 0

The result of the above patch should lead to something like
        lsl r18
        rol r19
        rol r20
        rol r21
        in r24,38-0x20
        bst r24, 1
        sbrs r18, 0
        bld r18, 0
The SBRS is necessary, because the pattern is not aware of the fact that r18.0 is 0. Maybe the optimization is even better (or waeker); I am not using that patch at the moment and can just estimate its effect when peeking into rtl dumps of an unpatched gcc.

Concerning the patch itself, I don't know anything about its fate and if it will ever make its way into gcc because of administrative obstacles and the technique I used.

I don't like the technique I am using because it leads to complex patterns that are hard to understand an test and will become useless if the middleend decides to represent the stuff in a slightly different way...


reply via email to

[Prev in Thread] Current Thread [Next in Thread]