[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [avr-gcc-list] Re: C vs. assembly performance

From: Nicholas Vinen
Subject: Re: [avr-gcc-list] Re: C vs. assembly performance
Date: Sun, 01 Mar 2009 00:43:22 +1100
User-agent: Thunderbird (X11/20090103)

Georg-Johann Lay wrote:
If you are sure it is really some performance issue/regression and not due to some language standard implication, you can add a report to

so that the subject won't be forgotten. Also mind

And of course, you can ask questions here. In that case it is helpful if you can manage to simplify the source to a small piece of code that triggers the problem and allows others to reproduce the problem. (i.e. no #include in the code, no ... (except for varargs), a.s.o).

Snippets of .s may point to the problem when you add -dp -fverbose-asm

And there are lots of places where avr-gcc produces suboptimal or even bad code, so feedback is welcome.

But note that just a few guys are working on the AVR part of gcc.
I would do more if I had the time (and the support of some gurus to ask questions on internals then and when...)
OK, I only spent a few minutes looking at old code and I found some obviously sub-optimal results. It distills down to this:

#include <avr/io.h>

int main(void) {
  unsigned long packet = 0;

  while(1) {
    if( !(PINC & _BV(PC2)) ) {
      packet = (packet<<1)|(((unsigned char)PINC>>1)&1);
    PORTB = packet;

avr-gcc is: avr-gcc (Gentoo 4.3.3 p1.0, pie-10.1.5) 4.3.3

The avr/io stuff is just so that it won't optimise the code away to nothing.

I tried compiling it with both -Os and -O2:

avr-gcc -g -dp -fverbose-asm -Os -S -mmcu=atmega48 -o test test.c

The result includes this:

	lsl r18	 ;  packet	 ;  50	*ashlsi3_const/2	[length = 4]
	rol r19	 ;  packet
	rol r20	 ;  packet
	rol r21	 ;  packet
	in r24,38-0x20	 ;  D.1214,	 ;  16	*movqi/4	[length = 1]
	lsr r24	 ;  D.1214	 ;  17	lshrqi3/3	[length = 1]
	ldi r25,lo8(0)	 ; ,	 ;  48	*movqi/2	[length = 1]
	ldi r26,lo8(0)	 ; ,	 ;  46	*movhi/4	[length = 2]
	ldi r27,hi8(0)	 ; ,
	andi r24,lo8(1)	 ;  tmp52,	 ;  19	andsi3/2	[length = 4]
	andi r25,hi8(1)	 ;  tmp52,
	andi r26,hlo8(1)	 ;  tmp52,
	andi r27,hhi8(1)	 ;  tmp52,
	or r18,r24	 ;  packet, tmp52	 ;  20	iorsi3/1	[length = 4]
	or r19,r25	 ;  packet, tmp52
	or r20,r26	 ;  packet, tmp52
	or r21,r27	 ;  packet, tmp52

The problem, it seems, is that the compiler doesn't realize that the right hand side of the _expression_ can only have any non-zero values in the bottom 8 bits, since it's an unsigned char which is being implicitly expanded to 32 bits for the or operation. In fact, it's only the bottom bit that's ever non-zero. As a result it's spending a number of cycles and registers doing useless things. I'll copy a report to the locations you mention in your e-mail.

There are probably ways to work around this, such as making "packet" a union of an unsigned char and a long, then shifting the long and only ORing in the unsigned char. I'll note that there's also an optimization to be had with the right hand side of the _expression_. I would write the assembly something like this:

	lsl r18
	rol r19
	rol r20
	rol r21
	in r24,38-0x20
	bst r24, 1
	bld r18, 0

I'm sure I can find other examples of poor code generation in this particular file, since I remember coming across many cases where I replaced the generated code with inline assembly when I was originally working on it, but that will have to wait for later.

Thanks for your help, I appreciate it. As I said avr-gcc is pretty good, but I would love it if it could get even better :)


reply via email to

[Prev in Thread] Current Thread [Next in Thread]