[avr-gcc-list] Optimization Hiccup? (Please CC me, I'm not subscribed)

From:	Thomas Watson
Subject:	[avr-gcc-list] Optimization Hiccup? (Please CC me, I'm not subscribed)
Date:	Sat, 18 Oct 2014 12:08:23 -0500

GCC is generating substantially less optimized code than it does if I help it 
along a bit. The code is at http://pastie.org/private/awus9tkgdwbzpdwjgrbw and 
the assembler output is at http://pastie.org/private/s4liesmrd9f6fi2wahe0vg . 
In the assembler output, the top block is the version that copies x to a local 
(cx) and the bottom block is the version that modifies the argument directly. 
The full compiler invocation is:

avr-gcc -c -I. -mmcu=atmega328p -std=gnu99 -Os -Wall -DF_CPU=16000000 
-ffunction-sections -fdata-sections -Wl,--gc-sections -o tft.o tft.c

I thought that copying x to a local might be wasting a bit of space. However, 
if we modify x directly rather than copying it to a local before modification, 
the compiler decides to store x on the stack instead of in a register, which 
takes us on a journey involving unnecessary stack access, silly re-copying, 
and far too much code. 
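
For readers who cannot reach the pastie links, the two variants being compared 
probably look roughly like the sketch below. This is a reconstruction from the 
description in this message, not the original tft.c: the function names 
draw_str_local and draw_str_arg, the parameter types, the FONT_WIDTH value, 
and the recording stub for tft_draw_chr are all assumptions made so the sketch 
compiles on its own.

```c
#include <stdint.h>

/* Assumed values and signatures: the real tft.c is not shown in this post. */
#define FONT_WIDTH 6
#define LOG_SIZE 16

/* Stand-in for the real tft_draw_chr: it only records its arguments. */
static uint8_t log_x[LOG_SIZE];
static char log_c[LOG_SIZE];
static uint8_t log_len;

static void tft_draw_chr(uint8_t x, uint8_t y, char c)
{
    (void)y;
    if (log_len < LOG_SIZE) {
        log_x[log_len] = x;
        log_c[log_len] = c;
        log_len++;
    }
}

/* Variant 1 (top block): copy the argument to a local, cx, and advance cx. */
void draw_str_local(uint8_t x, uint8_t y, const char *s)
{
    uint8_t cx = x;
    while (*s) {
        tft_draw_chr(cx, y, *s++);
        cx += FONT_WIDTH;
    }
}

/* Variant 2 (bottom block): advance the argument x itself. */
void draw_str_arg(uint8_t x, uint8_t y, const char *s)
{
    while (*s) {
        tft_draw_chr(x, y, *s++);
        x += FONT_WIDTH;
    }
}
```

Both functions make the same sequence of calls to tft_draw_chr, which is why 
one would expect the compiler to emit the same code for them.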

I figured that if I didn't copy it to a local, I would save code space (like I 
do in many other situations) but something is going wrong here. With cx, a 
callee-saved register (r17) is reserved for it (line 33) and x is copied to cx 
(56). When we call tft_draw_chr, it expects the 'x' parameter in r24, so we 
copy it there (68) before the call. Since r24 might be eaten by tft_draw_chr, 
we can't keep x in it across the call, which is why r17 is needed. Anyway, 
once we return, we add FONT_WIDTH to r17 (73) in preparation for the next 
iteration of the loop. In addition, since we can use Y for the string pointer, 
we do not need to worry about it being eaten by tft_draw_chr and it is only 
pushed and popped in the prologue and epilogue. All well and dandy, right? In 
theory, since x is never touched before or after cx is assigned, it is 
essentially an alias. I would therefore expect exactly the same code (or 
perhaps more optimized if the architecture and calling convention supported it) 
to be generated.

However, such is not the case. x is passed into the function in r24, but we 
want to modify it and have it persist through the loop. Because r24 could be 
mangled by a function call, we obviously must move it elsewhere. 
Unfortunately, the compiler decides on r25 (170), a register subject to the 
same limitation. As before, we must move our temporary register to r24 (186) in 
order to call tft_draw_chr, according to calling convention. However, since r25 
could be mangled by the call, we have to save it (187) before the function 
call. The compiler chooses the stack, as opposed to a callee-saved register, 
which has rather broad implications. First, we must reserve stack space (159) 
and copy the stack pointer to Y (162), chosen presumably because Y is also 
callee-saved. But since we used Y as the string pointer before, we must now 
store the string pointer elsewhere, and r8/9 are chosen. Since those are 
callee-saved registers, we must perform an additional two pushes and pops to 
save them at the beginning and end of the function. We also have to move the 
string pointer there (174).

Okay. So we've returned from tft_draw_chr (192). We must pull x off the stack 
into r25 and add FONT_WIDTH to it (193) in preparation for the next iteration. 
We could have just as easily not used r25 and continued to use r24, using the 
stack to save it as before (but there is a better way). We know r24 won't be 
touched until we call tft_draw_chr. Now that that's over, we have to fetch the 
next character in the string, but because Y isn't our string pointer, this 
doesn't go smoothly. We can't load data if the address isn't in X, Y, or Z. 
Since r8/9 is none of those, we have to copy it to Z (198) to retrieve the next 
character and do a post-increment on Z to index the next character (199). Since 
Z isn't callee-saved, it might be mangled by a function call, so we must store 
it back to r8/9 (200). Finally, we can test for the next iteration.

I'm not sure why the second version doesn't end up the same as the first. 
Choosing another caller-saved register as the temporary is an extremely poor 
choice. If for some reason that was mandatory, we could (at least in this 
code) still use r24 to avoid having to copy between it and r25. However, 
instead of using a callee-saved register like r9 to hold the temporary, the 
compiler uses one that isn't callee-saved, which means we still end up using 
r9 (and r8 too!) in our quest to needlessly use the stack.

This is probably way too verbose but there must be some useful information in 
there somewhere. Please take a look. Also, CC me on any replies because I'm not 
subscribed to the list yet.

Thank you all,
Thomas
