Re: [avr-gcc-list] [bug] cbi optimization error for 8-bit AVRs
From: Georg-Johann Lay
Subject: Re: [avr-gcc-list] [bug] cbi optimization error for 8-bit AVRs
Date: Mon, 17 Nov 2014 18:17:42 +0100
User-agent: Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
On 11/09/2014 10:00 PM, Joern Rennecke wrote:
On 8 November 2014 00:32, Szikra István <address@hidden> wrote:
Hi everyone!
My problem in short: I'm getting
in r24, 0x18
ldi r25, 0x00
andi r24, 0xEF
out 0x18, r24
instead of
cbi 0x18, 4
.
I’m trying to write efficient modern C/C++ code for multiple platforms
including AVR 8 bit controllers.
Unfortunately, GCC support for AVR (among other things) is not always
flawless, and it changes from version to version (and not always for the
better).
Since I’m a long time AVR developer I have a lot of compiler versions
installed (WinAVR 2006-2010, and Atmel Studio 6.2 with GCC 4.8.1), but I
could test my code with only one (the latest).
I ran into some trouble with clearing port bits not translating from C into
cbi in assembler. It is caused by my bits template, but I do not know why.
It seems to me like a bug in GCC. Maybe someone here can shed some light on
the reason, or suggest a fix.
The transformation would seem prima facie best suited for the combine pass.
For the code in question, it's already emitted differently by the C and C++ front ends.
Consider
#define RR (*(unsigned char volatile*) (0x18 + __AVR_SFR_OFFSET__))
static unsigned bits (int n)
{
return 1u << n;
}
void fun (void)
{
RR &= ~bits(4);
}
as a test case which works for both C and C++. The C++ front end fails to apply
a type demotion similar to the C front end's, and no tree pass performs such a
demotion, either.
When we see the store, we know it's only 8 bits wide, thus we can perform the
arithmetic in 8 bits. However, the store is to a volatile (actually, I/O)
address.
Now, gcc is not very good at optimizing code involving volatile (the
target-independent code would likely already have simplified the arithmetic if
these weren't in the way - but you'd really need a completely different test
case without volatile to keep the computation relevant).
These memory / I/O accesses require extra ordering constraints, so
large parts of the optimizers just punt when they encounter a volatile
reference.
There is some scope to tweak this - I've attached a proof-of-concept
patch to optimize
Is there a solution that is more reliable than combine? Currently combine
appears to be the best candidate. However, there are many constraints that must
be satisfied for such a non-matching split to apply. Optimization must be turned
on (it's not only an optimization issue, because CBI/SBI are atomic whereas
IN/OP/OUT sequences are not), combine must have a scratch handy, the single set
must have been cooked up by combine and from 3 insns, etc. And a matching
split is not possible for obvious reasons.
It actually boils down to supplying combine with a split point it fails to find
itself -- for whatever reason...
your code (based on gcc 5.0). However, this opens the possibility
that it'll break something with respect to volatile ordering now - or
even later down the line.
ISTR we have some more detailed dependency checks at least in some places. In
fact, if there weren't, we should already see the existing cbi pattern
misbehaving.
Consider:
typedef unsigned char uint8_t;
int main (void)
{
  int i = *(volatile uint8_t *)(0x18 + 0x20);
  *(volatile uint8_t *)(0x17 + 0x20) = 0x0f;
  *(volatile uint8_t *)(0x18 + 0x20) = i & ~4;
}
If the compiler were oblivious to the intervening write, it could
generate a cbi here -
but it doesn't - well, at least not at -O2 ...
The combiner must not come up with a CBI-like pattern for this; it would mean
combine drags one volatile access over another, and that would be incorrect for
the same reasons why PR51374 was a combine bug.
Another possible approach would be to use a peephole2 pattern or a
target-specific
peephole2 is even less reliable than combine here :-( because it matches on
code layout rather than data flow. A single (move) insn that has nothing to do
with the operation gets in between, and peep2 fails... And in the test case
there are also moves involved, so it's almost impossible to recover from that.
optimization pass to do the 16->8 bit arithmetic transformation.
However, if you work after combine (as is definitely the case with peephole2),
you need to
diff --git a/gcc/config/avr/avr.md b/gcc/config/avr/avr.md
index cc85b79..6e7e353 100644
--- a/gcc/config/avr/avr.md
+++ b/gcc/config/avr/avr.md
@@ -679,8 +679,8 @@ (define_expand "mov<mode>"
;; "movqi_insn"
;; "movqq_insn" "movuqq_insn"
(define_insn "mov<mode>_insn"
-  [(set (match_operand:ALL1 0 "nonimmediate_operand" "=r ,d ,Qm ,r ,q,r,*r")
-        (match_operand:ALL1 1 "nox_general_operand"   "r Y00,n Ynn,r Y00,Qm,r,q,i"))]
+  [(set (match_operand:ALL1 0 "movdst_operand"       "=r ,d ,Qm ,r ,q,r,*r")
+        (match_operand:ALL1 1 "nox_general_operand"   "r Y00,n Ynn,r Y00,Qm,r,q,i"))]
"(register_operand (operands[0], <MODE>mode)
|| reg_or_0_operand (operands[1], <MODE>mode)) &&
/* skip if operands are out of lds/sts memory access range(0x40..0xbf)
@@ -3086,6 +3086,20 @@ (define_peephole2 ; andi
operands[1] = GEN_INT (INTVAL (operands[1]) & INTVAL (operands[2]));
})
+; standard promotion can get us a 16 bit int where only an 8 bit value
+; is required. combine-split extend-binop-store to binop-store.
+
+(define_split
+ [(set (match_operand:QI 0 "any_qi_mem_operand")
+ (match_operator 3 "avr_binary_operator"
+ [(match_operand 1 "nonmemory_operand")
+ (match_operand 2 "nonmemory_operand")]))
+ (clobber (match_operand:QI 4 "register_operand"))]
This would work for all 8-bit modes, something like
(define_code_iterator some_binop [xor ior and plus minus mult ashift])
(define_split
[(set (mem:ALL1 (match_operand:ALL1 0 "any_qi_mem_operand" ""))
(some_binop:ALL1 (match_operand:ALL1 1 "nonmemory_operand" "")
(match_operand:ALL1 2 "nonmemory_operand" "")))
(clobber (match_operand:ALL1 3 "register_operand"))]
""
[(set (match_dup 3)
(some_binop:ALL1 (match_dup 1)
(match_dup 2)))
(set (match_dup 0)
(match_dup 3))]
and then something like 1 == GET_MODE_SIZE (mode) in the predicate.
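Putting those two suggestions together, the patch's predicate could be generalized along these lines (an untested sketch; the name any_all1_mem_operand is made up here, and the GET_MODE_SIZE test is the one proposed above):

```
;; Accept any 1-byte memory operand, including volatile / I/O,
;; regardless of which ALL1 mode it has.
(define_predicate "any_all1_mem_operand"
  (and (match_code "mem")
       (match_test "address_operand (XEXP (op, 0), Pmode)")
       (match_test "1 == GET_MODE_SIZE (mode)")))
```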
+ ""
+ [(set (match_dup 4) (match_dup 3))
+ (set (match_dup 0) (match_dup 4))]
+  "")
+
;;|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
;; ior
diff --git a/gcc/config/avr/predicates.md b/gcc/config/avr/predicates.md
index 4b456a5..ea8e095 100644
--- a/gcc/config/avr/predicates.md
+++ b/gcc/config/avr/predicates.md
@@ -69,6 +69,24 @@ (define_predicate "nop_general_operand"
(and (match_operand 0 "general_operand")
(match_test "!avr_mem_flash_p (op)")))
+; allow a QImode memory operand, including volatile / I/O.
+(define_predicate "any_qi_mem_operand"
+ (and (match_code "mem")
+ (match_test "address_operand (XEXP (op, 0), Pmode)")
+ (match_test "mode == QImode")))
+
+;; We also allow volatile memory here, so that the combiner can work with
+;; stores for I/O accesses as combiner-splitter results.
+;; As this is for a destination location, that should generally be safe
+;; for combine, as we should be handling the last insn then, which stays
+;; in place.
+;; ??? Could this ever match for a two-destination combination?
+;; ??? Should we make sure this predicate can't match I/O for transformations
+;; with a wider scope, like store sinking?
Unfortunately, there is no easy way to query GCC for the current pass,
e.g. something like combine_in_progress or combine_completed...
Might it be the case that some pass changes the access width or splits one HI
access into 2*QI, e.g. subreg lowering? Using a different predicate (or
otherwise hacking around volatile_ok) is inevitable to make this solution work
with combine, hence it would be highly appreciated to forge the new predicate
in such a way that it is the same as the original one except for the insn combiner.
+(define_predicate "movdst_operand"
+ (ior (match_operand 0 "nonimmediate_operand")
+ (match_operand 0 "any_qi_mem_operand")))
+
;; Return 1 if OP is an "ordinary" general operand, i.e. a general
;; operand whose load is not handled by a libgcc call or ELPM.
(define_predicate "nox_general_operand"
@@ -180,6 +198,10 @@ (define_predicate "simple_comparison_operator"
(and (match_operand 0 "comparison_operator")
(not (match_code "gt,gtu,le,leu"))))
+; A binary operator that the avr can perform in a single insn.
+(define_predicate "avr_binary_operator"
+ (match_code "and,ior,xor,plus,minus"))
+
;; Return true if OP is a valid call operand.
(define_predicate "call_insn_operand"
(and (match_code "mem")