Re: [avr-gcc-list] fixed-point: code size, speed and precision
From: Georg-Johann Lay
Subject: Re: [avr-gcc-list] fixed-point: code size, speed and precision
Date: Mon, 20 Aug 2012 20:05:25 +0200
User-agent: Thunderbird 2.0.0.24 (Windows/20100228)
Erik Christiansen wrote:
On 15.08.12 11:28, Georg-Johann Lay wrote:
Some days ago I tried to adapt Sean's fixed-point patch [1] and
the abandoned attempt [2] to avr-gcc 4.8. A current version
can be seen in [3], but there are still several issues.
Many thanks, Johann, for the great work that you are doing. (And for
giving the users some involvement.)
My thoughts stimulated by your questions are just a peripheral opinion
- there are likely to be others making more immediate use of fixed point
arithmetic in C.
...
Approach 1) leads to the exact result so that the algorithms can be
sure to round the /exact/ result to get the rounded result (as
required by TR18037 for instance).
Approach 2) is faster but does not offer control over rounding
errors and cannot be used if a saturated result is needed.
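To make "round the exact result" concrete, here is a minimal sketch in C for unsigned 16.16 operands. It only illustrates the idea behind approach 1) and is not code from the patch; the function name `mul_round_sat` is made up, and saturating to the largest representable value is an assumption about what a saturated result means here.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: multiply two unsigned
   16.16 values, round the exact product to nearest and saturate. */
uint32_t mul_round_sat(uint32_t a, uint32_t b)
{
    /* Exact 32.32 product: no bits are lost at this point. */
    uint64_t exact = (uint64_t)a * b;

    /* Round to nearest (ties upwards), then drop the 16 extra
       fraction bits. The sum cannot overflow 64 bits. */
    uint64_t rounded = (exact + 0x8000u) >> 16;

    /* Saturation needs the bits above the 32-bit result, which is
       why the whole product has to be available. */
    return rounded > UINT32_MAX ? UINT32_MAX : (uint32_t)rounded;
}
```

For instance, 1.5 * 2.0 (0x00018000 * 0x00020000) gives 0x00030000, while 256.0 * 256.0 does not fit into 16.16 and is clamped to 0xFFFFFFFF.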
While list traffic seems to mostly prefer small code to super fast code
(and I concur), precise code takes precedence over all other
considerations, I feel, especially if the code size saving is less than
a factor of 20.
...
What is the best approach here?
Code that is slower and might consume more flash and stack but
complies with TR18037? Or fast code whose rounding behavior
is not within the 2 LSBs required by TR18037?
I'm having trouble finding value in an incorrect result generated
rapidly.
Hi Erik, thanks for your annotations.
Correctness is just a matter of the epsilon that you allow for
results to deviate from the exact computation. So it's rather
about fuzziness than about correctness.
A code that uses both signed and unsigned versions will
consume ~200 bytes for the multiplications alone.
Sign extension can be performed in three different ways:
1) Explicit before the computation
2) Implicit during the computation
3) Explicit after the computation
[3] currently uses 2), but the signed routine could instead reuse the
unsigned version by means of 3), which takes 22 bytes, like so:
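DEFUN __mulsa3
    ;; Multiply as if both operands were unsigned
    XCALL __mulusa3
    ;; B < 0: subtract the low word of A from the high word of the result
    tst  B3
    brpl 1f
    sub  C2, A0
    sbc  C3, A1
1:  ;; A < 0: subtract the low word of B from the high word of the result
    sbrs A3, 7
    ret
    sub  C2, B0
    sbc  C3, B1
    ret
ENDF __mulsa3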
Thus, if both the signed and the unsigned versions are needed,
the code size will go down by more than 80 bytes.
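Why the fix-up after the unsigned multiplication works can be sketched in C. Reading a negative 32-bit value as unsigned adds 2^32 to it, so the unsigned product is too big by the other operand times 2^32; after dropping 16 fraction bits that is the other operand shifted left by 16, modulo 2^32. The names `mulusa` and `mulsa` are stand-ins invented for this sketch, and `mulusa` assumes the unsigned routine returns the rounded product modulo 2^32.

```c
#include <stdint.h>

/* Hypothetical C stand-in for the unsigned routine: 16.16 multiply,
   rounded to nearest, result taken modulo 2^32. */
static uint32_t mulusa(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b + 0x8000u) >> 16);
}

/* Signed 16.16 multiply by means of variant 3): unsigned multiply
   first, explicit sign correction afterwards. */
int32_t mulsa(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t r = mulusa(ua, ub);

    /* b < 0: ub = b + 2^32, so the product has ua * 2^32 too much.
       Only the low word of ua reaches the 32-bit result. */
    if (b < 0)
        r -= ua << 16;

    /* a < 0: same correction with the roles swapped. */
    if (a < 0)
        r -= ub << 16;

    /* Assumes the usual two's complement conversion. */
    return (int32_t)r;
}
```

With these definitions, -1.0 * 1.5 (`mulsa(-0x10000, 0x18000)`) yields -0x18000, and -0.5 * -0.5 yields 0x4000. The two conditional subtractions correspond to the two sub/sbc pairs in the assembly.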
Less than 120 bytes for both is great.
If only the signed version is used, code size goes up by 20 bytes.
But still only to 120 bytes, AIUI.
What's the best here?
A lean unsigned, and equal size for signed and signed_plus_unsigned,
looks like a good compromise from here, given that it's half of the
worst case.
Ok, I decided to give the original code a face lift. The signed
version is like above and the unsigned code works with an error
of <= 0.5 LSBs which is better than the original 3 LSBs.
Moreover, it does not need an extra zero register.
FYI, the code is now
;;; (C3:C0) = (A3:A0) * (B3:B0)
;;; Clobbers: __tmp_reg__
;;; Rounding: -0.5 LSB < error <= 0.5 LSB
DEFUN __mulusa3
    ;; Some of the MUL instructions have LSBs outside the result.
    ;; Don't ignore these LSBs in order to tame rounding error.
    ;; Use C2/C3 for these LSBs.
    clr C0
    clr C1
    mul A0, B0 $ movw C2, r0
    mul A1, B0 $ add C3, r0 $ adc C0, r1
    mul A0, B1 $ add C3, r0 $ adc C0, r1 $ rol C1
    ;; Round
    sbrc C3, 7
    adiw C0, 1
    ;; The following MULs don't have LSBs outside the result.
    ;; C2/C3 is the high part.
    mul A0, B2 $ add C0, r0 $ adc C1, r1 $ sbc C2, C2
    mul A1, B1 $ add C0, r0 $ adc C1, r1 $ sbci C2, 0
    mul A2, B0 $ add C0, r0 $ adc C1, r1 $ sbci C2, 0
    neg C2
    mul A0, B3 $ add C1, r0 $ adc C2, r1 $ sbc C3, C3
    mul A1, B2 $ add C1, r0 $ adc C2, r1 $ sbci C3, 0
    mul A2, B1 $ add C1, r0 $ adc C2, r1 $ sbci C3, 0
    mul A3, B0 $ add C1, r0 $ adc C2, r1 $ sbci C3, 0
    neg C3
    mul A1, B3 $ add C2, r0 $ adc C3, r1
    mul A2, B2 $ add C2, r0 $ adc C3, r1
    mul A3, B1 $ add C2, r0 $ adc C3, r1
    mul A2, B3 $ add C3, r0
    mul A3, B2 $ add C3, r0
    clr __zero_reg__
    ret
ENDF __mulusa3
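For readers who prefer C, the structure of the routine can be modelled as below. This is a sketch of my reading of the assembly, not a translation checked against the hardware: the byte products with weight below 2^16 are summed separately, their bit 15 decides the rounding, and everything else is accumulated modulo 2^32. The name `mulusa_bytes` is invented for the example.

```c
#include <stdint.h>

/* Hypothetical byte-wise model of an unsigned 16.16 multiply. The
   partial product a[i] * b[j] has weight 2^(8*(i+j)); the result
   consists of bits 16..47 of the full product. */
uint32_t mulusa_bytes(uint32_t a, uint32_t b)
{
    uint32_t low = 0;   /* products that reach below bit 16 */
    uint32_t res = 0;   /* result, accumulated modulo 2^32  */

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            uint32_t p = ((a >> (8 * i)) & 0xFF) * ((b >> (8 * j)) & 0xFF);
            int w = i + j;                  /* weight in bytes */

            if (w < 2)
                low += p << (8 * w);        /* has LSBs outside the result */
            else if (w < 6)
                res += p << (8 * (w - 2));  /* bits above 47 fall off */
            /* w == 6 (a[3] * b[3]) lies entirely above the result. */
        }
    }

    res += low >> 16;          /* carry from the low part into the result */
    res += (low >> 15) & 1;    /* round: add 1 if the dropped MSB is set */
    return res;
}
```

Because the low part is kept exactly, the model should agree with `(uint32_t)(((uint64_t)a * b + 0x8000) >> 16)` for all inputs, which is where the -0.5 LSB < error <= 0.5 LSB bound comes from.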
If it matters whether signed multiply takes 120 bytes rather than 100,
then it was time to move up to the next flash size several months ago.
At Siemens we were never allowed to go into production using more than
80% (IIRC) of ROM, and at NEC I liked to follow the same practice.
That way, if there was a software upgrade, there was room for it.
(Anything more is a new product. Even the managers understood that.)
Ya, you are right. The tools cannot work around inappropriate hardware
selection.
Johann
Thank you Johann, for asking the users.
Erik