[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[avr-gcc-list] fixed-point: code size, speed and precision
From: |
Georg-Johann Lay |
Subject: |
[avr-gcc-list] fixed-point: code size, speed and precision |
Date: |
Wed, 15 Aug 2012 11:28:07 +0200 |
User-agent: |
Thunderbird 2.0.0.24 (Windows/20100228) |
Some day ago I tried to adapt Sean's fixed point patch [1] resp.
the abandoned attempt [2] to avr-gcc 4.8. A current version thereof
can be seen in [3], but there are still several issues.
One issue is the precision of multiplication.
A round to nearest would come up with a rounding error of up to
0.5 LSBs compared to the correct result.
A round up / round down / round to zero would come up with an
error less than 1 LSB.
ISO/IEC TR18037 allows a rounding error up to 2 LSBs.
The current implementation of [U]SAmode ([Un]signed Single Accum)
has an error of more than 2 LSBs.
The reason is that the least significant bytes are not taken into
account in the arithmetic and thus the respective carries are lost.
One example is 0xff.01ff * 0xff.01ff where the product if off by
0x2.fc LSBs which is very close to 3 LSBs.
This leads to a premature accumulation of computation errors and
poor conditioned arithmetic.
There are basically 2 ways to perform an operation:
1) Use all parts of the involved operands and adjust the
result to yield the desired format.
2) Only perform sub-computations whose result will overlap
the destination. Carries from LSBs will be lost.
Approach 1) leads to the exact result so that the algorithms can be
sure to round the /exact/ result to get the rounded result (as
required by TR18037 for instance).
Approach 2 is faster but does not offer control over rounding
errors and cannot be used if a saturated result is needed.
The above rounding problem can be fixed by a 64 = 32 * 32
widening multiplication, but the code would run slower and
need more flash.
The current implementation in lib1funcs-fixed.S:
__mulusa3: 104B
__mulsa3 : 108B
A widening multiply will need about 15 instructions more:
__umulsidi3: 74B
__umulhisi3: 30B
__muldi3_6: 18B
The latter two routines can be shared with other code:
__umulhisi3 is likely shared, __muldi3_6 unlikely.
If there will ever be an optimized version of the saturated
counterparts, then they will also use __umulsidi3 et al.
What is the best approach here?
Code that is slower, might consume more flash and stack but
complies to TR18037? Or fast code whose rounding behavior
if not withing the 2 LSBs as of TR18037?
A code that uses both signed and unsigned versions will
consume ~200 bytes for the multiplications alone.
Sign extension can be performed in three different ways:
1) Explicit before the computation
2) Implicit during the computation
3) Explicit after the computation
[3] currently uses 2) but could reuse the unsigned version and then
consumes 22 bytes by means of 3) like so:
DEFUN __mulsa3
XCALL __mulusa3
tst B3
brpl 1f
sub C2, A0
sbc C3, A1
1: sbrs A3, 7
ret
sub C2, B0
sbc C3, B1
ret
ENDF __mulsa3
Thus, if both the signed and the unsigned versions are needed,
the code size will go down by more than 80 bytes.
If only the signed version is used, code size goes up by 20 bytes.
What's the best here?
Johann
--
[1]
http://distribute.atmel.no/tools/opensource/avr-gcc/gcc-4.5.1/30-gcc-4.5.1-fixedpoint-3-4-2010.patch
[2]
http://lists.gnu.org/archive/html/avr-gcc-list/2012-07/msg00030.html
[3]
http://gcc.gnu.org/ml/gcc-patches/2012-08/msg00632.html
- [avr-gcc-list] fixed-point: code size, speed and precision,
Georg-Johann Lay <=