
Re: [avr-gcc-list] fixed-point: code size, speed and precision


From: Georg-Johann Lay
Subject: Re: [avr-gcc-list] fixed-point: code size, speed and precision
Date: Mon, 20 Aug 2012 20:05:25 +0200
User-agent: Thunderbird 2.0.0.24 (Windows/20100228)

Erik Christiansen schrieb:
> On 15.08.12 11:28, Georg-Johann Lay wrote:
>> Some days ago I tried to adapt Sean's fixed-point patch [1] and
>> the abandoned attempt [2], respectively, to avr-gcc 4.8.  A current
>> version thereof can be seen in [3], but there are still several issues.

> Many thanks, Johann, for the great work that you are doing. (And for
> giving the users some involvement.)
>
> My thoughts stimulated by your questions are just a peripheral opinion
> - there are likely to be others making more immediate use of fixed-point
> arithmetic in C.

...

>> Approach 1) leads to the exact result, so that the algorithms can
>> be sure to round the /exact/ result to get the rounded result (as
>> required by TR18037, for instance).
>>
>> Approach 2) is faster but does not offer control over rounding
>> errors and cannot be used if a saturated result is needed.

> While list traffic seems to mostly prefer small code to super-fast code
> (and I concur), precise code takes precedence over all other
> considerations, I feel, especially if the code size saving is less than
> a factor of 20.

...

>> What is the best approach here?
>>
>> Code that is slower, might consume more flash and stack, but
>> complies with TR18037?  Or fast code whose rounding behavior
>> is not within the 2 LSBs as per TR18037?

> I'm having trouble finding value in an incorrect result generated
> rapidly.

Hi Erik, thanks for your annotations.


Correctness is just a matter of the epsilon by which you allow
results to deviate from the exact computation.  So it's rather
about fuzziness than about correctness.

>> Code that uses both the signed and unsigned versions will
>> consume ~200 bytes for the multiplications alone.
>>
>> Sign extension can be performed in three different ways:
>>
>> 1) Explicitly, before the computation
>>
>> 2) Implicitly, during the computation
>>
>> 3) Explicitly, after the computation
>>
>> [3] currently uses 2), but it could reuse the unsigned version and
>> then consume just 22 bytes by means of 3), like so:

>> DEFUN __mulsa3
>>     XCALL   __mulusa3
>>     tst     B3
>>     brpl    1f
>>     sub     C2, A0
>>     sbc     C3, A1
>> 1:  sbrs    A3, 7
>>     ret
>>     sub     C2, B0
>>     sbc     C3, B1
>>     ret
>> ENDF __mulsa3

>> Thus, if both the signed and the unsigned versions are needed,
>> the code size will go down by more than 80 bytes.

> Less than 120 bytes for both is great.

>> If only the signed version is used, code size goes up by 20 bytes.

> But still only to 120 bytes, AIUI.

>> What's the best here?

> A lean unsigned, and equal size for signed and signed-plus-unsigned,
> looks like a good compromise from here, given that it's half of the
> worst case.

Ok, I decided to give the original code a facelift.  The signed
version is as above, and the unsigned code works with an error
of <= 0.5 LSB, which is better than the original 3 LSBs.
Moreover, it does not need an extra zero register.

FYI, the code is now

;;; (C3:C0) = (A3:A0) * (B3:B0)
;;; Clobbers: __tmp_reg__
;;; Rounding:  -0.5 LSB  <  error  <=  0.5 LSB
DEFUN   __mulusa3
    ;; Some of the MUL instructions have LSBs outside the result.
    ;; Don't ignore these LSBs in order to tame rounding error.
    ;; Use C2/C3 for these LSBs.

    clr C0
    clr C1
    mul A0, B0  $  movw C2, r0

    mul A1, B0  $  add  C3, r0  $  adc C0, r1
    mul A0, B1  $  add  C3, r0  $  adc C0, r1  $  rol C1

    ;; Round
    sbrc C3, 7
    adiw C0, 1

    ;; The following MULs don't have LSBs outside the result.
    ;; C2/C3 is the high part.

    mul  A0, B2  $  add C0, r0  $  adc C1, r1  $  sbc  C2, C2
    mul  A1, B1  $  add C0, r0  $  adc C1, r1  $  sbci C2, 0
    mul  A2, B0  $  add C0, r0  $  adc C1, r1  $  sbci C2, 0
    neg  C2

    mul  A0, B3  $  add C1, r0  $  adc C2, r1  $  sbc  C3, C3
    mul  A1, B2  $  add C1, r0  $  adc C2, r1  $  sbci C3, 0
    mul  A2, B1  $  add C1, r0  $  adc C2, r1  $  sbci C3, 0
    mul  A3, B0  $  add C1, r0  $  adc C2, r1  $  sbci C3, 0
    neg  C3

    mul  A1, B3  $  add C2, r0  $  adc C3, r1
    mul  A2, B2  $  add C2, r0  $  adc C3, r1
    mul  A3, B1  $  add C2, r0  $  adc C3, r1

    mul  A2, B3  $  add C3, r0
    mul  A3, B2  $  add C3, r0

    clr  __zero_reg__
    ret
ENDF __mulusa3


> If it matters whether signed multiply takes 120 bytes rather than 100,
> then it was time to move up to the next flash size several months ago.
> At Siemens we were never allowed to go into production using more than
> 80% (IIRC) of ROM, and at NEC I liked to follow the same practice.
> That way, if there was a software upgrade, there was room for it.
> (Anything more is a new product. Even the managers understood that.)

Yes, you are right.  The tools cannot work around an inappropriate
hardware selection.

Johann


> Thank you, Johann, for asking the users.
>
> Erik



