[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[avr-gcc-list] Speeding [u]ltoa by 35%
From: |
Georg-Johann Lay |
Subject: |
[avr-gcc-list] Speeding [u]ltoa by 35% |
Date: |
Tue, 19 Jun 2012 09:00:12 +0200 |
User-agent: |
Thunderbird 2.0.0.24 (Windows/20100228) |
Reviewing the ltoa sources [0] shows that the implementation's
speed is not award winning: it uses 32-bit div+mod.
There was propose [1] to speed ltoa by means of LUT but that
approach tries to use lookup tables for special radices.
The new proposal inlines __divmodsi4 leading to an ~35% tick gain
without increasing code size:
#define VAL 22
#define STR 20
#define RADIX 18
#define REST 26
#define CNT 19
DEFUN my_ltoa
movw r30, STR
mov r27, STR+1
mov TMP, STR
cpi RADIX, 36+1
cpc RADIX+1, ZERO
brsh 99f
cpi RADIX, 2
brlo 99f
clt
cpi RADIX, 10
brne 1f
bst VAL+3, 7
brtc 1f
com VAL+3
com VAL+2
com VAL+1
neg VAL
sbci VAL+1, -1
sbci VAL+2, -1
sbci VAL+3, -1
1: clr REST
ldi CNT, 32
2: lsl VAL
rol VAL+1
rol VAL+2
rol VAL+3
rol REST
cp REST, RADIX
brlo 3f
sub REST, RADIX
inc VAL
3: dec CNT
brne 2b
subi REST, -'0'
cpi REST, '9'+1
brlo 4f
subi REST, '0'-'a'+10
4: st Z+, REST
sbiw VAL+2, 0
sbci VAL+1, 0
sbci VAL, 0
brne 1b
brtc 99f
ldi CNT, '-'
st Z+, CNT
99:
mov r24, TMP
mov r25, r27
st Z+, ZERO
jmp strrev
ENDF my_ltoa
Some notes on the code:
* The code is not yet documented or ready to be dropped into
avr-libc sources (coding style, parametrize hardware)
* Speed gain 34%..39% see [1] and [2] for detail analysis
* Code size is same as old implementation
* Won't drag in __divmodsi4, thus code size decrease of > 100
bytes if application otherwise does not need 32-bit div
* Less stack usage
* Strict radix checking: original version is not strict, e.g.
0x100 + 10 will slip through
* Code size can be decreased further by moving the sanity
check to C like so:
static inline __attribute__((always_inline))
char* ltoa (long x, char *str, int radix)
{
if (radix < 2 || radix > 36)
{
*str = '\0';
return str;
}
else
{
extern char* ltoa_asm (long, char*, unsigned char);
return ltoa_asm (x, str, radix);
}
}
Presumably almost all use cases use radix known at compile time.
Thus, the compiler can optimize away the sanity check. Moreover,
radix can be passed as 8-bit value: no need to pass 10 as 0x000a.
More notes:
* Similar improvement can be used for ultoa
* Code between [u]ltoa could be shared.
* There is ultoa_invert [4] that
- str[] will NOT 0 terminated.
- Sequence of digits is inverted.
- It returns pointer to first byte after a string.
All these is readily available in ltoa, too. Again, code could
be shared.
A similar method could be used to speed to utoa/itoa with a speedup
of > 20%. But it is harder to work against code size increase,
and without the help of C inlining the code increases by ~10-16 bytes.
Again, code increase is no issue if the application does not need
16-bit divmod, as current i/ltoa will then drag that code.
Johann
--
[0] AVR-Libc: ltoa.S
http://svn.savannah.nongnu.org/viewvc/trunk/avr-libc/libc/misc/ltoa.S?revision=1944&root=avr-libc&view=markup
[1] avr-libc: optimization for ltoa/printf
http://lists.gnu.org/archive/html/avr-libc-dev/2011-06/msg00000.html
[2] Plot:
http://sourceforge.net/apps/mediawiki/mobilechessboar/nfs/project/m/mo/mobilechessboar/4/48/Ltoa.png
[3] Plot Description
http://sourceforge.net/apps/mediawiki/mobilechessboar/index.php?title=File:Ltoa.png#Description
[4] AVR-Libc: ultoa_invert.S
http://svn.savannah.nongnu.org/viewvc/trunk/avr-libc/libc/stdio/ultoa_invert.S?revision=2191&root=avr-libc&view=markup
- [avr-gcc-list] Speeding [u]ltoa by 35%,
Georg-Johann Lay <=