[Simulavr-devel] [sr #106678] simulavrxx - performance investigation

From: Petr Hluzin
Subject: [Simulavr-devel] [sr #106678] simulavrxx - performance investigation
Date: Sun, 04 Dec 2011 21:38:57 +0000
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0

Follow-up Comment #1, sr #106678 (project simulavr):

Until today's fix, about 54% of the run time was spent in
std::multimap::insert() and erase(), most of it in memory allocation, called
from SystemClock::Step(). Removing the erase() and insert() calls raises the
speed from 3.28 MIPS to 9.1 MIPS. On Linux the speed rises from 4.80 MIPS to
7.8 MIPS. (Of course we cannot simply remove those calls; they are needed for
certain peripherals and for multicore simulation. Instead I rewrote the code
to use a min-heap.)
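
To illustrate the idea for readers of the archive, here is a minimal sketch of
a heap-based scheduler. It is not the code from the commit: Device, Scheduler,
Entry and Later are hypothetical names, and the real SystemClock interface may
differ. It only shows why a binary heap stored in a std::vector avoids the
per-node allocation that a std::multimap insert()/erase() pair pays on every
step.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for a simulated member (core, timer, ...).
struct Device {
    std::uint64_t period;  // time until the device wants to run again
    unsigned steps;        // how many times Step() has been called
    explicit Device(std::uint64_t p) : period(p), steps(0) {}
    std::uint64_t Step() { ++steps; return period; }
};

// One scheduled wake-up: when to run, and which device.
struct Entry {
    std::uint64_t when;
    Device* dev;
};

// Orders the priority_queue so that the earliest wake-up is on top.
struct Later {
    bool operator()(const Entry& a, const Entry& b) const {
        return a.when > b.when;
    }
};

class Scheduler {
public:
    Scheduler() : now_(0) {}

    void Add(Device* d, std::uint64_t when) {
        Entry e = {when, d};
        heap_.push(e);
    }

    // Runs the device with the earliest wake-up time and reschedules it.
    // The heap lives in one std::vector, so once the vector has grown to
    // its working size, pop() and push() allocate nothing.
    bool Step() {
        if (heap_.empty()) return false;
        Entry e = heap_.top();
        heap_.pop();
        now_ = e.when;
        e.when = now_ + e.dev->Step();
        heap_.push(e);
        return true;
    }

    std::uint64_t Now() const { return now_; }

private:
    std::priority_queue<Entry, std::vector<Entry>, Later> heap_;
    std::uint64_t now_;
};
```

With only one or two devices the heap is tiny, so each Step() costs a few
comparisons and swaps inside a contiguous array.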

Commit [2] fixed that, so the current numbers are 8.2 MIPS (122 ns/cycle) on
Windows and 9.4 MIPS on Linux. The measurement conditions are described below.

I did not test the C simulavr.

For comparison: the simulator "avrtest" [1] does 194.7 MIPS when compiled with
"-O3 -fomit-frame-pointer" and 84.8 MIPS when compiled with -O2. Note that
avrtest is intended to simulate instructions only, not peripherals or
interrupts. Still, we have a lot of catching up to do: it is ~10 times faster
than us.

All tests were done on my regresstimertestdelay.c, which calls avr-libc's
_delay_loop_2(65535), i.e. sbiw, brne, nop, nop. Both simulavr and avrtest
count the loop as 4 cycles. It runs with interrupts disabled, does not access
RAM, and does not stress the cache with Flash accesses. The number of calls to
_delay_loop_2 is set so that the test lasts several seconds. The first launch
is used to warm the caches and its results are discarded. Default
optimization levels are used (-O2 for gcc), with no profile-guided
optimization (that would require more effort). I measure "user" time on Linux
and "CPU time" on Windows. Linux builds are done on a machine named "u-pl12"
using gcc-4.5.3 on an Intel Core i7 2.67GHz with 8 MiB cache; Windows builds
use MSVS 2010 on an Intel Core 2 2.5GHz with maybe 2 MiB cache. Unless
otherwise noted there are no other running processes.
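
For anyone who wants to reproduce numbers like these, here is a rough sketch
of the arithmetic. It is not the harness I used: Mips, NsPerCycle and
TimedMips are hypothetical helpers, and the sketch assumes the MIPS figures
count simulated cycles per second of host CPU time (which is what the
"122 ns/cycle" figure above suggests). Note that std::clock() reports
processor time on POSIX systems, while the MSVC runtime reports elapsed wall
time, so on Windows this only approximates "CPU time".

```cpp
#include <cstdint>
#include <ctime>

// Millions of simulated cycles per second of host CPU time.
double Mips(std::uint64_t cycles, double cpu_seconds) {
    return static_cast<double>(cycles) / cpu_seconds / 1e6;
}

// Host time spent per simulated cycle, in nanoseconds.
double NsPerCycle(double mips) {
    return 1000.0 / mips;
}

// Calls run() twice: the first call only warms the caches and its result
// is discarded, the second call is timed. run() must return the number of
// simulated cycles it executed and should last long enough (seconds) for
// the clock resolution not to matter.
template <typename F>
double TimedMips(F run) {
    run();  // warm-up, result discarded
    std::clock_t t0 = std::clock();
    std::uint64_t cycles = run();
    std::clock_t t1 = std::clock();
    double seconds = static_cast<double>(t1 - t0) / CLOCKS_PER_SEC;
    return Mips(cycles, seconds);
}
```

As a sanity check, NsPerCycle(8.2) gives about 122, matching the Windows
figure quoted above.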

I did not use the proposed Dhrystone benchmark because it appears to have a
complicated setup and appears to test the quality of the C compiler's
optimizer rather than the simulator. For us any access to RAM (static,
Y-relative) is equally fast.


