avr-gcc-list


From: Dave N6NZ
Subject: Re: AVR Benchmark Test Suite [was: RE: [avr-gcc-list] GCC-AVR Register optimisations]
Date: Sun, 13 Jan 2008 15:19:29 -0800
User-agent: Thunderbird 1.5 (X11/20051201)



Weddington, Eric wrote:

Hi John, Dave, others,

Here are some random thoughts about a benchmark test suite:

- GCC has a page on benchmarks:
<http://gcc.gnu.org/benchmarks/>
However all of those are geared towards larger processors and host
systems. There is a link to a benchmark that focuses on code size,
CSiBE, <http://www.inf.u-szeged.hu/csibe/>. Again, that benchmark is
geared towards larger processors.

This creates a need to have a benchmark that is geared towards 8-bit
microcontroller environments in general, and specifically for the AVR.

What would we like to test?

Code size for sure. Everyone always seems to be interested in code size.
There is an interest in seeing how the GCC compiler performs from one
version to the next, to see if optimizations have improved or if they
have regressed.

Which I would call regression tests, not "benchmarks", per se. Of performance regressions, I would guess that code size regressions under -Os are the #1 priority for the typical user. (A friend is currently tearing his hair out over a code size regression in a commercial PIC C compiler -- he needs to release a minor firmware update to the field... but not even the original code fits his flash any more...)

It's worth drawing a distinction between benchmarks and regression tests. They need to be written differently. A regression test needs to sensitize a particular condition, and needs to be small enough to be debuggable. A benchmark needs to be "realistic", which often makes them harder to debug. I say we need both. The performance regression tests can easily roll into release criteria. A suite of performance benchmarks is more useful as a confirmatory "measure of goodness" -- but actual mysteries in the aggregate score will most likely be chased with smaller tests.
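
To make the distinction concrete, here is a rough sketch (mine, not an agreed-upon test) of what a targeted size-regression test could look like: a tiny function that exercises exactly one code-generation pattern -- in this case, keeping 8-bit arithmetic out of 16-bit registers -- with pass/fail decided outside the C code by comparing the .text size of the compiled object against a recorded reference value.

    #include <stdint.h>

    /* Entirely 8-bit math; ideally the compiler should not need to
       widen any of it to 16 bits.  The particular pattern is only
       illustrative. */
    uint8_t mix8(uint8_t a, uint8_t b)
    {
        return (uint8_t)((a ^ b) + (a >> 1));
    }

Something that small can have its generated assembly read by hand when it regresses, which is exactly what you cannot do with a benchmark-sized test.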

My guess is that existing tests may help us a lot in the benchmark category, but the regression tests will require some elbow grease on our part to get a good set. There's a good chance we can extract good regression tests from existing benchmark-sized tests.

A semi-related question: how many of these tests can be pushed upstream? If we could get a handful of uCtlr-oriented code size regression tests packaged up so that the developers of the generic optimizer could run them as release criteria, it would, I would think, improve the overall quality of gcc for all uCtlr targets.


There is also an interest in comparing AVR compilers, such as how GCC
compares to IAR, Codevision or ImageCraft compilers.

Who is interested? gcc developers, as a means to keep gcc competitive? Or potential users? The former is benchmarking, the latter is moving towards bench-marketing. Not that marketing is bad, but that sort of thing can be a distraction. In any case, the tests that are meaningful here are the benchmark "overall goodness" test suite, not the targeted test suite.


And sometimes there is an interest in comparing AVR against other
microcontrollers, notably Microchip's PIC and TI's MSP430.

Different processor with same compiler? Different processor with best compiler? -- Now this is beginning to sound like SPEC.


Because there are these different interests, it is challenging to come
up with appropriate code samples to showcase and benchmark these
different issues. But we could also implement this in stages, and focus
on AVR-specific code, and GCC-specific AVR code at that.

Clarity of classification is important. Different buckets for different issues.


If we are going to put together a benchmark test suite, like others
benchmarks for GCC (for larger processors), then I would think that it
would be better to model it somewhat after those other benchmarks. I see
that they tend to use publicly available code, and a variety of
different types of applications.

For benchmarking, and bench-marketing, that's a good approach. I'll be redundant and say those are probably not what you want to be debugging. It would make sense for what I'll call an "avr-gcc dashboard". I see a web page with a bunch of bar graphs on it: a summary bar at the top that is the weighted sum of the individual test bars. As an avr-gcc user, that kind of summary page would be very useful from one release to the next for setting expectations about performance on your own application. As an avr-gcc release master, it's a good dashboard for tracking progress and release-worthiness.
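
Purely as a sketch of how that summary bar might be computed (the test names, weights, and numbers below are invented for illustration, not proposed values): each test contributes its measured size, or run time, relative to a baseline release, scaled by a weight, and the summary is the weighted mean of those ratios.

    #include <stdio.h>

    struct result {
        const char *name;
        double baseline;   /* e.g. .text bytes from the reference release */
        double measured;   /* same metric from the release under test */
        double weight;
    };

    static double composite(const struct result *r, int n)
    {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            num += r[i].weight * (r[i].measured / r[i].baseline);
            den += r[i].weight;
        }
        return num / den;   /* 1.0 = no change, below 1.0 = improvement */
    }

    int main(void)
    {
        struct result r[] = {
            { "uip",      10240.0, 10180.0, 2.0 },
            { "freertos",  8192.0,  8300.0, 1.0 },
        };
        printf("composite: %.3f\n", composite(r, 2));
        return 0;
    }

Whether the ratios get an arithmetic or a geometric mean, and where the weights come from, is exactly the sort of thing that needs the consensus you mention below.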

We should have something similar. Some
suggested projects: FreeRTOS (for the AVR)
Sounds good.
uIP (however, we need to pick a specific implementation of it for the AVR; I have a copy of uIP-Crumb644),
Another good one

the Atmel 802.15.4 MAC,
Need to check license on that one -- but a good choice otherwise

and the GCC version of the
Butterfly firmware. I also have a copy of the "TI Competitive
Benchmark", which they, and other semiconductor companies, have used to
do comparisons between processors.
Not familiar with it. Also, check the license. Processor manufacturers (like, oh, for instance, *all* the several I have worked for) are very touchy about benchmarks and benchmark publications. My sea charts have a notation: "Here be lawyers".


Does anyone have other suggestions on projects to include in the
Benchmark? One area that seems to be lacking is some application that
uses floating point. Any help to find some application in this area
would be much appreciated.
Yup. Floating point is important, but we could probably put together some meaningful synthetic benchmarks pretty quickly. Need to watch the data sets, though, since run time can vary greatly once you get into gradual underflow or NaNs and such. Also, remember these may need to run on a simulator, and need to complete in our lifetime.
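
Something along these lines might serve as a starting point (a sketch only; the kernel and sizes are made up): a small multiply-accumulate loop whose driver keeps the input values in a tame range, say [0.5, 2.0], so no denormals or NaNs ever appear, and whose iteration count is small enough to finish quickly under a simulator.

    #include <stdint.h>

    #define N 32

    /* Plain float multiply-accumulate over well-behaved values. */
    float fir_like(const float *x, const float *c)
    {
        float acc = 0.0f;
        for (uint8_t i = 0; i < N; i++)
            acc += x[i] * c[i];
        return acc;
    }

The driver (not shown) would fill x[] and c[] deterministically so that results are repeatable across runs and across compilers.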


There needs to be some consensus on what we measure, how we measure it,
what output files we want generated, and hopefully some way to
automatically generate composite results. I'm certainly open to anything
in this area. I would think that we need to be as open as possible on
this, with documentation (minimal, it can be a text file) on what are
our methods, how the results were arrived at, but importantly that the
secondary/generated files be available for others to review and verify
the results.
Agree completely.


On practicalities: I am certainly willing to host the benchmark test
suite on the WinAVR project on SourceForge and use its CVS repository.
If it is desired to have it in a more neutral place, such as avr-libc,
I'm open to that too, if Joerg Wunsch is willing.
Seems to me that as long as they are publicly available under an appropriate license, it doesn't really matter much who backs them up :)


Thoughts?

Test categories:
1. float v. scalar
2. targeted test v. benchmark v. published dashboard metric
3. member of quick v. extended v. full test list
4. size v. speed

That unrolls into 36 test lists (2 x 3 x 3 x 2), but the same test may appear multiple times (in both quick and extended, perhaps in both size and speed).

As to priorities, IMO the top two priorities are:
1. targeted scalar size
2. targeted scalar speed

Why? To get tests that target specific optimization regressions. A size regression is more painful to an embedded developer than a speed regression, and floating point math lives largely in a library, so it is less at risk from a compiler optimization regression.

I'm not saying other things are not important; that's just my take on what to tackle first (after infrastructure, of course).

-dave

BTW -- having a defined place to put a performance regression test is a good start. Any performance regression that pops up should have a test written for it and cataloged in the framework.


Thanks,
Eric Weddington





