[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[lmi] Robust timings in unit tests [Was: Using auto-vectorization]

From: Greg Chicares
Subject: [lmi] Robust timings in unit tests [Was: Using auto-vectorization]
Date: Sun, 7 May 2017 17:49:29 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0

On 2017-01-24 02:49, Vadim Zeitlin wrote:
> On Tue, 24 Jan 2017 02:26:12 +0000 Greg Chicares <address@hidden> wrote:
> GC> On 2017-01-23 21:26, Vadim Zeitlin wrote:
> GC> TimeAnAliquot() returns the mean, and whenever I study its
> GC> results deeply I wish it returned a more robust statistic such as
> GC> the median or mode, but those would require storing the time spent
> GC> on each iteration,
>  For the standard deviation it wouldn't, I'm almost sure that I've already
> posted the code I use for computing on the fly implementing the algorithm
> given in the section 4.2.2 of "TAoCP, Volume 2: Seminumerical Algorithms".
> GC> and the resulting overhead might perturb timings of very simple
> GC> operations like the N=1 tests here; however, those reservations make
> GC> less sense now that we have excess RAM than they did at the beginning
> GC> of the lmi epoch. See:
> GC> 
> GC> 
> http://stackoverflow.com/questions/10930732/c-efficiently-calculating-a-running-median

I returned to this issue recently, because time and again I've tried
to compare unit-test timings to judge the effect of a code change, and
found it necessary to run a test several times, e.g.:
| Here are median-of-three timings
It would be much more convenient to get a robust measurement in one
step. I summarize my efforts below, mainly to show that everything I
tried doesn't work so that we don't go down this road again.

Starting with the hypothesis that the mean is skewed by outliers,
while a more robust estimator such as the median might not be, I
looked into how we might calculate the median of a running series.
Here's an interesting idea...

refers to
quickselect finds running median in O(n) time
  but too slow for large datasets
alternatively, skiplist: O(log(n))

...but that's too difficult a way to test the hypothesis, so I
considered this idea:

modify this:
  mean += eta * (sample - mean)
to obtain median:
  median += eta * sgn(sample - median)

| set eta from a running estimate of the absolute deviation:
| for each new value sample, update cumadev += abs(sample-median). Then set
| eta = 1.5*cumadev/(k*k), where k is the number of samples seen so far.

That's attractively simple, but I found no theoretical support for it
(in sharp contrast to the known algorithm for mean and variance), so
I rethought the problem. Usually we see "mean of 100 iterations" in
the output, because AliquotTimer<> stops after a hundred trials when
measuring all but the fastest functions. Storing a dataset of 100
values is no big deal, and recording each datum in an array doesn't
take enough time to perturb any really interesting measurement. In
the (rare) case where more than 100 trials are performed, for this
experiment I though it sufficient to store the last 100--see the
experimental patch below [0].

With that patch, I ran 'timer_test' several times. Here I present only
the first three lines of output (which measure actual calculations),
omitting the rest (which measure sleep() and variations thereon).
Slightly reformatted results of three cherry-picked runs out of ten:

                                              lowest  median  highest

  7.063e-08 s =    71 ns, mean of 141577 runs      0       0     1000
  8.138e-06 s =  8138 ns, mean of    123 runs   8000    8000     9000
  8.534e-05 s = 85340 ns, mean of    100 runs  80000   81000   143000
  7.000e-08 s =    70 ns, mean of 142852 runs      0       0     1000
  8.163e-06 s =  8163 ns, mean of    123 runs   8000    8000     9000
  6.893e-05 s = 68930 ns, mean of    100 runs  40000   81000    99000
  7.019e-08 s =    70 ns, mean of 142477 runs      0       0     1000
  8.113e-06 s =  8113 ns, mean of    124 runs   8000    8000     9000
  4.285e-05 s = 42850 ns, mean of    100 runs  40000   41000    88000

Look at every third row. If I run this repeatedly, I get an average
in the neighborhood of ~40000 ns, with occasional outlying averages
up to ~80000 ns. Generally, when reporting a median-of-three timing,
if I see such outliers, I discard all results and use three new runs.
It must be emphasized that these outliers are not individual timings,
but rather averages calculated over a small fraction of a second. The
median actually deviates from the long-term true average of ~40000 by
more than the mean in the first two runs above.

This is enough to reject the hypothesis. The mean hasn't been skewed
by a few outlying observations in the set of 100; instead, the entire
set of 100 has been skewed by something else--probably a transient in
the operating system. Therefore, I tried:

$sudo nice --19 make $coefficiency build_type=native CXXFLAGS='-O2 -ggdb -m64 
-Wno-pedantic' CFLAGS='-m64' LDFLAGS='-m64' unit_tests 
unit_test_targets=timer_test 2>&1 |less -S

which I hoped might mitigate that problem. Alas, as these two
consecutive trials cherry-picked from a series of ten show...

                                              lowest  median  highest

  7.027e-08 s =    70 ns, mean of 142318 runs      0       0     1000
  8.281e-06 s =  8281 ns, mean of    121 runs   8000    8000    17000
  6.133e-05 s = 61330 ns, mean of    100 runs  40000   65000   106000
  7.027e-08 s =    70 ns, mean of 142305 runs      0       0     1000
  8.163e-06 s =  8163 ns, mean of    123 runs   8000    8000     9000
  4.904e-05 s = 49040 ns, mean of    100 runs  40000   41000   111000

...even twiddling 'niceness' doesn't overcome the problem. (Therefore,
I won't even try to use 'nice' with 'wine', which looks nontrivial.)

Conclusion: unless we install a real-time kernel, there's no way to
overcome the inherent instability except by averaging trials that are
deliberately spaced some seconds apart, discarding outliers (as I've
laboriously been doing all along).

An argument might be made for reporting the lowest measurement rather
than the mean. Provided that a function takes much longer than the
clock resolution we're using, it seems reasonable to imagine that the
"true" timing is that minimum, and each observation is that "true"
value plus some *positive* amount of noise, assuming that "noise" can
never be negative--which seems open to question (what if the high-
resolution timer is "noisy"?). Yet observe that, in the first of the
five sets of results reported above, the minimum on the third line is
almost as far from the long-term average as the mean or the median.


[0] "the experimental patch below"

Having captured 100 data, we could use something like the BFPRT
median-of-medians estimator, but it's easier to compute the actual
median--and this is done after the data have been captured, so the
efficiency of the median calculation doesn't matter. The right tool
is std::nth_element(), but for this experiment I used std::sort() to
guard against the possibility of a foolish coding mistake because I
hadn't used nth_element() before.

Here, <array> seems more natural, but
    std::array<double,100> metages = {0};
didn't seem to initialize everything to zero as I would expect with a
C array, so I used std::vector instead.

diff --git a/timer.hpp b/timer.hpp
index 4f3a6fa..d633a4e 100644
--- a/timer.hpp
+++ b/timer.hpp
@@ -40,6 +40,8 @@
     typedef std::clock_t elapsed_t;
 #endif // Unknown platform.
+#include <array>
+#include <algorithm>                    // std::nth_element()
 #include <iomanip>
 #include <ios>
 #include <ostream>
@@ -201,6 +203,8 @@ AliquotTimer<F>& AliquotTimer<F>::operator()()
         return *this;
+//  std::array<double,100> metages = {0};
+    std::vector<double> metages(100);
     Timer timer;
     double const limit = max_seconds_ * static_cast<double>(timer.frequency_);
     double const start_time = static_cast<double>(timer.time_when_started_);
@@ -208,15 +212,22 @@ AliquotTimer<F>& AliquotTimer<F>::operator()()
     double const expiry_max = start_time +        limit;
     int j = 0;
-        (elapsed_t now = 0
+        (elapsed_t now = timer.inspect(), last = timer.inspect()
         ;now < expiry_min || j < 100 && now < expiry_max
-        ;++j, now = timer.inspect()
+        ;++j
+        last = now;
+        now = timer.inspect();
+        metages[j % 100] = now - last;
     unit_time_ = timer.elapsed_seconds() / j;
+    std::nth_element(metages.begin(), metages.begin() + 50, metages.end());
+    // in case that was miscoded:
+    std::sort(metages.begin(), metages.end());
+    double adj = static_cast<double>(timer.frequency_);
     std::ostringstream oss;
         << std::scientific << std::setprecision(3)
@@ -227,6 +238,10 @@ AliquotTimer<F>& AliquotTimer<F>::operator()()
         << " ns, mean of "
         << std::setw( 3) << j
         << " iterations"
+        // 51st of 100 is not quite the median...
+        << " " << std::setw(10) << 1.0e9 * metages[ 0] / adj
+        << " " << std::setw(10) << 1.0e9 * metages[50] / adj
+        << " " << std::setw(10) << 1.0e9 * metages[99] / adj
     str_ = oss.str();

reply via email to

[Prev in Thread] Current Thread [Next in Thread]