[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[lmi] Performance of bourn_cast

From: Greg Chicares
Subject: [lmi] Performance of bourn_cast
Date: Sat, 18 Mar 2017 16:17:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0

Here are optimized and unoptimized speed measurements for the new cast
added in commit 5e36b0a6bc9fed0788b6f74a9896387c8a729cb4:

$make $coefficiency unit_tests unit_test_targets=bourn_cast_test.exe

  direct: 3.159e-004 s =     315883 ns, mean of 100 iterations
  S to U: 2.334e-002 s =   23343216 ns, mean of  43 iterations
  U to S: 2.501e-002 s =   25008472 ns, mean of  40 iterations

$make $coefficiency build_type=unoptimized CXXFLAGS='-O0' unit_tests 

  direct: 2.282e-003 s =    2281958 ns, mean of 100 iterations
  S to U: 2.772e-002 s =   27718286 ns, mean of  37 iterations
  U to S: 2.683e-002 s =   26828771 ns, mean of  38 iterations

That may be good enough for production use. We're using boost-1.33.1,
which has an "optimized" numeric_cast in a header-only library that
isn't too enormous:

$i686-w64-mingw32-g++ -E 
/opt/lmi/third_party/include/boost/numeric/conversion/cast.hpp -I 
/opt/lmi/third_party/include -DBOOST_STRICT_CONFIG 2>&1 |wc
   7280   19577  203919

but we want to replace it anyway because unknown issues may lurk in
those 203919 bytes. It's a good idea first to test its performance:

diff --git a/bourn_cast_test.cpp b/bourn_cast_test.cpp
index 1732e98..7d774d2 100644
--- a/bourn_cast_test.cpp
+++ b/bourn_cast_test.cpp
@@ -21,7 +21,13 @@
 #include "pchfile.hpp"
-#include "bourn_cast.hpp"
+//#include "bourn_cast.hpp"
+#include <boost/cast.hpp>
+template<typename To, typename From>
+To bourn_cast(From from)
+    return boost::numeric_cast<To>(from);
 #include "miscellany.hpp"               // stifle_warning_for_unused_variable()
 #include "test_tools.hpp"

[Ignore the test failures, all of which indicate only that the boost
implementation throws different exceptions than bourn_cast.]

Without gcc optimization, it performs no better than bourn_cast:

  direct: 2.276e-003 s =    2275670 ns, mean of 100 iterations
  S to U: 2.352e-002 s =   23517581 ns, mean of  43 iterations
  U to S: 2.823e-002 s =   28228958 ns, mean of  36 iterations

However, with gcc optimization, boost::numeric_cast is almost two
orders of magnitude faster, and almost as fast as the "direct" case,
which is a plain, unguarded static_cast:

  direct: 4.187e-004 s =     418724 ns, mean of 100 iterations
  S to U: 4.266e-004 s =     426604 ns, mean of 100 iterations
  U to S: 4.957e-004 s =     495684 ns, mean of 100 iterations

Reverting the patch above, let's try everything we can think of to
hand-optimize bourn_cast:

diff --git a/bourn_cast.hpp b/bourn_cast.hpp
index 87b61bb..6d4dffe 100644
--- a/bourn_cast.hpp
+++ b/bourn_cast.hpp
@@ -80,12 +80,18 @@ To bourn_cast(From from)
     using from_traits = std::numeric_limits<From>;
     static_assert(  to_traits::is_specialized, "");
     static_assert(from_traits::is_specialized, "");
+    static constexpr bool to_is_unsigned = !to_traits::is_signed;
+    static constexpr bool from_is_signed = from_traits::is_signed;
+    static constexpr To lower_bound = to_traits::lowest();
+    static constexpr To upper_bound = to_traits::max();
+    static constexpr bool must_test_lower = from_traits::lowest() < 
+    static constexpr bool must_test_upper = upper_bound < from_traits::max();
-    if(! to_traits::is_signed && from < 0)
+    if(to_is_unsigned && from_is_signed && from < 0)
         throw std::runtime_error("Cast would convert negative to unsigned.");
-    if(from_traits::is_signed && from < to_traits::lowest())
+    if(from_is_signed && must_test_lower && from < lower_bound)
         throw std::runtime_error("Cast would transgress lower limit.");
-    if(to_traits::max() < from)
+    if(must_test_upper && upper_bound < from)
         throw std::runtime_error("Cast would transgress upper limit.");
     return static_cast<To>(from);
 #   if defined __GNUC__

Maybe that helped a little in the middle ("S to U") case, without

  direct: 2.260e-003 s =    2260220 ns, mean of 100 iterations
  S to U: 2.392e-002 s =   23917486 ns, mean of  42 iterations
  U to S: 2.733e-002 s =   27328600 ns, mean of  37 iterations

But it didn't help at all with compiler optimization:

  direct: 3.183e-004 s =     318337 ns, mean of 100 iterations
  S to U: 2.315e-002 s =   23146282 ns, mean of  44 iterations
  U to S: 2.550e-002 s =   25502980 ns, mean of  40 iterations

How can Cacciola have made his reimplementation so much faster?
His proposal "N1879=05-0139" claims to have:

| fixed all the range-checking problems of the initial version and
| even gained compile-time optimizations by totally avoiding the
| range-check when unnecessary.

where "the initial version" means Henney's original, from which
bourn_cast is derived. Yet the hand-optimized bourn_cast certainly
doesn't perform any unnecessary run-time checks.

After pondering this for a long time and rereading Cacciola's paper
again and again, I stumbled upon his secret. SPOILER ALERT--this
astounding secret is revealed below. Retesting bourn_cast with this
amazing improvement applied:

  direct: 2.248e-003 s =    2247836 ns, mean of 100 iterations
  S to U: 2.392e-002 s =   23920379 ns, mean of  42 iterations
  U to S: 2.537e-002 s =   25372050 ns, mean of  40 iterations

  direct: 4.188e-004 s =     418799 ns, mean of 100 iterations
  S to U: 4.192e-004 s =     419169 ns, mean of 100 iterations
  U to S: 4.851e-004 s =     485088 ns, mean of 100 iterations

so now it's as fast as the "optimized" boost version.

Are any of the hand optimizations above worth the complexity they add?
Here's a comparison where each line is the median of five runs, both
using the astonishing speedup technique:

'-O2', "hand optimized" code:
  direct: 4.184e-004 s =     418430 ns, mean of 100 iterations
  S to U: 4.214e-004 s =     421380 ns, mean of 100 iterations
  U to S: 4.920e-004 s =     491967 ns, mean of 100 iterations

'-O2', original commit:

  direct: 4.190e-004 s =     418974 ns, mean of 100 iterations
  S to U: 4.208e-004 s =     420830 ns, mean of 100 iterations
  U to S: 5.151e-004 s =     515096 ns, mean of 100 iterations

There does seem to be a slight improvement. I wouldn't suppose that
caching numeric_traits values in static constexpr statements has any
benefit. Perhaps 'must_test_lower' and 'must_test_upper' actually
do help, though the timing differences are so small that they don't
really prove anything; yet they might make a measurable difference in
some case other than those tested. Vadim, what do you think?

SPOILER: The amazing secret? Add 'inline'. I had supposed that these
days that keyword is mainly useful for avoiding ODR violations, and
that the compiler would perform inlining whenever it would help,
without any hint from me--so I was surprised.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]