octave-maintainers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Wrapper for vectorized math libraries


From: John W. Eaton
Subject: Re: Wrapper for vectorized math libraries
Date: Mon, 3 Oct 2011 23:30:51 -0400

On  2-Oct-2011, Dan Davis wrote:

|  However, in this particular case, I now wonder if GCC is following
| the Gnu coding standards, as it has very special flags that enable
| many optimizations with and emit calls specifically to these 2
| libraries.  But that's a discussion for another list.

I don't know the rationale for including the -mveclibabi=X option in
GCC, but I'd be interested in knowing more if you know where that
decision was discussed.

I was curious to know what this option could do, so I tried a few
things.  First I tried compiling

  #include <cmath>
  #include <cstdlib>
  #include <iostream>

  int
  main (void)
  {
    const int n = 10000;

    double x[n], r[n];

    for (int i = 0; i < n; i++)
      x[i] = double (i) / double (n);

    for (int i = 0; i < n; i++)
      r[i] = sin (x[i]);

    for (int i = 0; i < n; i++)
      std::cerr << r[i] << std::endl;

    return 0;
  }

using

  g++ -ftree-vectorize -funsafe-math-optimizations -c -save-temps -O2 
-ftree-vectorizer-verbose=8 -mveclibabi=acml vec2.cc

Looking at the generated assembly, it looks like the second loop is
replaced by a call to __vrd2_sin.  So the vectorization can be done
automatically.  But Octave isn't written exactly like this.  It passes
function pointers to a generic mapping function.  Using

    typedef double (*fptr) (double);

    fptr fcn = sin;

    for (int i = 0; i < n; i++)
      r[i] = fcn (x[i]);

made no difference.  So as long as the function can be inlined and the
pointer value is known at compile time, it seems GCC might be able to
automatically generate calls to the vector library functions.

Octave's code for the mapper function actually look like this:

  template <class U, class F>
  Array<U>
  map (F fcn) const
  {
    octave_idx_type len = length ();

    const T *m = data ();

    Array<U> result (dims ());
    U *p = result.fortran_vec ();

    octave_idx_type i;
    for (i = 0; i < len - 3; i += 4)
      {
        octave_quit ();

        p[i] = fcn (m[i]);
        p[i+1] = fcn (m[i+1]);
        p[i+2] = fcn (m[i+2]);
        p[i+3] = fcn (m[i+3]);
      }

    octave_quit ();

    for (; i < len; i++)
      p[i] = fcn (m[i]);

    return result;
  }

It's inside the Array.h header file, so it could be inlined wherever
it is needed.  I tried a few things with code similar to the following
stripped down version of the above and it appears that the
hand-unrolled loops and the call to octave_quit inside the loop
prevent the automatic vectorization.

The octave_quit function is there so that the loop can be interrupted,
which is something that interactive users normally expect to be able
to do.  In Octave, interrupt handling works by setting a global
variable inside a signal handler, then when it is safe to interrupt
Octave, we throw an exception.  The calls to octave_quit (another
inline function) is where we check to see if an interrupt has happened
and then throw the exception).

But in any case, maybe with some small modifications, we can get the
vectorization to happen automatically.  For example, the following
version of the code will call the vectorized library version of the
sin function:

  #include <cmath>
  #include <cstdlib>
  #include <iostream>

  inline void
  do_mapper (double (*fcn) (double), int len, double *p, const double *m)
  {
    for (int i = 0; i < len; i++)
      p[i] = fcn (m[i]);
  }

  int
  main (void)
  {
    const int n = 10000;
    double x[n], r[n];

    for (int i = 0; i < n; i++)
      x[i] = double (i) / double (n);

    do_mapper (::sin, n, r, x);

    for (int i = 0; i < n; i++)
      std::cerr << r[i] << std::endl;

    return 0;
  }

so it seems we have to remove the signal handling from the function.
If that is acceptable, then you probably don't really need to do
anything other than simplify the Array<T>::map function so that it
does just uses a simple loop (no hand unrolling; no interrupt signal
checking) and then compile with the right options.

It could still be possible to have interrupts work in the loop by
using something like

  BEGIN_INTERRUPT_IMMEDIATELY_IN_FOREIGN_CODE;
  do_mapper (::sin, n, r, x);
  END_INTERRUPT_IMMEDIATELY_IN_FOREIGN_CODE;

These macros are defined in libcruft/misc/quit.h and arrange for
interrupts to be delivered immediately, and to siglongjmp back to the
point of the BEGIN_INTERRUPT_IMMEDIATELY_IN_FOREIGN_CODE macro.  It
should be safe to do this if the code between the BEGIN_/END_ macros
above doesn't allocate any resources.

I'm not sure of the relative cost of checking a global flag variable
inside the loop vs. calling sigsetjmp once outside the loop.  Probably
there is some point at which the single call to sigsetjmp is faster.
So maybe we should consider making this change to some of Octave's
inner loops.  But we must be careful that we don't create resource
leaks in the process, and it would be worth doing some measurements to
see what the effects actually are before making the changes.

In any case, since the vector libraries don't appear to be free
software, it would be a GPL violation to distribute binaries linked to
them.

jwe


reply via email to

[Prev in Thread] Current Thread [Next in Thread]