
Re: bin2dec and cellfun improvements


From: Rik
Subject: Re: bin2dec and cellfun improvements
Date: Sun, 18 Mar 2012 11:25:45 -0700


Daniel,

Here is some extra background.  From the NEWS file describing the changes between 3.4.3 and 3.6.0:

 ** All .m string functions have been modified for better performance or
    greater Matlab compatibility.  Performance gains of 15X-30X have
    been demonstrated.  Operations on cell array of strings no longer pay
    quite as high a penalty as those on 2-D character arrays.

I was the author of those changes.  I did extensive benchmarking to choose an initial algorithm and to tweak the crossover points between algorithms.  Most of the functions were coded to work with both ordinary strings and cell arrays of strings.  The old procedure for handling cell arrays of strings was to recursively call the string function using cellfun.  A typical function layout would be:

function my_string_func (s)

  if (ischar (s))
    ... handle ordinary single string or char matrix ...
  elseif (iscellstr (s))
    cellfun (@my_string_func, s)
  else
    print_usage ()
  endif

endfunction

Although quite straightforward, this method always delivered poor performance.  In practice, it turns out to be much faster to convert a cellstring array to a char matrix, manipulate that, and then convert back to a cellstring array if necessary.  The functions in Octave's core are coded for performance first and readability second.  We expect that core functions will be in the tightest inner loops of users' code, so trading a bit of obtuse coding for a 50%, or even order-of-magnitude, performance gain is justified.

Also, in terms of optimization, we work with what we are given.  Cell arrays of strings are the "modern" way of handling strings but they are not as efficient internally as char matrices.  Until they reach parity, I have to work with the current landscape which means taking advantage of the speed of indexing operations on char matrices.
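
To make that concrete, here is a minimal sketch of the convert-operate-convert pattern (the data and the toupper call are just for illustration):

c = {"foo", "barbaz", "qux"};   ## cell array of strings
s = char (c);                   ## pack into a 2-D char matrix, space padded
s = toupper (s);                ## one vectorized operation, no per-cell calls
c2 = cellstr (s);               ## cellstr strips the trailing padding again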

> 1) Is a routine like cellfun() something that can be made efficient, or is it inherently a slow routine, not because of internal programming, but because there are things inside a function that are outside of programming control? For example, I wrote the little routine:
>
> function retval = test_bin2dec (d)
> retval = cellfun('binchar2dec', d);
> endfunction
>
> function retval = binchar2dec(s)
> s = s(!isspace(s));
> retval = (s-'0')*(2.^[length(s)-1:-1:0])';
> endfunction
>
> but the "length(s)" is simply something that is slow by nature. It would be nice if 2.^[length(s)-1:-1:0] could be precomputed, but it can't. I assume Rik's routine is speedy because of the padding of zeros and then 2.^[length(s)-1:-1:0] (or something similar) just needs computing once and then used in a matrix multiply.

The cellfun function is just somewhat slow.  Here is some code that I have used for benchmarking the related function arrayfun.

----- Code Segment -----

## Declare size of tests
N = 5;
bm = zeros (N, 1);
x = rand (1e6, 1);   ## test data (any large vector will do)

## Uncomment exactly one of the four variants inside the loop.
for i = 1:N
  ## Clear variables for test and pre-assign array
  clear y;
  y = zeros (size (x));

  tic;
  ##################################################
  ## for loop
#  for j = 1:numel (x)
#    y(j) = tan (x(j));
#  endfor

  ## simple arrayfun
#  y = arrayfun (@tan, x);

  ## homebrew indexing
#  y = sin (x) ./ cos (x);

  ## builtin function
#  y = tan (x);

  ##################################################
  bm(i) = toc;

endfor

----- End Code Segment -----

And here are the results, normalized to the builtin function tan:

tan ()   :   1
homebrew :   1.3548
arrayfun :  14.483
for loop : 149.77

As you can see, looping is two orders of magnitude slower than the builtin, which is really abysmal, but that is expected.  On the other hand, using arrayfun is still an order of magnitude slower than calling the function directly.  Similar results hold for cellfun.
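
Your guess about the padding, by the way, is essentially right: once the strings are padded into a rectangular char matrix, the vector of powers needs to be computed only once and the whole conversion collapses into a single matrix multiply.  A minimal sketch of the idea (an illustration, not the actual base2dec source):

c = {"101", "11", "100000"};       ## binary strings of different lengths
s = strjust (char (c), "right");   ## pad into a right-justified char matrix
s(s == " ") = "0";                 ## padding spaces become leading zeros
n = columns (s);
v = (s - "0") * (2 .^ (n-1 : -1 : 0))';   ## one matrix multiply: [5; 3; 32]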

>
> 2) Internally, when a cell array is created, is there some flag or descriptor indicating that the array has a consistent class across the whole array? That is "all_same_class" would indicate the cell array is all "char" or all "double", etc. As an example, the cellstr() routine could, rather than set each cell's class, set the "all_the_same_class" variable and copy the contents of the string matrix. Not much extra work, almost the same as a copy.


As far as I know, there is no such flag.  In any case, that would be just one more reason why cellfun is slow; it doesn't explain why arrayfun is also slow.  With arrayfun, one can guarantee that all of the inputs are of the same type, and yet it is still slow.

My guess, and I'm just speculating at this point, is that the issue lies in the handle that is passed to arrayfun or cellfun.  This is a polymorphic object that has to do a lot of things.  For example, if it points to an m-file, the file's timestamp needs to be checked and the function might need to be compiled into an internal form.  If we could pass the data directly to a function method, I think we would see better results.  This seems to be confirmed by certain operations in cellfun which do call C++ functions directly.  In Octave 3.4.3 these were recognized by passing in the string name of the accelerated function; passing in a function handle would not engage the acceleration.  Thus, in 3.4.3:

c = cell (1e6, 1);
c{1000} = [1 2 3];
tic; y = cellfun ("isempty", c); toc
Elapsed time is 0.01 seconds.
tic; y = cellfun (@isempty, c); toc
Elapsed time is 1 seconds.

So two orders of magnitude are lost when function checking, which probably needs to be done only once at the start of the loop, is instead repeated for every element.


>
> From what I'm seeing of Rik's code--although appearing the best solution given the situation--it seems to me that Octave users are discouraged slightly in a way from using string cells because of the performance hit of convert to/from character matrix. As it stands, there is this flexible cell now, but it only is good for smaller sets of data. On bigger sets of data, say databases, someone might end up programming around cells with the older, more clumsy methods based upon speed. I would think efficiency of cells is an important detail in promoting clean script writing.
>
> 3) There are a number of internal functions that are optimized. Do they work with "scalars" only? Or are there methods that are slightly different for when the input argument is a cell? Going back to question 2, it would be nice if the internal C++ functions could work with cells--especially if they are of "all_same_class"--and chug right through computations without all kinds of redundant class/type checking. Again, I don't know this code real well, just trying to get a feel for things.
>
> 4) Now specifically with regard to base2dec, does it seem like this group of functions should have an internal optimized function that all the others call? For example, is base2dec something that could be written in C++? Rik has optimized things by using matrix multiplication. But written internally, the character strings could be of variable length and in some cases save time when there is no padding. (Also, internally if the base is 2 there is probably even further optimization.) And going back to a previous question, if there were a method for working with cell arrays internally that are "all_same_class", there wouldn't be a need to convert cell arrays to a fixed width character matrix.

There is a hierarchy of improvements when optimizing.  The largest change comes from choosing a better algorithm, say replacing a bubble sort with a quicksort, which can yield 1-3 orders of magnitude of improvement.  The next biggest improvement, in Octave, is to do away with looping structures and use vectorization; this will typically get you one order of magnitude.  The next improvement is to change an m-file to a compiled C++ version of the function, which might get you a 2-3X improvement.
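
As a quick illustration of the vectorization step (a toy example, not from the Octave sources):

x = rand (1e6, 1);
## Looped version: one interpreter round trip per element.
y = zeros (size (x));
for i = 1:numel (x)
  y(i) = 2*x(i) + 1;
endfor
## Vectorized version: the same computation in one pass through compiled
## code, typically about an order of magnitude faster.
y = 2*x + 1;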

So, base2dec could be recoded in C++.  My benchmarking showed 1 million 10-digit conversions taking 1.008 seconds; that could probably be reduced to 0.5 seconds by recoding in C++.  This seems a small enough benefit that a programmer would be better off concentrating on other areas of the code, such as bug fixes or places where orders-of-magnitude improvements are possible.  If you want to go ahead and try, however, no one will turn down fast, debugged code.  It would be very easy to test a different implementation by writing your own C++ function and building it with the mkoctfile interface, as sketched below.
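
For instance, a skeleton oct-file might look like this (a hedged sketch: the name binstr2dec and the Horner-style inner loop are mine, not proposed Octave code):

----- Code Segment -----

// binstr2dec.cc -- build with: mkoctfile binstr2dec.cc
#include <octave/oct.h>

DEFUN_DLD (binstr2dec, args, ,
           "y = binstr2dec (s)\n\
Convert each row of the char matrix S from binary to decimal.")
{
  const charMatrix s = args(0).char_matrix_value ();
  const octave_idx_type nr = s.rows ();
  const octave_idx_type nc = s.columns ();

  ColumnVector y (nr, 0.0);

  for (octave_idx_type i = 0; i < nr; i++)
    for (octave_idx_type j = 0; j < nc; j++)
      {
        const char c = s(i, j);
        // Horner scheme; pad spaces are skipped, so variable-length
        // rows need no explicit zero padding.
        if (c == '0' || c == '1')
          y(i) = 2.0 * y(i) + (c - '0');
      }

  return octave_value (y);
}

----- End Code Segment -----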

I'm in the midst of moving so I'm not going to be able to keep up with this thread for several weeks.

Cheers,
Rik

