octave-maintainers

Re: conv2 performance


From: Jaroslav Hajek
Subject: Re: conv2 performance
Date: Wed, 3 Mar 2010 10:49:50 +0100

On Tue, Mar 2, 2010 at 11:26 AM, Jaroslav Hajek <address@hidden> wrote:
> On Tue, Mar 2, 2010 at 10:56 AM, Jaroslav Hajek <address@hidden> wrote:
>> On Mon, Mar 1, 2010 at 9:46 AM, Jaroslav Hajek <address@hidden> wrote:
>>> On Sun, Feb 28, 2010 at 4:53 PM, John W. Eaton <address@hidden> wrote:
>>>> I received a complaint that Octave's conv2 function is about 5 times
>>>> slower than Matlab's.  By making a few simple changes, I was able to
>>>> improve the performance some, but not by a factor of 5.  My changes
>>>> are here:
>>>>
>>>>  http://hg.savannah.gnu.org/hgweb/octave/rev/dc8637fd7a76
>>>>
>>>> On my system, I see the following improvement after these changes:
>>>>
>>>>  > n = 1024; m = 64; x = rand(n,n); k = rand(m,m); tic; y = conv2(x,k); toc
>>>>
>>>>  old: Elapsed time is 26.2576 seconds.
>>>>  new: Elapsed time is 13.6104 seconds.
>>>>
>>>> and
>>>>
>>>>  > n = 1024; m = 512; x = rand(n,1); k = rand(m,1); m = rand (m,n);
>>>>  > tic; y = conv2(x,k,m); toc
>>>>
>>>>  old: Elapsed time is 8.53273 seconds.
>>>>  new: Elapsed time is 3.27103 seconds.
>>>>
>>>> Maybe there is still room for improvement here.  I would happily use a
>>>> free software library with a GPL-compatible license to implement this
>>>> function, but I don't know whether one is available.
>>>>
>>>> jwe
>>>>
>>>>
>>>
>>>
>>> I have some ideas for speeding up the stock conv2. Using an existing
>>> library would of course be better, but I don't know of a suitable one.
>>> Unless anyone can come up with an existing solution, I'll try to
>>> convert the ideas to code eventually.
>>>
>>>
>>
>> I think John Swensen has a point in that FFT-based convolution is
>> asymptotically better. 4th-order loops can outperform it only for
>> small kernels (the meaning of "small" depending on implementation).
>> Hence I think any loop-based code should primarily target small
>> kernels.
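>>
>> (For reference, here is a minimal Octave sketch of the FFT route, just
>> to make the asymptotics concrete; this is not code I plan to commit.
>> The "full" convolution is a pointwise product of zero-padded 2-D FFTs:
>>
>>   [ma, na] = size (x); [mb, nb] = size (k);
>>   ## pad both operands to the full output size, multiply the spectra,
>>   ## and transform back
>>   y = real (ifft2 (fft2 (x, ma+mb-1, na+nb-1) .* fft2 (k, ma+mb-1, na+nb-1)));
>>
>> and the cost is governed by the padded output size rather than by the
>> product MA*NA*MB*NB.)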
>>
>> I committed an initial implementation of convolutions in liboctave:
>> http://hg.savannah.gnu.org/hgweb/octave/rev/978f5c94b11f
>>
>> summary:
>> This instantiates specialized code for all kernels up to 8x8 in size.
>> Larger kernels are split into 8x8 blocks.
>> The 4th-order loop is arranged so that the innermost two loops have
>> fixed sizes and hence can be completely unrolled.
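>>
>> Roughly, the splitting amounts to the following (an Octave sketch of
>> the idea; the committed code consists of C++ template instantiations
>> in liboctave, so take the details here with a grain of salt):
>>
>>   [ma, na] = size (x); [mb, nb] = size (k);
>>   y = zeros (ma+mb-1, na+nb-1);
>>   for i = 1:8:mb
>>     for j = 1:8:nb
>>       ## each kernel tile is at most 8x8, so it hits one of the
>>       ## specialized, fully unrolled instantiations; the partial
>>       ## results are shifted and accumulated
>>       t = conv2 (x, k(i:min(i+7,mb), j:min(j+7,nb)));
>>       y(i:i+rows(t)-1, j:j+columns(t)-1) += t;
>>     endfor
>>   endfor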
>>
>> I used the attached benchmark to measure the time for convolution of a
>> 1000x1000 matrix with all kernel sizes up to 10x10.
>> Times are averaged over 5 runs. Results on my computer (Core 2 Duo @
>> 2.83 GHz, g++ -O3 -march=native) are attached.
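>>
>> The benchmark itself is in the attachment; in outline it does
>> something like this (not necessarily verbatim), storing the resulting
>> 10x10 matrix of timings:
>>
>>   x = rand (1000, 1000);
>>   tocs = zeros (10, 10);
>>   for m = 1:10
>>     for n = 1:10
>>       k = rand (m, n);
>>       tic;
>>       for rep = 1:5          # average over 5 runs
>>         y = conv2 (x, k);
>>       endfor
>>       tocs(m,n) = toc / 5;
>>     endfor
>>   endfor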
>>
>> octave:1> load all_data
>>
>> with a recent tip:
>>
>> octave:2> tocs_oct0
>> tocs_oct0 =
>>
>>   0.012844   0.013554   0.016711   0.022400   0.023357   0.026729   0.030247   0.033534   0.036832   0.040352
>>   0.021440   0.029549   0.039860   0.042910   0.046509   0.043920   0.050186   0.055437   0.060768   0.064602
>>   0.014037   0.022069   0.032884   0.041644   0.046854   0.052915   0.061239   0.067863   0.073653   0.079161
>>   0.015658   0.030140   0.034210   0.045213   0.052253   0.060553   0.069365   0.078562   0.086291   0.093605
>>   0.019656   0.028900   0.045464   0.047990   0.058807   0.066490   0.079297   0.088274   0.097667   0.108786
>>   0.022072   0.033146   0.047572   0.060943   0.070684   0.081349   0.092902   0.100911   0.115715   0.127048
>>   0.021898   0.030693   0.043972   0.058956   0.077249   0.084425   0.100613   0.111503   0.126935   0.141250
>>   0.021071   0.037299   0.050950   0.070723   0.081438   0.101516   0.114788   0.125197   0.141521   0.158815
>>   0.024811   0.036528   0.052532   0.073733   0.095367   0.101299   0.117457   0.132726   0.147585   0.165222
>>   0.024415   0.045528   0.062038   0.077777   0.094293   0.110504   0.127776   0.146341   0.163502   0.178777
>>
>> with the new patch:
>>
>> octave:3> tocs_oct1
>> tocs_oct1 =
>>
>>   0.0074930   0.0093061   0.0101123   0.0059468   0.0067618   0.0073412   0.0095744   0.0111704   0.0155106   0.0155500
>>   0.0056431   0.0061924   0.0084282   0.0121774   0.0137026   0.0166219   0.0200053   0.0235966   0.0279384   0.0284500
>>   0.0060744   0.0084982   0.0126268   0.0165213   0.0221927   0.0273326   0.0332757   0.0390677   0.0434884   0.0455091
>>   0.0063169   0.0110016   0.0172338   0.0247400   0.0311880   0.0387136   0.0465136   0.0544024   0.0592641   0.0642496
>>   0.0066828   0.0141782   0.0221559   0.0309312   0.0406902   0.0501870   0.0602919   0.0492872   0.0548416   0.0618096
>>   0.0084610   0.0162350   0.0308314   0.0393550   0.0504898   0.0641784   0.0500531   0.0585372   0.0652503   0.0736086
>>   0.0105060   0.0235956   0.0373542   0.0465218   0.0594338   0.0500659   0.0609870   0.0713593   0.0790979   0.0899705
>>   0.0148870   0.0233453   0.0382854   0.0579996   0.0480443   0.0571550   0.0695515   0.0805356   0.0899501   0.1026436
>>   0.0159232   0.0315198   0.0468793   0.0580004   0.0528670   0.0641594   0.0777459   0.0901399   0.1043304   0.1173685
>>   0.0191256   0.0280614   0.0450657   0.0671604   0.0608538   0.0769042   0.0929764   0.1040684   0.1214832   0.1341881
>>
>> using Matlab 2007a:
>>
>> octave:4> tocs_matlab
>> tocs_matlab =
>>
>>   0.0062376   0.0139384   0.0180750   0.0221970   0.0220130   0.0257478   0.0305874   0.0344506   0.0383642   0.0428446
>>   0.0109510   0.0266958   0.0351854   0.0458640   0.0620206   0.0698450   0.0828816   0.0961360   0.1055158   0.1111318
>>   0.0134432   0.0272864   0.0418616   0.0544420   0.0615062   0.0739450   0.0873426   0.0982644   0.1111086   0.1225830
>>   0.0184896   0.0378428   0.0543014   0.0658802   0.0816770   0.0981762   0.1139014   0.1303218   0.1464620   0.1628398
>>   0.0219104   0.0428282   0.0660560   0.0862450   0.1027792   0.1222084   0.1425500   0.1629794   0.1833898   0.2025316
>>   0.0265670   0.0535164   0.0782622   0.0981846   0.1218348   0.1462216   0.1700228   0.1945714   0.2187592   0.2437806
>>   0.0340150   0.0575714   0.0851972   0.1174214   0.1405062   0.1686618   0.1965610   0.2246892   0.2526790   0.2803052
>>   0.0348728   0.0733926   0.1020340   0.1292908   0.1611636   0.1934834   0.2251806   0.2569800   0.2883594   0.3213730
>>   0.0423292   0.0779326   0.1099018   0.1503224   0.1818022   0.2175340   0.2541414   0.2897038   0.3259232   0.3611042
>>   0.0422816   0.0899742   0.1258612   0.1622720   0.2018848   0.2427072   0.2823120   0.3228700   0.3632112   0.4034948
>>
>> You can see that Matlab is actually slower than Octave in most cases
>> (even compared with the previous implementation).
>>
>> The speed-ups (in %) of the new code compared to the old one (more is
>> better, 0 is no speed-up)
>>
>> octave:5> 100 * (tocs_oct0 ./ tocs_oct1 - 1)
>> ans =
>>
>>    71.410    45.645    65.257   276.671   245.425   264.099   215.913   200.206   137.462   159.501
>>   279.935   377.184   372.942   252.378   239.414   164.228   150.863   134.938   117.508   127.072
>>   131.090   159.694   160.428   152.063   111.125    93.595    84.036    73.706    69.362    73.946
>>   147.882   173.962    98.506    82.752    67.542    56.413    49.128    44.409    45.604    45.689
>>   194.121   103.835   105.200    55.150    44.523    32.485    31.522    79.101    78.090    76.002
>>   160.868   104.162    54.297    54.854    39.997    26.754    85.606    72.388    77.340    72.599
>>   108.435    30.080    17.717    26.727    29.975    68.627    64.974    56.256    60.479    56.995
>>    41.537    59.772    33.078    21.937    69.506    77.616    65.040    55.455    57.333    54.725
>>    55.819    15.888    12.059    27.125    80.391    57.886    51.079    47.244    41.459    40.772
>>    27.655    62.245    37.662    15.809    54.951    43.691    37.429    40.620    34.589    33.229
>>
>> Apparently (and as expected), the small kernels are the most affected
>> ones. I think 8x8 is unnecessarily high (it generates 64 instantiated
>> functions); I'll probably reduce that to 7x5 or something like that.
>>
>> For comparison, speed-ups compared to Matlab:
>>
>> ans =
>>
>>   -16.755    49.776    78.742   273.261   225.548   250.730   219.470   208.410   147.342   175.528
>>    94.059   331.103   317.473   276.633   352.618   320.198   314.298   307.414   277.673   290.621
>>   121.310   221.083   231.530   229.527   177.146   170.538   162.481   151.523   155.490   169.359
>>   192.703   243.976   215.087   166.290   161.886   153.596   144.878   139.552   147.134   153.449
>>   227.861   202.071   198.142   178.828   152.589   143.506   136.433   230.673   234.399   227.670
>>   213.994   229.637   153.839   149.484   141.306   127.836   239.685   232.389   235.262   231.185
>>   223.769   143.992   128.079   152.401   136.408   236.879   222.300   214.870   219.451   211.552
>>   134.249   214.378   166.509   122.917   235.448   238.524   223.761   219.089   220.577   213.096
>>   165.833   147.250   134.435   159.175   243.886   239.052   226.887   221.394   212.395   207.667
>>   121.073   220.633   179.284   141.619   231.754   215.597   203.638   210.248   198.981   200.693
>>
>> Hmmm. Maybe there have been some improvements in Matlab since then?
>>
>> There are still a few gotchas:
>>
>> First, only the "full" convolution is treated directly; the other
>> cases are extracted from it. I can try to add direct code for the
>> "valid" case in the future.
>>
>> Also, quoting the sources:
>>
>> // FIXME: Only the axpy-form accumulation is used. This is natural for
>> // outer convolution as it requires no boundary treatment.
>> // This typically requires one more memory store per operation, but as
>> // the memory access pattern is equivalent (switching ro and rw parts),
>> // I wouldn't expect a significant difference. cf. for instance sum(),
>> // where row-wise sum (axpy-based) is no slower than column-wise
>> // (dot-based).
>> // It would be nice, however, if the inner convolution was computed
>> // directly by dot-based accumulation.
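>>
>> In 1-D terms, assuming column vectors x (length ma) and k (length mb),
>> the two accumulation orders the comment talks about look like this
>> (my illustration, not the liboctave code):
>>
>>   y = zeros (ma+mb-1, 1);
>>   ## axpy-form: scatter each input element into the output; every
>>   ## update touches a full kernel-length slice, so no boundary tests
>>   ## are needed
>>   for j = 1:ma
>>     y(j:j+mb-1) += x(j) * k;
>>   endfor
>>
>>   ## dot-form: gather each output element; the window has to be
>>   ## clipped at the boundaries of the "full" result
>>   for i = 1:ma+mb-1
>>     jj = max (1, i-mb+1) : min (i, ma);
>>     y(i) = sum (x(jj) .* k(i+1-jj));
>>   endfor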
>>
>> Looking at the assembler code generated for oct-convn.cc, I would
>> expect the inner loops to be vectorized, but that doesn't appear to be
>> the case - they're only unrolled. This is a bit of a surprise to me,
>> because with newer gcc -ftree-vectorize is enabled at -O3. I googled a
>> little and found that aliasing might be the issue, but adding
>> __restrict__ doesn't seem to help. One more thing to figure out.
>>
>> regards
>>
>
> In any case, convn results are decisive:
>
> old:
> octave:1> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn (a, b); toc
> Elapsed time is 10.5321 seconds.
>
> new:
> octave:2> a = rand (100, 100, 100); b = rand (5, 5, 5); tic; c = convn (a, b); toc
> Elapsed time is 0.206056 seconds.
>
> I removed the old convn implementation, as the new code appears to be
> much faster.
>
> enjoy
>

More updates: I tried the second of my ideas and rewrote the conv2
core routines as a sequence of BLAS calls to xAXPY.
I also reordered the outer loops to play more nicely with this approach.
Somewhat surprisingly, I generally get much better numbers, using the
same benchmark as before:

octave:2> tocs
tocs =

   0.0089948   0.0063681   0.0065102   0.0103934   0.0080262   0.0085852   0.0090431   0.0096444   0.0103502   0.0117714
   0.0072873   0.0096438   0.0105184   0.0083142   0.0097248   0.0108656   0.0110642   0.0125764   0.0135653   0.0140663
   0.0066899   0.0075280   0.0108786   0.0133753   0.0110940   0.0128937   0.0142640   0.0152412   0.0170234   0.0176910
   0.0077328   0.0103620   0.0120998   0.0120170   0.0129755   0.0151224   0.0170028   0.0185525   0.0204348   0.0216798
   0.0073034   0.0090684   0.0131918   0.0162982   0.0148558   0.0173875   0.0195776   0.0213011   0.0236975   0.0254478
   0.0084582   0.0119222   0.0143224   0.0148620   0.0169500   0.0197053   0.0228433   0.0241328   0.0275682   0.0297756
   0.0110824   0.0098990   0.0124908   0.0193302   0.0188160   0.0220484   0.0248796   0.0273018   0.0307710   0.0329692
   0.0092370   0.0135680   0.0170529   0.0178676   0.0206368   0.0245049   0.0276646   0.0303240   0.0343668   0.0379746
   0.0118237   0.0113898   0.0148616   0.0224040   0.0231708   0.0295711   0.0335254   0.0342092   0.0406493   0.0436242
   0.0102231   0.0151086   0.0189038   0.0221970   0.0274470   0.0323122   0.0336478   0.0393838   0.0466128   0.0452511

Only the smallest kernels are somewhat worse, and not by much. Also,
those tiny timings fluctuate wildly, so even that is not certain (I'd
need to recompile to average over more runs). The bigger kernels seem
to be up to 3x faster.
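
For the record, the axpy formulation is essentially the following
(an Octave paraphrase of the C++ core, so the actual loop order in the
committed code may differ; in the C++ code each block update below is
issued column by column, one xAXPY call per column):

  [ma, na] = size (x); [mb, nb] = size (k);
  y = zeros (ma+mb-1, na+nb-1);
  for j = 1:nb
    for i = 1:mb
      ## each kernel element contributes a scaled, shifted copy of x
      y(i:i+ma-1, j:j+na-1) += k(i,j) * x;
    endfor
  endfor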

The change is online:
http://hg.savannah.gnu.org/hgweb/octave/rev/5af0b4bb384d

This is actually the simplest axpy-based approach. There may still be
room for optimizations. With larger kernels, it would surely pay off
to split the matrix and do the convolution by blocks. However, as has
already been said, with larger kernels the FFT-based approach is
generally preferable if speed is of any concern, because a loop-based
approach is always O(MA*NA*MB*NB), whereas the FFT-based one is better:
O(MA*NA*log(MA*NA) + MB*NB*log(MB*NB)).
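
(Back-of-the-envelope, with my numbers: for MA = NA = 1000 and
MB = NB = 10, the loop-based count is MA*NA*MB*NB = 1e8 multiply-adds,
while the padded FFTs work on roughly 1009^2, about 1e6 points, i.e. on
the order of 1e6 * log2(1e6), roughly 2e7 operations per transform, so
the FFT route starts to win well before the kernel gets really large.)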

regards

-- 
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz


