octave-maintainers

Re: X'*v in octave


From: Daniel J Sebald
Subject: Re: X'*v in octave
Date: Fri, 22 May 2015 19:01:28 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16

On 05/22/2015 01:21 PM, Rik wrote:
On 05/22/2015 09:00 AM, address@hidden wrote:
Subject: Re: X'*v in octave
From: Jordi Gutiérrez Hermoso <address@hidden>
Date: 05/22/2015 07:11 AM
To: Chih-Jen Lin <address@hidden>
CC: address@hidden, address@hidden, address@hidden



On Fri, 2015-05-22 at 05:05 +0800, Chih-Jen Lin wrote:
>  Recently in an experiment I found that for sparse X'*v octave seems
>  to do X' first and then conduct the matrix vector product. This is
>  time consuming as this operation is simply a linear combination of
>  X's columns and X in Octave is stored in compressed column format.
I may be completely mistaken here, but I think that Octave is already
doing what you suggest. If you attempt the following,

     x = sprand(1e5,1e5,0.0001);
     y = rand(1e5,1);

then of the possible operations of interest,

     tic; x*y; toc
     tic; y'*x; toc
     tic; x'*y; toc

the third one is the fastest. Are you seeing one of the two others as
being faster?

- Jordi G. H.


Results from testing with the Mercurial ID 561af1ab6099 on the
development branch:

octave:11>     tic; x*y; toc
Elapsed time is 0.0312591 seconds.
octave:12>     tic; y'*x; toc
Elapsed time is 0.022948 seconds.
octave:13>     tic; x'*y; toc
Elapsed time is 0.0130401 seconds.

It appears that Jordi is right: x'*y is already significantly faster
than the straightforward product x*y.

--Rik

I see the same type of results, i.e., the same relative ratios of CPU time:

tic; x = sprand(1e5,1e5,0.0001); toc
Elapsed time is 0.290463 seconds.
tic; y = rand(1e5,1); toc
Elapsed time is 0.00246286 seconds.
tic; x*y; toc
Elapsed time is 0.0125189 seconds.
tic; y'*x; toc
Elapsed time is 0.010304 seconds.
tic; x'*y; toc
Elapsed time is 0.00771403 seconds.

It's interesting that x'*y takes half the time of x*y, and note that there is no imaginary component in the sparse construction, so this is purely a matter of array layout. Whoever programmed Octave's sparse code chose to make the transposed product the more efficient operation.
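For concreteness, here is a minimal pure-Python sketch of what "compressed column format" means (illustration only, not Octave's internal C++ code; the array names are hypothetical):

```python
# A 3x3 sparse matrix stored in compressed column (CSC) form.
# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
val     = [1.0, 4.0, 3.0, 2.0, 5.0]   # nonzero values, column by column
row_ind = [0,   2,   1,   0,   2]     # row index of each nonzero
col_ptr = [0, 2, 3, 5]                # val[col_ptr[j]:col_ptr[j+1]] is column j

# y = A*x is then a linear combination of A's columns, no transpose needed:
x = [1.0, 1.0, 1.0]
y = [0.0] * 3
for j in range(3):                    # for each column j of A
    for k in range(col_ptr[j], col_ptr[j + 1]):
        y[row_ind[k]] += val[k] * x[j]

print(y)                              # -> [3.0, 3.0, 9.0]
```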

The difference in performance is explained here:

http://web.eecs.utk.edu/~dongarra/etemplates/node381.html
http://web.eecs.utk.edu/~dongarra/etemplates/node382.html
http://web.eecs.utk.edu/~dongarra/etemplates/node383.html

Without looking much at the code, I suspect it boils down to, in one case, this formula:

           y(i) = y(i) + val(j) * x(col_ind(j))

versus this formula:

           y(row_ind(i)) = y(row_ind(i)) + val(i) * x(j)

depending upon the way memory is arranged. The difference between these two formulas is that the first can keep the running sum for y(i) in a processor register (accumulator) for the whole inner loop, while the second cannot: it has to load and store the partial sum of a different y entry on every iteration.
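In runnable form, the two access patterns look roughly like this (a pure-Python sketch of the two formulas above; function names and the square-matrix assumption are mine, not Octave's):

```python
# Row-oriented (CRS-style) product: the sum for y[i] stays in one scalar
# accumulator for the entire inner loop; one store per row.
def matvec_accumulator(val, col_ind, row_ptr, x):
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0                          # can live in a register
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += val[j] * x[col_ind[j]]  # y(i) = y(i) + val(j)*x(col_ind(j))
        y[i] = acc
    return y

# Column-oriented (CCS-style) product: each iteration updates a *different*
# entry of y, so the partial sum is loaded and stored on every iteration.
def matvec_scatter(val, row_ind, col_ptr, x):
    n = len(col_ptr) - 1
    y = [0.0] * n
    for j in range(n):
        for i in range(col_ptr[j], col_ptr[j + 1]):
            y[row_ind[i]] += val[i] * x[j]  # load/store y each time
    return y
```

Both produce the same y for the same matrix; the difference is purely in the memory access pattern of the inner loop.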

And I suspect that is the case here, because things even out a bit if an imaginary component is added to the data to which the matrix/vector product is applied, e.g.:

tic; x = sprand(1e5,1e5,0.0001); toc
Elapsed time is 0.286127 seconds.
tic; x = x + i*x; toc
Elapsed time is 0.0944831 seconds.
tic; y = rand(1e5,1); toc
Elapsed time is 0.00269818 seconds.
tic; y = y + i*y; toc
Elapsed time is 0.00172591 seconds.
tic; x*y; toc
Elapsed time is 0.0214262 seconds.
tic; y'*x; toc
Elapsed time is 0.023473 seconds.
tic; x'*y; toc
Elapsed time is 0.0211461 seconds.

I suppose that in all of the latter examples the complex arithmetic means the accumulator can't be used to as great an advantage, because each complex operation has four multiplications and two additions (real/imaginary) to coordinate. A good optimizing compiler that knows how many registers are available could probably make the complex case very efficient, but optimizing compilers generally aren't that good.
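The operation count is easy to check by expanding one complex multiply by hand (a quick sanity check, not anything Octave-specific):

```python
# (a+bi)*(c+di) = (ac - bd) + (ad + bc)i:
# four real multiplications and two real additions/subtractions,
# before the result is even accumulated into y.
a, b = 1.5, -2.0   # real/imag parts of the first factor
c, d = 0.5,  3.0   # real/imag parts of the second factor

re = a * c - b * d   # 2 multiplies, 1 subtraction
im = a * d + b * c   # 2 multiplies, 1 addition

assert complex(re, im) == complex(a, b) * complex(c, d)
```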

Whether the column/row arrangement was different at one point for Octave/BLAS, I don't know, but perhaps you are using an older version of either Octave or BLAS for which x'*y is not the efficient direction.

Dan


