Re: X'*v in octave
From: Daniel J Sebald
Subject: Re: X'*v in octave
Date: Fri, 22 May 2015 19:01:28 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16
On 05/22/2015 01:21 PM, Rik wrote:
On 05/22/2015 09:00 AM, address@hidden wrote:
Subject: Re: X'*v in octave
From: Jordi Gutiérrez Hermoso <address@hidden>
Date: 05/22/2015 07:11 AM
To: Chih-Jen Lin <address@hidden>
CC: address@hidden, address@hidden, address@hidden
On Fri, 2015-05-22 at 05:05 +0800, Chih-Jen Lin wrote:
> Recently in an experiment I found that for sparse X'*v, Octave seems
> to do X' first and then perform the matrix-vector product. This is
> time consuming, as this operation is simply a linear combination of
> X's columns, and X in Octave is stored in compressed column format.
I may be completely mistaken here, but I think that Octave is already
doing what you suggest. If you attempt the following,
x = sprand(1e5,1e5,0.0001);
y = rand(1e5,1);
then of the possible operations of interest,
tic; x*y; toc
tic; y'*x; toc
tic; x'*y; toc
the third one is the fastest. Are you seeing one of the two others as
being faster?
- Jordi G. H.
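The point in the quoted question is that when X is stored column-wise, X'*v needs no explicit transpose: entry j of the result is just the dot product of column j's stored entries with v. Here is a minimal pure-Python sketch of that idea (the function and variable names are mine for illustration, not Octave's internals):

```python
# Sketch: computing y = X'*v directly from compressed-column (CSC) storage.
# Each result entry is one contiguous pass over a column's stored entries,
# so no transposed copy of X is ever built.

def csc_transpose_matvec(colptr, rowidx, val, v, ncols):
    """y = X' * v for X stored in CSC as (colptr, rowidx, val)."""
    y = [0.0] * ncols
    for j in range(ncols):
        acc = 0.0  # running sum for y[j] can live in a register
        for k in range(colptr[j], colptr[j + 1]):
            acc += val[k] * v[rowidx[k]]
        y[j] = acc
    return y

# X = [[1, 0],
#      [2, 3]] in CSC: column 0 holds rows 0,1 with values 1,2;
#                      column 1 holds row 1 with value 3.
colptr = [0, 2, 3]
rowidx = [0, 1, 1]
val = [1.0, 2.0, 3.0]
v = [10.0, 100.0]
print(csc_transpose_matvec(colptr, rowidx, val, v, 2))  # X'*v = [210.0, 300.0]
```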
Results from testing with the Mercurial ID 561af1ab6099 on the
development branch:
octave:11> tic; x*y; toc
Elapsed time is 0.0312591 seconds.
octave:12> tic; y'*x; toc
Elapsed time is 0.022948 seconds.
octave:13> tic; x'*y; toc
Elapsed time is 0.0130401 seconds.
It appears that Jordi is right: x'*y is already significantly
faster than the straightforward product x*y.
--Rik
I see the same kind of results, i.e., the same relative ratios of
CPU time:
tic; x = sprand(1e5,1e5,0.0001); toc
Elapsed time is 0.290463 seconds.
tic; y = rand(1e5,1); toc
Elapsed time is 0.00246286 seconds.
tic; x*y; toc
Elapsed time is 0.0125189 seconds.
tic; y'*x; toc
Elapsed time is 0.010304 seconds.
tic; x'*y; toc
Elapsed time is 0.00771403 seconds.
It's interesting that x'*y takes half the time of x*y, and note that
there is no imaginary component in the sparse construction. So this
is all a matter of array layout: whoever programmed Octave's sparse
code chose to make the transposed product the more efficient operation.
The difference in performance is explained here:
http://web.eecs.utk.edu/~dongarra/etemplates/node381.html
http://web.eecs.utk.edu/~dongarra/etemplates/node382.html
http://web.eecs.utk.edu/~dongarra/etemplates/node383.html
Without looking much at the code, I suspect it boils down to, in one
case, this formula:
y(i) = y(i) + val(j) * x(col_ind(j))
versus this formula:
y(col_ind(i)) = y(col_ind(i)) + val(i) * x(j)
depending upon the way memory is arranged. The difference between the
two is that the first can keep the running sum for y(i) in the
processor's accumulator across iterations, while the second cannot:
it has to fetch the accumulated value and store it back on every
iteration.
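The two inner loops can be sketched in pure Python for a matrix stored row-wise (CRS), the format those pages use (names are mine, for illustration):

```python
# Gather vs. scatter for a CRS-stored matrix A = (rowptr, colidx, val).
# Gather (y = A*x): one local accumulator per row.
# Scatter (y = A'*x): read-modify-write of a different y entry each step.

def crs_matvec_gather(rowptr, colidx, val, x, nrows):
    """y = A*x: the sum for y[i] stays in one accumulator per row."""
    y = [0.0] * nrows
    for i in range(nrows):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += val[k] * x[colidx[k]]
        y[i] = acc
    return y

def crs_transpose_matvec_scatter(rowptr, colidx, val, x, ncols):
    """y = A'*x: each step updates y[colidx[k]], so the partial sum is
    fetched from and stored back to memory on every iteration."""
    y = [0.0] * ncols
    for i in range(len(rowptr) - 1):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[colidx[k]] += val[k] * x[i]
    return y

# A = [[1, 2],
#      [0, 3]] in CRS
rowptr = [0, 2, 3]
colidx = [0, 1, 1]
val = [1.0, 2.0, 3.0]
x = [10.0, 100.0]
print(crs_matvec_gather(rowptr, colidx, val, x, 2))             # A*x  = [210.0, 300.0]
print(crs_transpose_matvec_scatter(rowptr, colidx, val, x, 2))  # A'*x = [10.0, 320.0]
```

Both loops touch the same nonzeros; only the memory-access pattern of y differs, which is exactly the accumulator-vs-fetch distinction above.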
And I suspect that is the case here, because things even out a bit if
an imaginary component is added to the data the matrix-vector product
is applied to, e.g.:
tic; x = sprand(1e5,1e5,0.0001); toc
Elapsed time is 0.286127 seconds.
tic; x = x + i*x; toc
Elapsed time is 0.0944831 seconds.
tic; y = rand(1e5,1); toc
Elapsed time is 0.00269818 seconds.
tic; y = y + i*y; toc
Elapsed time is 0.00172591 seconds.
tic; x*y; toc
Elapsed time is 0.0214262 seconds.
tic; y'*x; toc
Elapsed time is 0.023473 seconds.
tic; x'*y; toc
Elapsed time is 0.0211461 seconds.
I suppose that in the complex scenario the accumulator can't be used
to great advantage in any of the cases, because each complex
multiply-accumulate has four multiplications and two sums
(real/imaginary) to coordinate. A good optimizing compiler that knows
how many registers are available could probably make the complex case
very efficient, but optimizing compilers generally aren't that good.
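Written out in real arithmetic, the complex multiply-accumulate in the inner loop looks like this (a small illustrative sketch, not Octave's code):

```python
# acc += a*b for complex a and b, in real arithmetic:
# (a_re + i*a_im)*(b_re + i*b_im) = (a_re*b_re - a_im*b_im)
#                                 + i*(a_re*b_im + a_im*b_re)
# i.e. four multiplications plus separate real/imaginary sums, roughly
# doubling the register pressure of the real-valued loop.

def cplx_mac(acc_re, acc_im, a_re, a_im, b_re, b_im):
    acc_re += a_re * b_re - a_im * b_im  # real part: 2 mults, 1 sub
    acc_im += a_re * b_im + a_im * b_re  # imag part: 2 mults, 1 add
    return acc_re, acc_im

# (1+2i)*(3+4i) = 3 - 8 + (4+6)i = -5 + 10i
print(cplx_mac(0.0, 0.0, 1.0, 2.0, 3.0, 4.0))  # (-5.0, 10.0)
```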
Whether the column/row arrangement was different at some point in
Octave/BLAS, I don't know, but perhaps you are using an older version
of either Octave or BLAS for which x'*y is not the efficient direction.
Dan