Re: X'*v in octave
From: Daniel J Sebald
Subject: Re: X'*v in octave
Date: Fri, 22 May 2015 19:01:28 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16
On 05/22/2015 01:21 PM, Rik wrote:
On 05/22/2015 09:00 AM, address@hidden wrote:
Subject: Re: X'*v in octave
From: Jordi Gutiérrez Hermoso <address@hidden>
Date: 05/22/2015 07:11 AM
To: Chih-Jen Lin <address@hidden>
CC: address@hidden, address@hidden, address@hidden
On Fri, 2015-05-22 at 05:05 +0800, Chih-Jen Lin wrote:
> Recently in an experiment I found that for sparse X'*v, Octave seems
> to do X' first and then perform the matrix-vector product. This is
> time consuming, as this operation is simply a linear combination of
> X's columns, and X in Octave is stored in compressed column format.
I may be completely mistaken here, but I think that Octave is already
doing what you suggest. If you attempt the following,
x = sprand(1e5,1e5,0.0001);
y = rand(1e5,1);
then of the possible operations of interest,
tic; x*y; toc
tic; y'*x; toc
tic; x'*y; toc
the third one is the fastest. Are you seeing one of the two others as
being faster?
- Jordi G. H.
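The point in the quoted question is that when X is stored column-wise, X'*v needs no explicit transpose: entry j of the result is just the dot product of column j's stored entries with v. Here is a minimal pure-Python sketch of that idea (the function and variable names are mine for illustration, not Octave's internals):

```python
# Sketch: computing y = X'*v directly from compressed-column (CSC) storage.
# Each result entry is one contiguous pass over a column's stored entries,
# so no transposed copy of X is ever built.

def csc_transpose_matvec(colptr, rowidx, val, v, ncols):
    """y = X' * v for X stored in CSC as (colptr, rowidx, val)."""
    y = [0.0] * ncols
    for j in range(ncols):
        acc = 0.0  # running sum for y[j] can live in a register
        for k in range(colptr[j], colptr[j + 1]):
            acc += val[k] * v[rowidx[k]]
        y[j] = acc
    return y

# X = [[1, 0],
#      [2, 3]] in CSC: column 0 holds rows 0,1 with values 1,2;
#                      column 1 holds row 1 with value 3.
colptr = [0, 2, 3]
rowidx = [0, 1, 1]
val = [1.0, 2.0, 3.0]
v = [10.0, 100.0]
print(csc_transpose_matvec(colptr, rowidx, val, v, 2))  # X'*v = [210.0, 300.0]
```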
Results from testing with the Mercurial ID 561af1ab6099 on the
development branch:
octave:11> tic; x*y; toc
Elapsed time is 0.0312591 seconds.
octave:12> tic; y'*x; toc
Elapsed time is 0.022948 seconds.
octave:13> tic; x'*y; toc
Elapsed time is 0.0130401 seconds.
It appears that Jordi is right: x'*y is already significantly
faster than the straightforward product x*y.
--Rik
I see the same kind of results, i.e., the same relative ratios of
CPU time:
tic; x = sprand(1e5,1e5,0.0001); toc
Elapsed time is 0.290463 seconds.
tic; y = rand(1e5,1); toc
Elapsed time is 0.00246286 seconds.
tic; x*y; toc
Elapsed time is 0.0125189 seconds.
tic; y'*x; toc
Elapsed time is 0.010304 seconds.
tic; x'*y; toc
Elapsed time is 0.00771403 seconds.
It's interesting that x'*y takes half the time of x*y, and note that
there is no imaginary component in the sparse construction. So this
is all a matter of array layout: whoever programmed Octave's sparse
code chose to make the transposed product the more efficient operation.
The difference in performance is explained here:
http://web.eecs.utk.edu/~dongarra/etemplates/node381.html
http://web.eecs.utk.edu/~dongarra/etemplates/node382.html
http://web.eecs.utk.edu/~dongarra/etemplates/node383.html
Without looking much at the code, I suspect it boils down to, in one
case, this formula:
y(i) = y(i) + val(j) * x(col_ind(j))
versus this formula:
y(col_ind(i)) = y(col_ind(i)) + val(i) * x(j)
depending upon the way memory is arranged. The difference between the
two is that the first can keep the running sum for y(i) in the
processor's accumulator across iterations, while the second cannot:
it has to fetch the accumulated value and store it back on every
iteration.
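The two inner loops can be sketched in pure Python for a matrix stored row-wise (CRS), the format those pages use (names are mine, for illustration):

```python
# Gather vs. scatter for a CRS-stored matrix A = (rowptr, colidx, val).
# Gather (y = A*x): one local accumulator per row.
# Scatter (y = A'*x): read-modify-write of a different y entry each step.

def crs_matvec_gather(rowptr, colidx, val, x, nrows):
    """y = A*x: the sum for y[i] stays in one accumulator per row."""
    y = [0.0] * nrows
    for i in range(nrows):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += val[k] * x[colidx[k]]
        y[i] = acc
    return y

def crs_transpose_matvec_scatter(rowptr, colidx, val, x, ncols):
    """y = A'*x: each step updates y[colidx[k]], so the partial sum is
    fetched from and stored back to memory on every iteration."""
    y = [0.0] * ncols
    for i in range(len(rowptr) - 1):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[colidx[k]] += val[k] * x[i]
    return y

# A = [[1, 2],
#      [0, 3]] in CRS
rowptr = [0, 2, 3]
colidx = [0, 1, 1]
val = [1.0, 2.0, 3.0]
x = [10.0, 100.0]
print(crs_matvec_gather(rowptr, colidx, val, x, 2))             # A*x  = [210.0, 300.0]
print(crs_transpose_matvec_scatter(rowptr, colidx, val, x, 2))  # A'*x = [10.0, 320.0]
```

Both loops touch the same nonzeros; only the memory-access pattern of y differs, which is exactly the accumulator-vs-fetch distinction above.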
And I suspect that is the case here, because things even out a bit if
an imaginary component is added to the data the matrix-vector product
is applied to, e.g.:
tic; x = sprand(1e5,1e5,0.0001); toc
Elapsed time is 0.286127 seconds.
tic; x = x + i*x; toc
Elapsed time is 0.0944831 seconds.
tic; y = rand(1e5,1); toc
Elapsed time is 0.00269818 seconds.
tic; y = y + i*y; toc
Elapsed time is 0.00172591 seconds.
tic; x*y; toc
Elapsed time is 0.0214262 seconds.
tic; y'*x; toc
Elapsed time is 0.023473 seconds.
tic; x'*y; toc
Elapsed time is 0.0211461 seconds.
I suppose that in the complex scenario the accumulator can't be used
to great advantage in any of the cases, because each complex
multiply-accumulate has four multiplications and two sums
(real/imaginary) to coordinate. A good optimizing compiler that knows
how many registers are available could probably make the complex case
very efficient, but optimizing compilers generally aren't that good.
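Written out in real arithmetic, the complex multiply-accumulate in the inner loop looks like this (a small illustrative sketch, not Octave's code):

```python
# acc += a*b for complex a and b, in real arithmetic:
# (a_re + i*a_im)*(b_re + i*b_im) = (a_re*b_re - a_im*b_im)
#                                 + i*(a_re*b_im + a_im*b_re)
# i.e. four multiplications plus separate real/imaginary sums, roughly
# doubling the register pressure of the real-valued loop.

def cplx_mac(acc_re, acc_im, a_re, a_im, b_re, b_im):
    acc_re += a_re * b_re - a_im * b_im  # real part: 2 mults, 1 sub
    acc_im += a_re * b_im + a_im * b_re  # imag part: 2 mults, 1 add
    return acc_re, acc_im

# (1+2i)*(3+4i) = 3 - 8 + (4+6)i = -5 + 10i
print(cplx_mac(0.0, 0.0, 1.0, 2.0, 3.0, 4.0))  # (-5.0, 10.0)
```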
Whether the column/row arrangement was different at some point in
Octave/BLAS, I don't know, but perhaps you are using an older version
of either Octave or BLAS for which x'*y is not the efficient direction.
Dan