
Re: Matrix operations on large datasets

From: Jason Stover
Subject: Re: Matrix operations on large datasets
Date: Sat, 31 May 2008 12:43:06 -0400
User-agent: Mutt/1.5.10i

On Fri, May 30, 2008 at 01:58:17PM -0700, Ben Pfaff wrote:
> Ed <address@hidden> writes:
> > (To avoid polluting the standard thread I copied this part across)
> >
> >> 2008/5/30 Ben Pfaff <address@hidden>:
> >> But I'm not sure why you think that supporting large data sets
> >> has to be highly complex.  A lot of it is just Computer Science
> >> 101 type stuff; for example, merge sorts for efficiently sorting
> >> large data sets (which we already implement).  What do you have
> >> in mind?
> >
> > I guess I was thinking of matrix operations in particular. I've been
> > reading the linear regression code as Jason suggested, and the best
> > way to implement linear algebra on a large dataset isn't immediately
> > clear to me.
> You're right.  We do not have any provisions for out-of-core
> linear algebra.  Whether this is an important omission, I will
> have to defer to others.

Out-of-core linear algebra is a big problem, and I don't know if it's
something we want to deal with. Right now, the regression procedure
reads the entire data set, which is not good for large data sets, so
you would think PSPP should be able to do out-of-core linear algebra.
But there are two ways to make a large data set: Lots of cases, or
lots of variables. They should be handled differently.

If we have lots of cases, but not lots of variables, we don't need any
out-of-core linear algebra - we can use GSL's CBLAS or write our own
procedures if they're simple.  Most statistical procedures don't need
to keep the whole data set around, only sufficient statistics like a
covariance matrix. So if we have a data set with n cases and p
variables, and n is large but p is not, then we can pass the data once
to get the sufficient statistics, and keep those in memory to do the
remaining computations. The covariance matrix, for example, will be p
by p, which is small enough to hold in memory, and contains enough
information to estimate the coefficients via least squares. If it's an iterative
procedure where the covariance matrix must be updated, we could just
pass through the data again and update the covariance matrix.

If a data set has n cases and p variables, and p is large, then the
covariance matrix will be enormous, and we won't be able to hold it in
memory. I still argue against trying to add out-of-core linear algebra
for these reasons:

First, developing all those procedures is going to be a lot of work. Any
good linear algebra package takes a lot of coding and time. Take a look
at GSL's CBLAS or LAPACK routines.

Second, out-of-core algorithms will be very slow.

Third, out-of-core algorithms aren't necessary most of the time. They
are only necessary when the number of variables is very large.

Fourth, instances when PSPP would need out-of-core linear algebra are
those in which the user is likely to be doing something they shouldn't
be doing. Why would anyone fit a linear model with 5000 variables?
Many of those variables are correlated, causing conditioning problems
on the matrices, which make computations harder. With 5000 variables
in a linear model, the user probably hasn't bothered to think about
which variables are really explanatory and which just happen to be
there. There would probably be other problems, too. Judgemental as
that may sound, I do still think we should make PSPP able to do the
things users want, when feasible.

But given that such algorithms are slow, not often necessary, hard to
develop and maintain, and likely to be used for inappropriate
analyses, why not just stick with standard in-core algorithms? We can use GSL's
CBLAS all we want, and if we want to optimize a particular procedure,
we can implement an algorithm from Golub and Van Loan.

I believe most procedures will perform fine if they can just pass the
data once and build sufficient statistics along the way. That will
still make PSPP able to handle very large data sets. And if the number
of variables turns out to be a limitation for a lot of users, we can
change the code later anyway.

