Re: musings on performance
From: Ben Pfaff
Subject: Re: musings on performance
Date: Tue, 09 May 2006 06:58:12 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)
John Darrington <address@hidden> writes:
> It would be useful to have some benchmark tests, so that we can see
> the effect of each of them. Also benchmarking would be good for
> marketing purposes.
Of course. (But it can't really be done unless/until we actually
implement them.)
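A benchmark harness for comparing such implementations could start very small: time a procedure over a synthetic dataset and keep the best of several runs. A minimal sketch in Python (the function names are illustrative, not anything in PSPP):

```python
import time

def benchmark(fn, *args, repeats=3):
    """Run fn several times and report the best wall-clock time;
    the minimum is the least noisy summary for micro-benchmarks."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```
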
> On Mon, May 08, 2006 at 10:14:44PM -0700, Ben Pfaff wrote:
> As a trivial example, imagine that we want the mean of 1e12
> values. The master program could break the values into 100
> chunks of 1e10 values each, and tell 100 other computers to each
> find the mean of a single chunk. Those computers would each
> report back their single value and the master program would in
> turn take the mean of those 100 values, yielding the overall
> mean of all 1e12 values.
>
> Do you have access to a 100 machine cluster to test it?
I have access to a smaller cluster of perhaps 25 machines.
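The arithmetic in the quoted example rests on one fact: the mean of per-chunk means equals the overall mean only when the chunks are equal-sized. A hedged sketch of the pattern, using Python threads to stand in for the 100 worker machines (names are illustrative, not PSPP code):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_mean(chunk):
    # Worker: each machine (or thread) computes the mean of its own chunk.
    return sum(chunk) / len(chunk)

def distributed_mean(values, n_workers):
    """Mean by equal-sized chunks: the master splits the data, farms
    each chunk out, and averages the partial results.  This is only
    correct because every chunk has the same size."""
    size = len(values) // n_workers
    assert size * n_workers == len(values), "chunks must be equal-sized"
    chunks = [values[i * size:(i + 1) * size] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(chunk_mean, chunks))
    return sum(partial) / len(partial)
```

For unequal chunks the workers would instead have to report (sum, count) pairs, which the master combines, and that same refinement is what makes the pattern robust in practice.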
> In step 3, we would use threads plus map-reduce or some other
> parallel programming model to improve performance on symmetric
> multiprocessing (SMP) machines.
>
>
> 4. Take #3 and then extend it, by allowing jobs to be farmed out
> not just to threads but to processes running on other
> computers as well. I won't speculate on how much work this
> would be, but it's clearly a big job.
>
> If you implement #3 using MPI, then there's nothing to be done for #4.
>
> However I dabbled in MPI parallel processing a few years ago, and
> struggled to come up with a real-life problem which was large enough
> to overcome the extra overhead.
MPI might be what we want, but I'm not sure. At this point it's
just speculation.
> Some of the math for statistical procedures would need to be
> parallelised inside gsl --- I don't know if gsl supports parallel
> execution. For example, most matrix operations can be parallelised
> which is worth doing for very large matrices.
Good point.
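Matrix operations parallelise naturally because rows can be processed independently. A sketch of a row-blocked matrix-vector product, in Python for brevity (the real work would be C, inside or alongside GSL; all names here are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def matvec_rows(matrix, vector, lo, hi):
    # Worker: dot products of rows lo..hi-1 with the vector.
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix[lo:hi]]

def parallel_matvec(matrix, vector, n_workers=2):
    """Row-blocked matrix-vector product: each worker handles a band of
    rows independently, so no synchronization is needed until the
    per-band results are concatenated at the end."""
    n = len(matrix)
    step = (n + n_workers - 1) // n_workers
    bounds = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda b: matvec_rows(matrix, vector, *b), bounds)
    result = []
    for part in parts:
        result.extend(part)
    return result
```

As the quote says, this only pays off for large matrices; for small ones the thread overhead dominates.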
> I'm acutely aware that in some places the code performs very badly.
> In particular, operations which involve percentiles are currently
> implemented in a very non-optimal manner, and in fact, will probably
> exhaust memory if passed very large data sets.
Code can always be optimized.
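The memory problem with percentiles comes from holding the entire dataset in order to sort it. For low (or, symmetrically, high) percentiles, memory can be bounded by keeping only the k smallest values seen so far. A sketch using the nearest-rank definition (this is an illustration, not how PSPP implements percentiles):

```python
import heapq
import math

def percentile_low_memory(stream, p, n):
    """Return the p-th percentile (0 < p <= 1, nearest-rank definition)
    of a stream of n values, keeping only the k = ceil(p * n) smallest
    values in memory.  Cheap when p is small; for p near 1, apply the
    same trick from the top using the k largest values instead."""
    k = max(1, math.ceil(p * n))
    # heapq.nsmallest scans the stream once, maintaining a bounded
    # structure of at most k elements.
    return heapq.nsmallest(k, stream)[-1]
```
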
> Also, there are opportunities to cache things that procedures use.
> Eg: most parametric procedures make use of the data's covariance
> matrix. If we can let that persist between procedures, that will
> avoid a lot of calculations being repeated; just so long as we
> invalidate that cache when appropriate.
Yes, I forgot to put that in my list. It's probably parallel to
item #2.
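The cache-with-invalidation idea above can be sketched with a version counter: any transformation that touches the data bumps the version, which silently invalidates every cached result. A hedged Python illustration (the class and its naive covariance routine are made up for this sketch):

```python
class CovarianceCache:
    """Cache a dataset's covariance matrix across procedures, keyed by
    a version counter that data transformations must bump."""

    def __init__(self):
        self._version = 0
        self._cached = None      # (version, covariance matrix) or None
        self.computations = 0    # for illustration: count real computations

    def data_changed(self):
        # Called whenever the data is modified (e.g. by a transformation).
        self._version += 1

    def covariance(self, data):
        if self._cached is not None and self._cached[0] == self._version:
            return self._cached[1]          # cache hit: reuse the matrix
        self.computations += 1
        cov = self._compute(data)           # cache miss: recompute
        self._cached = (self._version, cov)
        return cov

    def _compute(self, data):
        # Naive sample covariance of columns; stands in for the real code.
        n, m = len(data), len(data[0])
        means = [sum(col) / n for col in zip(*data)]
        return [[sum((row[i] - means[i]) * (row[j] - means[j])
                     for row in data) / (n - 1)
                 for j in range(m)] for i in range(m)]
```
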
> It's not going to be much use having a PSPP which can copy data from A
> to B at the speed of light, if any procedures take a year to execute.
I think the goal should really be to avoid copying data where
possible. Presumably, in the multi-machine case, the data should
be stored on a network server or distributed among the machines
on local disks, not copied from a master machine to the many
machines doing the computation.
--
"Mon peu de succès près des femmes est toujours venu de les trop aimer."
--Jean-Jacques Rousseau
Follow-ups:
  Re: musings on performance, Jason Stover, 2006/05/09
  Re: musings on performance, Ben Pfaff, 2006/05/15