Re: Threads vs. OpenMP in PSPP.
Mon, 02 Apr 2007 08:53:21 -0700
"John McCabe-Dansted" <address@hidden> writes:
> I was discussing introducing threading into PSPP. Later I came
> across OpenMP, and I am thinking that OpenMP might be a better
> choice than explicit management of threads for parallelism in PSPP.
I don't think I saw the discussion of adding threading to PSPP.
Was it on pspp-dev?
What is the purpose of adding threading? I can think of two
possibilities: to separate the UI from data processing to improve
the interactive experience, or to break data processing into
multiple threads to increase data processing throughput.
In the latter case, I posted some ideas for data processing
performance improvement on pspp-dev some time ago. The most
realistic of these are filed as bug #17241, but I'm appending the
full contents of my message below. You can find the full
discussion thread in the archives at:
Over the last few days I've been thinking about the performance
story for PSPP. There are lots of low-level
"micro-optimizations" we can do. Mostly, those are not
interesting to me, although some may actually be valuable. In
fact, I've been steadily removing micro-optimizations as I clean
up code, or at least the ones that make the code less readable
for questionable benefit.
That said, there are numerous optimizations that are likely to
have real benefit, that we should consider implementing as time
goes on. The ones that I've come up with so far are below.
They're roughly in order of increasing difficulty and benefit,
and each one tends to depend on the previous.
1. Eliminate the need to cache procedure output. Currently,
every procedure reads the entire active file and writes out a
new copy of it. For big data sets that's often a huge waste
of time and space.
This was almost impossible to get right before I refactored
the code for implementing procedures. Now, it should only be
a few days of work.
2. Implement an optimizer for transformations. SELECT IF can
often be moved earlier without changing any semantics.
Creation of variables whose values are never used can be
dropped entirely. I suspect that this is especially valuable
after #1 is implemented because #1 will retain any
non-temporary transformations from one procedure to the next
for re-execution, whereas currently that does not happen.
Doing a basic job of this should be a few days of work. It's
probably not worth putting more into it than that (and maybe
not that much) unless it is demonstrably useful.
2a. Implement caching of data between procedures.
3. (This is where it gets more interesting, in my opinion.)
Transformations are almost completely parallelizable on a
case-by-case basis. That is, any given case can be
transformed without knowledge of any other case. (Only two
exceptions come to mind, off-hand, which are the LAG function
and the $CASENUM variable in expressions. Probably there are
a few others.)
Furthermore, I suspect that (and Jason can correct me on this)
much statistical analysis is susceptible to efficient parallel
implementation via "map-reduce", which is a two-step model:
i. "Map" - A master program partitions the data set into
arbitrary sized collections of cases ("chunks"). It
hands each chunk to a different CPU or computer, which
processes it in some specified manner and hands the
result (in whatever format) back to the master.
ii. "Reduce" - The master program combines the results,
in some manner, to form a single output.
As a trivial example, imagine that we want the mean of 1e12
values. The master program could break the values into 100
chunks of 1e10 values each, and tell 100 other computers to each
find the mean of a single chunk. Those computers would each
report back their single value and the master program would in
turn take the mean of those 100 values, yielding the overall
mean of all 1e12 values.
(Google is responsible for popularizing the map-reduce model.
If you haven't read the paper on it, I'd recommend doing so:
It is inspirational.)
In step 3, we would use threads plus map-reduce or some other
parallel programming model to improve performance on symmetric
multiprocessing (SMP) machines.
This is a good deal of work. The basic implementation of the
map-reduce model and support code (e.g. casefile partitioning
and thread-safe access to data) would take a while. Plus,
we'd have to modify statistical procedures to take advantage
of it. But it could potentially yield a big benefit.
4. Take #3 and then extend it, by allowing jobs to be farmed out
not just to threads but to processes running on other
computers as well. I won't speculate on how much work this
would be, but it's clearly a big job.
If we implemented all of these, or even just through #3, I bet
PSPP would be one of the best things out there for dealing with
large amounts of data.
Let me add that this is all really just a musing. I can't
imagine when I'll get a chance to work on it. Oh, except that #1
is on my list of things to do whenever I get a chance. The
others are just speculative.