[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Optimisation of statistic calculations
From: |
Ben Pfaff |
Subject: |
Re: Optimisation of statistic calculations |
Date: |
Wed, 03 Nov 2004 09:55:08 -0800 |
User-agent: |
Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux) |
John Darrington <address@hidden> writes:
> As we add more commands to PSPP, there becomes considerable repetition
> of code involving the calculation of common statistics eg, sum ,
> sum-of-squares, variance etc. and an add hoc approach can and has lead
> to the same calculations being unnecessarily repeated, which will make
> PSPP slower than it needs to be.
>
> The way it's going, PSPP is going to bloat, and have large chunks of
> disjoint code, duplicating the same basic algorithms time and time again.
>
> So I'm looking at introducing a framework to allow the optimal reuse
> of statistics already calculated. This will involve each statistic
> featuring in a DAG, and an engine which traverses the DAG in
> postorder. In fact, 2 DAGs will be required, because some stats
> require others to be completely precalculated.
I think there are two separate issues here: redundant code and
redundant time. It is fairly easy, in my experience, to deal
with redundant code. Most programmers are pretty good at
identifying when the code they're thinking about writing may
already exist and then looking around for it and, if necessary,
generalizing it a little so that it can be used for both
purposes. We have a few examples of this in PSPP already,
e.g. levene.c, moments.c, sort.c, and of course we're now trying
to take advantage of GSL wherever we can. I'm pretty sure that
we can continue along this path without much trouble.
Redundant time, i.e. calculating the same thing more than once
for a single variable or data set, seems to me like a separate
issue that needs to be addressed separately. I haven't thought
about this much at all. Your DAG sounds like a pretty general
scheme for addressing this. I assume that it's basically a
dependency graph: to calculate C we need to know A and B. But
it's not obvious to me why we need two DAGs; a Makefile is the
same kind of dependency graph as far as I can tell and there's
only one DAG involved in that case. Can you explain further?
--
"Platonically Evil Monkey has been symbolically representing the darkest
fears of humanity since the dawn of literature and religion, and I think
I speak for everyone when I give it a sidelong glance of uneasy recognition
this evening." --Scrymarch