pspp-dev

Re: Optimisation of statistic calculations


From: Ben Pfaff
Subject: Re: Optimisation of statistic calculations
Date: Wed, 03 Nov 2004 09:55:08 -0800
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

John Darrington <address@hidden> writes:

> As we add more commands to PSPP, there becomes considerable repetition
> of code involving the calculation of common statistics, e.g. sum,
> sum-of-squares, variance, etc., and an ad hoc approach can and has led
> to the same calculations being unnecessarily repeated, which will make
> PSPP slower than it needs to be.
>
> The way it's going, PSPP is going to bloat, and have large chunks of
> disjoint code, duplicating the same basic algorithms time and time again.
>
> So I'm looking at introducing a framework to allow the optimal reuse
> of statistics already calculated.  This will involve each statistic
> featuring in a DAG, and an engine which traverses the DAG in
> postorder.  In fact, 2 DAGs will be required, because some stats
> require others to be completely precalculated.

I think there are two separate issues here: redundant code and
redundant time.  It is fairly easy, in my experience, to deal
with redundant code.  Most programmers are pretty good at
identifying when the code they're thinking about writing may
already exist and then looking around for it and, if necessary,
generalizing it a little so that it can be used for both
purposes.  We have a few examples of this in PSPP already,
e.g. levene.c, moments.c, sort.c, and of course we're now trying
to take advantage of GSL wherever we can.  I'm pretty sure that
we can continue along this path without much trouble.

Redundant time, i.e. calculating the same thing more than once
for a single variable or data set, seems to me like a separate
issue that needs to be addressed separately.  I haven't thought
about this much at all.  Your DAG sounds like a pretty general
scheme for addressing this.  I assume that it's basically a
dependency graph: to calculate C we need to know A and B.  But
it's not obvious to me why we need two DAGs; a Makefile is the
same kind of dependency graph as far as I can tell and there's
only one DAG involved in that case.  Can you explain further?
-- 
"Platonically Evil Monkey has been symbolically representing the darkest 
 fears of humanity since the dawn of literature and religion, and I think
 I speak for everyone when I give it a sidelong glance of uneasy recognition 
 this evening." --Scrymarch



