
Re: build problems

From: Ed
Subject: Re: build problems
Date: Tue, 27 May 2008 02:19:27 +0100

At the start I was mostly interested in adding stats functions. For
example, I have written various ANOVA/ANCOVA/GLM routines to analyse
my own experimental results (they'd have to be rewritten, but at least
I'd only do it once instead of rolling my own every time I start a new
project). I also have some things like multi-dimensional scaling and
various classifiers lying around. And I have an implementation of
affinity clustering from the paper in Nature last year (I wanted to
improve the space bound on their algorithm, but it turned out they
"underplayed" the relative importance of some of the features, so my
efforts to reduce the memory usage stalled).

I started by reading src/language/stats/oneway.q to see how the
existing ANOVA is done, but what strikes me is that it would be very
time-consuming and inefficient to go through 1000+ lines of code (plus
more from preprocessing) for each new ANOVA-like function. Most of the
code there seems to fall into one of four categories:

1/ table generation
2/ argument parsing
3/ statistic generation
4/ infrastructure

Ideally, a new function would consist almost entirely of 3.
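
To make category 3 concrete, here is a minimal sketch of what a "pure
statistic" routine could look like in isolation -- the one-way ANOVA F
statistic with no parsing or table code attached. The function name and
interface are hypothetical, not anything from PSPP's sources:

```c
#include <stddef.h>

/* Hypothetical category-3 routine: one-way ANOVA F statistic for k
   groups, where groups[i] holds n[i] observations. */
static double
oneway_f (const double *const *groups, const size_t *n, size_t k)
{
  size_t total_n = 0;
  double grand_sum = 0.0;
  for (size_t i = 0; i < k; i++)
    {
      total_n += n[i];
      for (size_t j = 0; j < n[i]; j++)
        grand_sum += groups[i][j];
    }
  double grand_mean = grand_sum / total_n;

  /* Between-group and within-group sums of squares. */
  double ss_between = 0.0, ss_within = 0.0;
  for (size_t i = 0; i < k; i++)
    {
      double sum = 0.0;
      for (size_t j = 0; j < n[i]; j++)
        sum += groups[i][j];
      double mean = sum / n[i];
      ss_between += n[i] * (mean - grand_mean) * (mean - grand_mean);
      for (size_t j = 0; j < n[i]; j++)
        ss_within += (groups[i][j] - mean) * (groups[i][j] - mean);
    }

  /* F = MS_between / MS_within. */
  double ms_between = ss_between / (k - 1);
  double ms_within = ss_within / (total_n - k);
  return ms_between / ms_within;
}
```

Everything else -- reading the syntax, laying out the results table --
would ideally be shared machinery wrapped around routines like this.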

For 1, I wonder if some kind of template system would work: a template
language in which you can define table layouts, with suitable field
names or whatever, so that the code can just call
make_table(template_file, heres_my_data).
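
As a toy illustration of the idea (make_table and the {name} placeholder
syntax here are entirely made up for the sketch), substitution into a
layout string might look like:

```c
#include <string.h>

struct field { const char *name; const char *value; };

/* Hypothetical sketch: a table "template" is a string containing
   {name} placeholders; make_table() substitutes values by field name,
   so a command only has to supply its numbers, not its layout. */
static void
make_table (const char *tmpl, const struct field *fields, size_t n_fields,
            char *out, size_t out_size)
{
  size_t pos = 0;
  const char *p = tmpl;
  while (*p && pos + 1 < out_size)
    {
      const char *end;
      if (*p == '{' && (end = strchr (p, '}')) != NULL)
        {
          /* Look up the field named between the braces. */
          size_t len = end - p - 1;
          const char *value = "";
          for (size_t i = 0; i < n_fields; i++)
            if (strlen (fields[i].name) == len
                && !strncmp (fields[i].name, p + 1, len))
              value = fields[i].value;
          while (*value && pos + 1 < out_size)
            out[pos++] = *value++;
          p = end + 1;
        }
      else
        out[pos++] = *p++;
    }
  out[pos] = '\0';
}
```

A real version would read the template from a file and emit proper
table structures rather than a string, but the shape is the same.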

For 2, I don't really like generating per-command parsers with
preprocessing unless the parser is very sophisticated. I once worked
on a commercial codebase that had a centralised lexer/parser. To add a
new function, as far as I remember, you basically defined a new
function token in the parser and a new routine for it to call
somewhere else. Arguments were handled by stack magic, I think; I'm
not advocating that exactly, but something along these lines is
definitely possible, and it greatly reduces the per-function overhead.
Another possibility is bison (which again decouples things).
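
The "one central parser, one routine per function" shape can be
sketched as a dispatch table. This is not how PSPP's parser works --
all the names below are invented for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical centralised dispatcher: one table maps command names
   to handlers, so adding a command means adding one row here and one
   function elsewhere; tokenising lives in a single place. */
typedef double (*stat_fn) (const double *args, size_t n);

static double
cmd_mean (const double *args, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += args[i];
  return n ? sum / n : 0.0;
}

static const struct { const char *name; stat_fn fn; } commands[] = {
  { "MEAN", cmd_mean },
};

/* Parse "NAME x y z ..." and dispatch; returns 0 on unknown command. */
static int
run_command (const char *line, double *result)
{
  char buf[256], *name, *tok;
  double args[32];
  size_t n = 0;

  snprintf (buf, sizeof buf, "%s", line);
  name = strtok (buf, " ");
  if (name == NULL)
    return 0;
  while ((tok = strtok (NULL, " ")) != NULL && n < 32)
    args[n++] = atof (tok);

  for (size_t i = 0; i < sizeof commands / sizeof *commands; i++)
    if (!strcmp (commands[i].name, name))
      {
        *result = commands[i].fn (args, n);
        return 1;
      }
  return 0;
}
```

A bison grammar would give you the same decoupling with real syntax
(subcommands, keyword arguments) instead of this flat token list.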

For 4, it would take someone much more familiar with the codebase to
know how to reduce the amount of marshalling, piping, and command
callouts. In seemingly similar situations I've been in before, more
powerful (more specific) driver routines solved these problems (i.e.
the routines had wrappers that did all the necessary infrastructure,
assuming the values had been calculated correctly).
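
What I mean by a "powerful driver" is roughly this shape (again, every
name here is hypothetical): the wrapper owns the infrastructure -- data
access, output -- and a new statistic supplies only a pure callback:

```c
#include <stdio.h>

struct dataset { const double *values; size_t n; };

/* A new statistic is just a pure function of the data. */
typedef double (*compute_fn) (const struct dataset *);

/* Hypothetical driver: does all the surrounding infrastructure
   (here just formatting a result line, standing in for data access
   and table output) around one pure compute callback. */
static double
run_analysis (const struct dataset *ds, compute_fn compute,
              const char *label, char *out, size_t out_size)
{
  /* ... a real driver would open/validate data and set up output ... */
  double value = compute (ds);
  snprintf (out, out_size, "%s = %.2f", label, value);
  return value;
}

/* Example callback: just sums the data. */
static double
sum_stat (const struct dataset *ds)
{
  double s = 0.0;
  for (size_t i = 0; i < ds->n; i++)
    s += ds->values[i];
  return s;
}
```

With something like this, the per-function code shrinks to the
callback plus a table template, which is where I'd like it to be.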

These are just my thoughts at the moment; obviously it'll take me
quite a while to become familiar with the codebase, and my estimates
so far are probably uneducated.


2008/5/27 John Darrington <address@hidden>:
> On Mon, May 26, 2008 at 02:56:47AM +0100, Ed wrote:
>     The situation is pretty simple. I thought I'd see if I could
>     contribute to PSPP, and the first step is to pull CVS to make sure
>     you're looking at the newest code.
> That's good to hear.  Is there any particular area that you're
> interested in working on?
> J'
> --
> PGP Public key ID: 1024D/2DE827B3
> fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
