pspp-dev
Re: Covariance Matrix


From: Jason Stover
Subject: Re: Covariance Matrix
Date: Mon, 28 Sep 2009 12:25:11 -0400
User-agent: Mutt/1.5.18 (2008-05-17)

On Sun, Sep 27, 2009 at 02:44:35PM +0800, John Darrington wrote:
> I've been trying to hack up a reliable CORRELATIONS command.
> It seemed logical to me to use the existing covariance-matrix.c in
> src/math.  Whilst this worked for simple examples, it gave completely
> wrong answers (or at least different from the spss examples I could 
> find) when presented with either a) missing values; or b) non-unity
> caseweights.

This is something I've been meaning to fix, but haven't gotten to it.

> I've started digging into covariance-matrix.c in order to fix the
> problems, but the exercise is slowly evolving into a complete rewrite
> of the module, and in doing so, I'm not confident that I'm not 
> breaking some of the functionality which correlations doesn't use 
> (notably interactions).
> 
> Of course, we could just go with 2 implementations of covariance-matrix
> - one which works with interactions, but not with missing values, and
> the other works with missing values, but doesn't support interactions,
> but I don't think this is a sensible way to proceed.
> 
> As things stand right now, I don't think there's anything in master which 
> actually uses interactions, but I don't know how far Jason has got with
> GLM.

Still working on it, though slowly.

> So I guess the question is what is the best way to proceed?  As I see it,
> the options are:
> 
> a) Fork the covariance matrix implementation and have 2 different ones
> in master - not a good idea I think.
> 
> b) Get a working CM implementation which properly handles missing values 
> and case-weights.   When this is working, we can add support for interactions
> later.
> 
> c) Get the CM working properly with all features, including weights, missing
> values and interactions, before proceeding further.
> 
> 
> My opinion is that we should go for (b), but that's largely motivated by
> personal interests, and I don't know how it's going to affect other
> developers.
> 
> Comments ?

I don't want to drop support for interactions entirely, but I think
the way it's done now is ugly and cumbersome. It wouldn't be so
ugly if done in two data passes instead of one. So how about this:
Start working on (b), but with an algorithm that works in two data
passes. I'll add the interactions to the two-pass algorithm later.

In fact, the easiest way to handle interactions might be to use a 
simple two-pass algorithm inside another module written specifically
for computing covariances with interactions. That would keep the
actual computation of the covariance cleaner, and push elsewhere
the ugly computations involving the interactions. 

So, the way I'm thinking now is that we should have something like this:

struct interaction_covariance
{
        ...blah blah
        struct covariance_matrix *cov;   /* member name is illustrative */
};

instead of this (what we have now):

struct covariance_matrix
{
        ...blah blah
        const struct interaction_variable **interactions;
};

Then the code in covariance-matrix.c doesn't need to know anything
about interactions. But it will have to know about categorical
variables.  For a two-pass algorithm, I don't think that will be a big
deal.
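
A minimal sketch of that layering in C, to make the direction of the
dependency concrete. Everything here is an assumption for illustration:
the fields of struct covariance_matrix, the n_interactions member, and
the accessor interaction_covariance_get are hypothetical and are not
taken from the PSPP sources. The point is only that the wrapper knows
about interactions while the wrapped matrix does not.

```c
#include <stddef.h>

/* Hypothetical stand-in for the plain covariance matrix.  It stores
   an n_vars x n_vars matrix in row-major order and knows nothing
   about interactions. */
struct covariance_matrix
{
  size_t n_vars;
  double *values;
};

/* Left opaque here; only the wrapper below refers to it. */
struct interaction_variable;

/* The wrapper owns the interaction bookkeeping and delegates the
   actual covariances to the plain matrix. */
struct interaction_covariance
{
  const struct interaction_variable **interactions;
  size_t n_interactions;
  struct covariance_matrix *cov;
};

/* Hypothetical accessor: element (i, j) of the wrapped matrix. */
double
interaction_covariance_get (const struct interaction_covariance *ic,
                            size_t i, size_t j)
{
  return ic->cov->values[i * ic->cov->n_vars + j];
}
```

With this shape, a caller that does not use interactions can work with
struct covariance_matrix directly and never touch the wrapper.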

Much of the current ugliness in covariance-matrix.c is due to its
single data pass. I thought the benefit of one data pass would offset the
ugliness. But one data pass isn't as numerically stable as two, and the
ugly hash tables and the convoluted code make the current
covariance-matrix.c a wreck.
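
For reference, a two-pass computation for a single pair of variables
might look roughly like the sketch below. It is not the code in
covariance-matrix.c. The function name, the use of NaN to stand for a
missing value, pairwise deletion of incomplete cases, and the
denominator of (sum of weights - 1), which treats case weights as
frequencies, are all assumptions made for the example.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not PSPP's API: weighted covariance of X and Y
   over N cases, computed in two passes.  A case is skipped when
   either value is NaN (standing in for a missing value) or when its
   weight is not positive.  Returns NaN if the usable weights do not
   sum to more than 1. */
double
two_pass_covariance (const double *x, const double *y,
                     const double *w, size_t n)
{
  double sum_w = 0.0, mean_x = 0.0, mean_y = 0.0, cross = 0.0;
  size_t i;

  /* Pass 1: weighted means over the usable cases. */
  for (i = 0; i < n; i++)
    {
      if (isnan (x[i]) || isnan (y[i]) || !(w[i] > 0.0))
        continue;
      sum_w += w[i];
      mean_x += w[i] * x[i];
      mean_y += w[i] * y[i];
    }
  if (!(sum_w > 1.0))
    return NAN;
  mean_x /= sum_w;
  mean_y /= sum_w;

  /* Pass 2: weighted cross-products of deviations from the means. */
  for (i = 0; i < n; i++)
    {
      if (isnan (x[i]) || isnan (y[i]) || !(w[i] > 0.0))
        continue;
      cross += w[i] * (x[i] - mean_x) * (y[i] - mean_y);
    }

  return cross / (sum_w - 1.0);
}
```

Because the second pass works with deviations from the means, it avoids
subtracting two large, nearly equal sums, which is where the usual
one-pass formula (sum of products minus product of sums) loses
precision when the values are large relative to their spread.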

So I would say: Rewrite away. If possible, please keep support for
categorical variables. But as I said, I don't think that will be hard
to add later for a two-pass algorithm. Ugly hashing will be unnecessary.

I wish I hadn't been so stubborn about having a one-pass algorithm
in the first place. 



