pspp-dev

Re: covariance test success


From: Jason Stover
Subject: Re: covariance test success
Date: Fri, 20 Nov 2009 16:57:24 -0500
User-agent: Mutt/1.5.18 (2008-05-17)

On Fri, Nov 20, 2009 at 03:44:11PM -0500, Jason Stover wrote:
> So when do we merge? 

Not yet, I think.

I was just looking at what needs to be done to make interactions
possible for the GLM procedure. I also discussed this with Ben
via IRC. 

It seems that adding the interactions is going to be trickier than
just fixing the code in interaction.c. An interaction for us is just a
product of values of two or more variables. So, for example, if var1
and var2 interact, we would need to compute all possible combinations
of values of var1 and var2. Each of these combinations would go into
computing the covariance matrix, just as any other values would.

So an "interaction" must be like a variable, in that it has at least
one column in a covariance matrix.

Next, if var1 and var2 are numeric, a "combination" of their values is
just their product. This is easy to compute as we pass the data. So to
include the interaction of var1 and var2 in the covariance matrix, we
would just make a new variable, pass that to the constructor for the
covariance matrix, and for each case in our data-reading loop, compute
the product of the values of var1 and var2, append that to the case,
and send that case along to covariance_accumulate_pass[12].
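
To make the numeric case concrete, here is a minimal sketch of the
per-case computation. The function name interaction_product is made up
for illustration and is not part of the existing code; it only shows
the product that would be appended to the case, and it ignores missing
values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PSPP: returns the product of the
   numeric values that take part in one interaction, for one case.  */
double
interaction_product (const double *values, size_t n_values)
{
  double product = 1.0;

  /* An interaction involves two or more variables.  */
  assert (n_values >= 2);
  for (size_t i = 0; i < n_values; i++)
    product *= values[i];
  return product;
}
```

The result would be treated as one more numeric column when the case
is handed to the covariance code.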

The complication enters if var1 is categorical and var2 is
numeric. Then, instead of having bit-vectors as computed in
category.c, we would need the bit vector from var1 multiplied by the
numeric value from var2. So for example, if we
encountered var1's value 'a', encoded that as (0 0 1 0), and a 2.2 for
var2, then we would need to use (0 0 2.2 0) in the computation of the
covariance matrix. This raises some obvious questions about what that
interaction should be: It can't be a variable because it has both
categorical and numeric attributes. How should it be appended to the
case being read? How should covariance.c deal with it?
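
As a rough sketch of the arithmetic only (it does not answer where the
result should live), the scaled vector could be built as below. The
name scaled_indicator and the idea that the category has already been
mapped to a slot index are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PSPP: writes the bit vector for
   CATEGORY_INDEX into OUT, with the single 1 replaced by the numeric
   value X.  OUT must have room for N_SLOTS doubles.  */
void
scaled_indicator (size_t category_index, size_t n_slots, double x,
                  double *out)
{
  assert (category_index < n_slots);
  for (size_t i = 0; i < n_slots; i++)
    out[i] = 0.0;
  out[category_index] = x;
}
```

With category_index 2, n_slots 4 and x equal to 2.2, this fills OUT
with (0 0 2.2 0), as in the example above.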

There is a further complication if both var1 and var2 are
categorical. Now we must encode the interaction as a bit vector for
its use in computing the covariance. So for example, if we see 'a' for
var1 and 'b' for var2, we should encode that as, say, (0 0 1 0 0
0 0). Now if we have n categories for var1 and m categories for var2,
then we would have n*m categories for var1 interacting with var2,
which means we would need a bit vector of length n*m - 1 to handle the
interaction between var1 and var2. Where should this be stored? Maybe
some function to smash the two values together and append it to the
case being read? I don't know.
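
One possible encoding, sketched under assumptions: number the
categories of var1 as 0..n-1 and those of var2 as 0..m-1, give the
pair (i, j) the index i*m + j, and treat the last pair as the
reference level that is encoded as all zeros. The function name and
the choice of reference level are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PSPP: encodes the pair of category
   indexes (I, J) as a bit vector of length N*M - 1 in OUT.  The last
   pair, (N-1, M-1), is the assumed reference level and is encoded as
   all zeros.  */
void
interaction_indicator (size_t i, size_t n, size_t j, size_t m,
                       double *out)
{
  assert (i < n && j < m);

  size_t len = n * m - 1;
  size_t idx = i * m + j;       /* Position of (I, J) among N*M pairs.  */

  for (size_t k = 0; k < len; k++)
    out[k] = 0.0;
  if (idx < len)
    out[idx] = 1.0;
}
```

With n = 2 and m = 4, the pair (0, 2) gives a vector of length 7 with
a 1 in the third position.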

Here is a further complication: The user could specify any number of
variables in an interaction. So instead of var1 interacting with var2,
the user could specify var1, var2, ..., vark all interacting
together. This would be a bad idea for most experimental designs, but
it is computationally just fine.
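
For k categorical variables, the combined category could be numbered
the same way a mixed-radix number is. This is only a sketch: the name
combined_category_index is invented, and it assumes each variable's
value has already been mapped to an index below its category count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PSPP: maps one category index per
   variable (LEVELS) to a single index in the range
   0 .. N_LEVELS[0] * ... * N_LEVELS[N_VARS - 1] - 1.  */
size_t
combined_category_index (const size_t *levels, const size_t *n_levels,
                         size_t n_vars)
{
  size_t index = 0;

  for (size_t v = 0; v < n_vars; v++)
    {
      assert (levels[v] < n_levels[v]);
      /* Shift by this variable's category count, then add its level.  */
      index = index * n_levels[v] + levels[v];
    }
  return index;
}
```

The number of combined categories is the product of the category
counts, so it grows quickly with k.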

So the question of how to make interactions seems difficult because
its answer must involve reading cases, computing new variables, and
encoding vectors from strings and numeric values. I'm asking how to
do this here, because the last time I tried it I made a mess. But it
is important, and necessary for a GLM command and many other modeling
procedures.

Any suggestions? (John, you want to just code this up over lunch?)


> 
> And what to do next? Here is a list of tasks that
> stem from having the new covariance.[ch]:
> 
> 1. Change linreg.c, coefficient.c and regression.q to use the new covariance
> routines. 
> 
> 2. Drop src/data/category.c and covariance-matrix.[ch].
> 
> 3. Rewrite interaction.c to use covariance.c.
> 
> I would prefer to finish a GLM before changing linreg.c too much, but
> I'm afraid doing so will just make more work later. Also, linreg.c
> will have to be changed to use the new covariance struct anyway, and
> doing so without dropping its current behavior of using the entire
> data set would make it a lot uglier in the meantime.
> 



