recoding categorical variables
Wed, 22 Jun 2005 16:50:57 +0000
I'm in the middle of regression.q and an issue has come up
that is important to every modeling routine in PSPP: recoding
categorical variables into vectors.
Modeling procedures require categorical variables to be recoded as
vectors with binary entries. Such procedures include regression,
generalized linear models, multivariate procedures, and just about
every other procedure that handles categorical data. During
estimation, these vectors are stored as sub-rows of a matrix.
To illustrate the idea, here is an example from a regression problem: Say
Y = b0 + b1 * x1 + (an effect due to a categorical variable x2 with 3
categories) + er,
where er is normally distributed 'noise', x1 and x2 are known independent
variables and b0, b1 and 'some effect for x2' are unknown.
Now let's say x1 is a numeric variable, and x2 can take values 'a', 'b' or 'c'.
To find the least squares estimate of b0, b1 and 'some effect for x2', the model
is made explicit this way:
Y = b0 + b1 * x1 + b2 * z1 + b3 * z2 + er
where z1 is 1 if x2 is 'a' and 0 otherwise, and z2 is 1 if x2 is 'b' and 0
otherwise, so the value 'c' corresponds to z1 and z2 both being 0. Now if we have
a data set like this:
Y    x1    x2
2.3  -1    a
2.1  -3    a
1.2  -2.3  b
1.9  -1.9  b
2.1  -1    c
2.2  -2    c
The least-squares estimates for b0, b1, b2 and b3 are found by making
this 'design matrix':
1  -1    1  0
1  -3    1  0
1  -2.3  0  1
1  -1.9  0  1
1  -1    0  0
1  -2    0  0
The first column of the matrix corresponds to b0, the second records
the values of x1 and the last two columns record the value of x2. Since
these computations require matrix inversion and other matrix operations,
I have used the gsl_matrix data type so far.
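
As a rough illustration of the coding described above, here is a minimal sketch in plain C that fills the design matrix for this example. It is not code from regression.q: the names (x2_levels, level_index, fill_design_row, build_design_matrix) are made up for this sketch, and a plain two-dimensional array stands in for the gsl_matrix so that the example has no library dependency. It assumes the last level listed ('c') is the reference level that maps to all zeros.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_COLS 4     /* intercept, x1, z1, z2 */
#define N_LEVELS 3   /* levels of x2 */

/* Hypothetical level table for x2. The last level ('c') is treated as
   the reference level, so it gets no indicator column of its own. */
static const char *const x2_levels[N_LEVELS] = { "a", "b", "c" };

/* Return the index of VALUE in x2_levels, or -1 if it is not a known level. */
static int level_index(const char *value)
{
  int k;
  for (k = 0; k < N_LEVELS; k++)
    if (strcmp(value, x2_levels[k]) == 0)
      return k;
  return -1;
}

/* Fill one row of the design matrix as [1, x1, z1, z2].
   Returns 0 on success, -1 if X2 is not a known level. */
static int fill_design_row(double x1, const char *x2, double row[N_COLS])
{
  int j;
  int k = level_index(x2);
  if (k < 0)
    return -1;
  row[0] = 1.0;   /* column for b0 */
  row[1] = x1;    /* column for b1 */
  for (j = 0; j < N_LEVELS - 1; j++)
    row[2 + j] = (j == k) ? 1.0 : 0.0;   /* indicator columns z1, z2 */
  return 0;
}

/* Build all N rows. Returns 0 on success, -1 on the first unknown level. */
static int build_design_matrix(const double x1[], const char *const x2[],
                               size_t n, double design[][N_COLS])
{
  size_t i;
  assert(x1 != NULL && x2 != NULL && design != NULL);
  for (i = 0; i < n; i++)
    if (fill_design_row(x1[i], x2[i], design[i]) != 0)
      return -1;
  return 0;
}
```

Called with the six x1 and x2 values from the data set, build_design_matrix should reproduce the matrix shown above. With GSL, the assignments to row[] would presumably become calls to gsl_matrix_set on the corresponding row and column.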
After the least squares fit, to report results we must transform back
from the design matrix to the original values of x2 ('a', 'b' or 'c').
This means we need a way to jump back and forth between the original
categorical variable and its sub-row within the gsl_matrix. And there
could be many categorical variables corresponding to different
collections of sub-rows. I know oneway.q uses hashing, but a one-way
ANOVA requires only computation of means rather than the more general
matrix formulation required for regression or other procedures. So it
will be necessary to create recoding routines that do more than
hashing. Those routines should, if possible, be made to work with a
regression procedure, a mixed linear models procedure, a multivariate
procedure, a generalized linear models procedure, and any other
procedure which must encode categorical variables as matrices (which
means many, many procedures).
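
To make the bookkeeping concrete, here is one possible shape for such a recoding record, sketched in plain C. Everything in it is hypothetical (struct cat_recode, cat_encode, cat_decode, cat_label_for_column, cat_find_owner, and the MAX_LEVELS limit); it is not an existing PSPP interface. The sketch assumes each categorical variable owns a contiguous run of columns in a row, that values are strings, and that the last level is the all-zeros reference level. A plain double array stands in for a row of the gsl_matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEVELS 8   /* arbitrary limit for this sketch */

/* Hypothetical record tying one categorical variable to its sub-row. */
struct cat_recode
{
  const char *name;                /* variable name, e.g. "x2" */
  size_t first_col;                /* first column of its sub-row */
  size_t n_levels;                 /* number of distinct values */
  const char *levels[MAX_LEVELS];  /* values; the last is the reference */
};

/* Number of indicator columns the variable occupies. */
static size_t cat_n_cols(const struct cat_recode *c)
{
  assert(c->n_levels >= 1 && c->n_levels <= MAX_LEVELS);
  return c->n_levels - 1;
}

/* Original value -> sub-row. Returns 0, or -1 if VALUE is unknown. */
static int cat_encode(const struct cat_recode *c, const char *value,
                      double *row)
{
  size_t j, k;
  for (k = 0; k < c->n_levels; k++)
    if (strcmp(value, c->levels[k]) == 0)
      {
        for (j = 0; j < cat_n_cols(c); j++)
          row[c->first_col + j] = (j == k) ? 1.0 : 0.0;
        return 0;
      }
  return -1;
}

/* Sub-row -> original value (all zeros means the reference level). */
static const char *cat_decode(const struct cat_recode *c, const double *row)
{
  size_t j;
  for (j = 0; j < cat_n_cols(c); j++)
    if (row[c->first_col + j] != 0.0)
      return c->levels[j];
  return c->levels[c->n_levels - 1];
}

/* Value that matrix column COL stands for, or NULL if C does not own COL.
   Useful for labeling a fitted coefficient when reporting results. */
static const char *cat_label_for_column(const struct cat_recode *c,
                                        size_t col)
{
  if (col < c->first_col || col >= c->first_col + cat_n_cols(c))
    return NULL;
  return c->levels[col - c->first_col];
}

/* Among N_VARS variables, find the one that owns column COL, or NULL. */
static const struct cat_recode *
cat_find_owner(const struct cat_recode *vars, size_t n_vars, size_t col)
{
  size_t i;
  for (i = 0; i < n_vars; i++)
    if (cat_label_for_column(&vars[i], col) != NULL)
      return &vars[i];
  return NULL;
}
```

The linear searches keep the sketch short; a hash table keyed on the value, as oneway.q uses, could replace the loop in cat_encode if variables have many levels. The column-to-variable lookup is the part that hashing alone does not provide.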
This doesn't seem like something difficult to write, but I thought I
would mail the list since it is something with implications beyond
regression.q. I'm working on routines to recode categorical values to
vectors and keep track of which sub-row of a gsl_matrix corresponds to
which categorical variable. If anyone has any comments, or if someone
else wants to write such routines, let me know.
SDF Public Access UNIX System - http://sdf.lonestar.org
- recoding categorical variables,
Jason Stover <=