Re: data sets and caching
Tue, 1 Nov 2005 22:12:37 +0000
On Mon, Oct 31, 2005 at 03:09:29PM -0800, Ben Pfaff wrote:
> Jason Stover <address@hidden> writes:
> > On Mon, Oct 31, 2005 at 10:25:20AM -0800, Ben Pfaff wrote:
> >> Jason Stover <address@hidden> writes:
> >> > I need to be able to append residuals to the active file
> >> > with a 'save' subcommand. How should I go about this?
> >> Would you like to save them for a single session only,
> >> or should it be possible to save them to disk and retrieve them
> >> in later sessions?
> > Good question. I had intended to save them to the working data file,
> > as the SPSS SAVE subcommand does in its regression procedure. Users
> > mostly like to look at residuals and run tests on them after the model
> > has been fit. But if this working data file is written to disk, the
> > residuals are written with it, and can be used later.
> Ah, so the models would be included as part of the working file
> dictionary? That's a workable idea. (SPSS already does
> something like this, you say?)
Maybe. I'm not sure where to store the 'model object'. Below is
a description of what SPSS does. Residuals would definitely be
in the dictionary.
First, I should draw a distinction between 'residuals' and 'models'
here. Pardon me if I'm saying something everyone already knows.
The original question above was about 'residuals', not the entire
model. The 'model' as an object inside PSPP should be a collection
of estimated parameters, some other information, and, in some cases, one
or more pointers to functions that use those parameters to make predictions.
What belongs in a 'model object' depends on what the 'model' is in
a statistical sense, and how someone might use that model.
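To make that concrete, here is a rough sketch in C of what such a
model object might look like for a linear model. The names
'struct model' and 'linear_predict' are made up for illustration;
nothing like this exists in PSPP yet, and a real model object would
carry more information than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model object: estimated parameters plus a pointer to
   a function that uses them to make predictions. */
struct model
  {
    size_t n_coeffs;        /* Number of estimated parameters. */
    const double *coeffs;   /* coeffs[0] is the intercept. */

    /* Predicts the dependent variable for one case, given the
       values X[0..N_X-1] of its explanatory variables. */
    double (*predict) (const struct model *, const double *x, size_t n_x);
  };

/* Prediction function for a linear model:
   coeffs[0] + coeffs[1] * x[0] + ... + coeffs[n_x] * x[n_x - 1]. */
static double
linear_predict (const struct model *m, const double *x, size_t n_x)
{
  /* One coefficient per explanatory variable, plus the intercept. */
  assert (m->n_coeffs == n_x + 1);

  double y = m->coeffs[0];
  for (size_t i = 0; i < n_x; i++)
    y += m->coeffs[i + 1] * x[i];
  return y;
}
```

A different kind of model would keep the same struct but point
'predict' at a different function, which is the reason for using a
function pointer here.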
The 'residuals' in linear regression are the errors incurred by using the
model to predict the dependent variable. E.g., if we have a variable Y
and fit a regression model with a single explanatory variable X:
Y[i] \approx b0 + b1 * X[i]
where i is the case number, then the residuals are the values
Y[i] - (b0 + b1 * X[i])
and there are as many residuals as cases.
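As a small illustration, the residuals for this one-variable case
could be computed like this in C. This is only a sketch: the function
name 'compute_residuals' and its arguments are my own invention, and
it assumes b0 and b1 have already been estimated elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Stores y[i] - (b0 + b1 * x[i]) in resid[i] for each of the N cases,
   so RESID ends up with exactly as many entries as there are cases. */
static void
compute_residuals (const double *y, const double *x, size_t n,
                   double b0, double b1, double *resid)
{
  assert (y != NULL && x != NULL && resid != NULL);

  for (size_t i = 0; i < n; i++)
    resid[i] = y[i] - (b0 + b1 * x[i]);
}
```

The 'resid' array is what a 'save' subcommand would then append to
the working file as a new variable.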
The 'save' subcommand appends the residuals to the current working
file, as a new variable. The residuals aren't really part of a
'statistical model', but some 'model caches' in PSPP should probably
include residuals. And yes, the residuals should be included in the
dictionary, to answer the question above.
About saving model information in the dictionary: Though SPSS does not
save the entire 'model', it has a 'matrix' subcommand. That subcommand
saves some model information either on disk, or in the working file. In
the latter case, the old working file disappears. This subcommand
seems kludgy for two reasons. First, the subcommand does not
save enough relevant information about the model. Second, by failing
to create a nice, reusable data structure 'behind the scenes', it does
not allow a user to name a model and use it later.
I don't think we want a model object to overwrite the working file
(except to make PSPP a 'clone'), but we may want to include
a pointer to a model in the dictionary of the data set used
to build that model.
In any case, users will want to be able to refer back to old models,
but not necessarily the data sets used to fit them, and vice versa.
Therefore, the models and data sets/dictionaries should be distinct.
> Should it be possible to save them to and retrieve them from
> separate files? (Maybe the SAVE/XSAVE command could support an
> option that saves models without associated data.)
I think this is a swell idea. Right now, SPSS' OUTFILE subcommand with
the MODEL keyword saves the model information as XML.
Given that PSPP should be able to save a model object, then load it
later, perhaps with an entirely different, inappropriate data set, the
model object should probably store some information from the
dictionary in use at the time the model was created. In particular,
that model object should know if a user asks it to do something
impossible, such as using a string variable in a place where a numeric
SDF Public Access UNIX System - http://sdf.lonestar.org
- Re: data sets and caching,
Jason Stover <=