pspp-dev

Re: data sets and caching


From: Ben Pfaff
Subject: Re: data sets and caching
Date: Mon, 31 Oct 2005 10:25:20 -0800
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

Jason Stover <address@hidden> writes:

> I need to be able to append residuals to the active file
> with a 'save' subcommand. How should I go about this?

Would you like to save them for a single session only,
or should it be possible to save them to disk and retrieve them
in later sessions?

[...]

> Here is an example of syntax that shows what users would want
> to be able to do (I'm using hypothetical syntax to illustrate
> the idea):
>
> regression /data=train_data /variables=v0 v1 v2 /statistics default
>        /dependent=v2 /method=enter /name=model1.
>
> nlr /data=train_data /variables=v0 v1 v2 /statistics default
>     /dependent=v2 /method=enter /name=model2.
>
> model_compare /data=test_data model1 model2 /criteria ssresid absdev.

Let me see if I understand this.  Please correct me if I am
wrong: REGRESSION and NLR take the same input data (training
data) and fit its structure according to different models.
REGRESSION's model is saved as model1, NLR's model as model2.
Then MODEL_COMPARE compares the effectiveness of these models on
a second set of data (test data), using the saved models.

> This syntax illustrates two design changes that would make pspp more flexible
> for users.
>
> 1. The user can name the output from any procedure.  [...]

This looks good to me.  Do you have a good idea for syntax?  It
would be nice if the syntax were uniform across procedures, so
we'd want a keyword that wasn't already used (much) and ideally
one unique in its first three letters.  "name" seems a little too
generic for that purpose.

If disk storage is possible, then we'd need a way to distinguish
between file names and cache variable names.  (Perhaps file names
are enclosed in quotes and cache variable names are not.)
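
To make the quoting idea concrete, here is a minimal sketch in C.  It
assumes the convention suggested above (a quoted token is a file name,
a bare token is a cache variable name).  The names `target_kind` and
`classify_target` are hypothetical and are not part of PSPP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of classifying an argument that names a model. */
enum target_kind { TARGET_INVALID, TARGET_FILE, TARGET_CACHE };

/* Classifies TOKEN under the assumed convention:
   'quoted' or "quoted" => file name, bare token => cache variable name.
   On success, stores the start of the name (without quotes) in *NAME
   and its length in *LEN.  The name is not null-terminated at *LEN. */
static enum target_kind
classify_target (const char *token, const char **name, size_t *len)
{
  size_t n;

  assert (token != NULL && name != NULL && len != NULL);
  n = strlen (token);

  /* Quoted and non-empty: opening quote must match the closing one. */
  if (n >= 3 && (token[0] == '\'' || token[0] == '"')
      && token[n - 1] == token[0])
    {
      *name = token + 1;
      *len = n - 2;
      return TARGET_FILE;
    }

  /* Bare token: treat it as a cache variable name. */
  if (n > 0 && token[0] != '\'' && token[0] != '"')
    {
      *name = token;
      *len = n;
      return TARGET_CACHE;
    }

  /* Empty token, empty quotes, or an unterminated quote. */
  return TARGET_INVALID;
}
```

With this convention, `'model1.sav'` would be read as a file and
`model1` as a cache variable.  A real lexer would already have done the
quote handling, so this only illustrates the distinction.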

> I have tried to show how a procedure can create such a cache with
> the regression procedure, which creates a pspp_linreg_cache. If the
> regression procedure gave that cache a name (assuming the user wanted
> to do so), I could just remove the 'pspp_linreg_cache_free (lcache);'
> statement in regression.q and the cache could be used later. A garbage
> collector could free the memory when the cache is no longer needed.

When would the cache no longer be needed?  That is, do models ever
become invalid?
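
One possible answer, sketched below in C, is that a model stays valid
until its name is reused or the session ends, so no garbage collector
is needed.  This is only an assumption for illustration, not a
description of how PSPP works.  The registry, `model_store`,
`model_lookup`, and `model_clear_all` are all hypothetical names, and
the model is kept as an opaque pointer with a destructor so that
something like a pspp_linreg_cache could be stored without the
registry knowing its layout.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MODELS 16           /* Arbitrary limit for this sketch. */

struct named_model
  {
    char *name;                 /* Owned copy of the user-supplied name. */
    void *model;                /* Opaque model, e.g. a regression cache. */
    void (*destroy) (void *);   /* Frees MODEL. */
  };

static struct named_model registry[MAX_MODELS];
static size_t n_models;

/* Returns the entry named NAME, or NULL if there is none. */
static struct named_model *
find_model (const char *name)
{
  size_t i;

  for (i = 0; i < n_models; i++)
    if (!strcmp (registry[i].name, name))
      return &registry[i];
  return NULL;
}

/* Binds NAME to MODEL and takes ownership of MODEL.  A model already
   stored under NAME is destroyed first.  Returns 0 on success, or -1
   if the registry is full or memory is exhausted, in which case the
   caller still owns MODEL. */
static int
model_store (const char *name, void *model, void (*destroy) (void *))
{
  struct named_model *m;

  assert (name != NULL && destroy != NULL);
  m = find_model (name);
  if (m != NULL)
    m->destroy (m->model);
  else
    {
      char *copy;

      if (n_models >= MAX_MODELS)
        return -1;
      copy = malloc (strlen (name) + 1);
      if (copy == NULL)
        return -1;
      strcpy (copy, name);
      m = &registry[n_models++];
      m->name = copy;
    }
  m->model = model;
  m->destroy = destroy;
  return 0;
}

/* Returns the model stored under NAME, or NULL if there is none. */
static void *
model_lookup (const char *name)
{
  struct named_model *m = find_model (name);
  return m != NULL ? m->model : NULL;
}

/* Destroys every stored model, e.g. at the end of a session. */
static void
model_clear_all (void)
{
  while (n_models > 0)
    {
      struct named_model *m = &registry[--n_models];
      m->destroy (m->model);
      free (m->name);
    }
}
```

Under this scheme the procedure would hand its cache to `model_store`
instead of freeing it, and a later command would fetch it with
`model_lookup`.  Whether a model should also become invalid when the
data or dictionary changes is left open here.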

> 2. Users can name data sets to be used in a procedure. Then PSPP could
> fit models to different data sets and evaluate them using a 'test'
> data set. PSPP could also be made to manipulate multiple data sets
> (such as merging them). SAS users spend a lot of time sorting,
> merging, concatenating and de-duplicating data sets. SPSS does not
> allow this, and that is one reason for SAS' popularity. PSPP's
> inability to do this makes it less attractive to users. I know
> this functionality lies beyond cloning SPSS, but it is functionality
> users find important, and other free statistical software can't do it
> (as far as I know). R names each data set, and it can sort, but users
> cannot combine and de-duplicate data sets as easily as they can with
> SAS. R cannot work with the large data sets that SAS can use, either.

This is the "data" keyword above?  Would this simply be a matter
of supporting multiple, named "active files"?  I think that would
not be a huge amount of work, although it would be kind of tricky
to verify it was correct.  Most of the representation of the
active file is encapsulated in the `dictionary' object, and it
would be possible to add support for multiple instances of other
objects (e.g. the virtual file manager) as necessary.

The work needed is partly clean-ups in the code base that I want
to do anyway.

I don't know whether a "name" keyword on procedures would be
sufficient for this purpose, because transformations that precede
a procedure invocation also need to know which active file they
are working on.  That's assuming that the different active files
can have different dictionaries; if their dictionaries are
identical and only the data differ, then a keyword on the
procedure would be enough as far as I can tell.

Note that the effect of multiple active files can be partially
simulated with PROCESS IF, e.g.
        PROCESS IF dataset='test'.
        PROCESS IF dataset='training'.
or similarly with TEMPORARY/SELECT IF or COMPUTE/FILTER.

> I know implementing these ideas might be a lot of work, but they would
> make PSPP immensely more useful. I do not think the model-caching is beyond
> the plan for PSPP since most of it (as far as I can see) involves making
> procedures create objects that can be used later. I do not know as
> much about data-shuffling, so I can't comment on that.

It sounds like a good goal to me.
-- 
"Implementation details are beyond the scope of the Java virtual
 machine specification.  One should not assume that every virtual
 machine implementation contains a giant squid."
--"Mr. Bunny's Big Cup o' Java"



