Re: address@hidden: Re: Regression results need checking ?]
From: Ben Pfaff
Subject: Re: address@hidden: Re: Regression results need checking ?]
Date: Sun, 04 Feb 2007 12:28:23 -0800
User-agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)

Jason Stover <address@hidden> writes:
> I can fix the computation of the standardized coefficients, but before
> I do, I have a question. Is there a place where the regression
> procedure can just read the standard deviation for a variable, or must
> it compute the standard deviation itself? And if the regression
> procedure must compute the standard deviation itself, is there
> a single routine somewhere in src that it can use, or does it
> need its own?
>
> The reason I ask is because this test data set has missing data, and
> regression already has its own way of dealing with missing cases. It
> would be nice if there were another standard procedure to call to
> compute descriptive statistics without having to make regression aware
> of yet another way to handle missing data. Computing means, standard
> deviations, and other univariate statistics is a common enough task
> that there should be one place to do it.
This has been on my to-do list for a long time. I agree that it
is important to solve it. I consider it somewhat difficult to
solve because of these factors:
        1. Different procedures want to include different data in
           the calculations.  Some want moments by SPLIT FILE
           groups or by other break groups, some want them for
           the entire file.  Some want to drop user-missing
           values, some want to drop even non-missing values
           when other variables are missing.  So we need a way to
           identify and distinguish these different needs.

        2. There needs to be a way to detect when the active file
           has changed, so that cached calculations can be
           dropped.  The same mechanism is likely to be useful
           for other optimizations if it is general-purpose
           enough.

        3. We need a good data structure to store all this.  I
           was thinking about that a while ago and didn't come up
           with anything that made me entirely happy, but I'm
           sure that a good solution exists.
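
To make the three factors concrete, here is a rough sketch in C.  It is
not existing PSPP code, and every name in it (missing_policy, moments,
moments_cache, cache_get) is hypothetical.  It assumes missing values
are stored as NaN and that the data is a plain row-major array of
doubles.  It also assumes the owner of the data bumps a "generation"
counter on every change.  The cache key is the variable plus the
missing-value policy, and an entry is recomputed when the generation it
was built for no longer matches.  Break groups are left out entirely.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: how a caller wants missing values treated. */
enum missing_policy
  {
    MISSING_PAIRWISE,   /* Drop a case only if this variable is missing. */
    MISSING_LISTWISE    /* Drop a case if any variable in it is missing. */
  };

/* Running moments, updated one value at a time (Welford's method). */
struct moments
  {
    size_t n;           /* Number of values included. */
    double mean;        /* Running mean. */
    double m2;          /* Sum of squared deviations from the mean. */
  };

/* A single-entry cache: moments for one variable under one policy. */
struct moments_cache
  {
    unsigned long generation;   /* Data generation the entry was built for. */
    bool valid;                 /* False until the first computation. */
    enum missing_policy policy;
    size_t var_idx;
    struct moments m;
  };

static void
moments_add (struct moments *m, double x)
{
  double delta = x - m->mean;
  m->n++;
  m->mean += delta / m->n;
  m->m2 += delta * (x - m->mean);
}

/* Sample variance, or NaN if fewer than two values were included. */
static double
moments_variance (const struct moments *m)
{
  return m->n > 1 ? m->m2 / (m->n - 1) : NAN;
}

/* Assumption for this sketch only: NaN marks a missing value. */
static bool
case_is_dropped (const double *row, size_t n_vars, size_t var_idx,
                 enum missing_policy policy)
{
  if (policy == MISSING_PAIRWISE)
    return isnan (row[var_idx]);
  for (size_t i = 0; i < n_vars; i++)
    if (isnan (row[i]))
      return true;
  return false;
}

/* Returns moments for variable VAR_IDX under POLICY, recomputing only
   if the cached entry was built for a different generation or key. */
static const struct moments *
cache_get (struct moments_cache *c, const double *data,
           size_t n_cases, size_t n_vars, unsigned long generation,
           size_t var_idx, enum missing_policy policy)
{
  if (!c->valid || c->generation != generation
      || c->var_idx != var_idx || c->policy != policy)
    {
      struct moments m = { 0, 0.0, 0.0 };
      for (size_t i = 0; i < n_cases; i++)
        {
          const double *row = data + i * n_vars;
          if (!case_is_dropped (row, n_vars, var_idx, policy))
            moments_add (&m, row[var_idx]);
        }
      c->m = m;
      c->generation = generation;
      c->var_idx = var_idx;
      c->policy = policy;
      c->valid = true;
    }
  return &c->m;
}
```

The standard deviation is the square root of moments_variance().  A
real version would need more than one entry, which is exactly where
factor 3 comes in, and the generation counter only works if every code
path that modifies the data remembers to bump it.
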
> So as long as we're on the topic, it might be nice to have a couple of
> routines in src/math to compute such descriptive statistics, and maybe
> even store them in a cache. Would a pool serve this purpose? I guess
> by raising the issue, this means I'm volunteering to do it.
I'm not sure that a pool is really a solution, but it might be
part of one.
I don't think you're obligated to build this. It's an ongoing
issue and I'd rather have a good general-purpose solution than a
hasty one.
--
"Premature optimization is the root of all evil."
--D. E. Knuth, "Structured Programming with go to Statements"