Re: regression lib

pspp-dev

From:	John Darrington
Subject:	Re: regression lib
Date:	Tue, 3 May 2005 07:31:58 +0800
User-agent:	Mutt/1.5.4i

On Mon, May 02, 2005 at 08:24:57AM -0700, Ben Pfaff wrote:
     John Darrington <address@hidden> writes:
     
     > Currently there's no caching of statistics.  Each procedure
     > calculates them for itself, which is less than ideal because it leads
     > to a lot of duplication.  For example, group.c largely duplicates
     > factor_stats.c.
     
     Hmm.  If so, I think that's probably orthogonal to the caching
     problem.  Is there some reason those files can't share some
     common code to perform their common functionality?  

There's no fundamental reason.  It's just a question of coming up
with the right model to fit the problem (which implies that one has to
understand the problem adequately).  If I'm going to spend the time
refactoring those two files, then I want to do it in such a way
that'll make implementation of other procedures easier.

     > It's not only mean and stddev.  I can foresee dozens of procedures
     > which need to calculate SST, SSE, etc.  It would be good if
     > applications could just look these values up in a cache.  But there
     > are a lot of issues to consider:
     >
     > * The cache would have to be invalidated every time a transformation
     >   is done.
     
     This is something we'll just have to deal with.  I don't think
     it's too hard.  We just add a `statcache_invalidate(variable)'
     function and call it for the modified variables from every
     transformation that modifies variables, plus a
     `statcache_invalidate_all()' function that invalidates everything
     for procedures that modify the entire file (e.g. MATCH FILES).

Sounds plausible.
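
A minimal sketch of how the two invalidation hooks could work, assuming
a small fixed-size table indexed by variable number.  Only the names
`statcache_invalidate' and `statcache_invalidate_all' come from the
suggestion above; `statcache_store', `statcache_lookup', the struct and
N_VARS are hypothetical and are not existing PSPP code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dictionary size, fixed only to keep the sketch short. */
#define N_VARS 8

/* Hypothetical per-variable cache entry (not an existing PSPP type). */
struct statcache_entry
  {
    bool valid;         /* True while MEAN and STDDEV reflect the data. */
    double mean;
    double stddev;
  };

/* Static storage starts zeroed, so every entry begins invalid. */
static struct statcache_entry statcache[N_VARS];

/* Hypothetical: records freshly computed statistics for VAR_IDX. */
void
statcache_store (size_t var_idx, double mean, double stddev)
{
  assert (var_idx < N_VARS);
  statcache[var_idx].valid = true;
  statcache[var_idx].mean = mean;
  statcache[var_idx].stddev = stddev;
}

/* Hypothetical: returns true and fills *MEAN and *STDDEV on a hit.
   On a miss it returns false and the caller must recompute. */
bool
statcache_lookup (size_t var_idx, double *mean, double *stddev)
{
  assert (var_idx < N_VARS);
  if (!statcache[var_idx].valid)
    return false;
  *mean = statcache[var_idx].mean;
  *stddev = statcache[var_idx].stddev;
  return true;
}

/* To be called by any transformation that modifies variable VAR_IDX. */
void
statcache_invalidate (size_t var_idx)
{
  assert (var_idx < N_VARS);
  statcache[var_idx].valid = false;
}

/* To be called by procedures that rewrite the whole file. */
void
statcache_invalidate_all (void)
{
  size_t i;

  for (i = 0; i < N_VARS; i++)
    statcache[i].valid = false;
}
```

A transformation would call statcache_invalidate() for each variable it
writes, so the next procedure gets a miss and recomputes.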
     
     > * Caching would be useful not only on complete variables, but also on
     >   subsets of cases.  E.g. variable X, factored by variable Y.  So how
     >   does one define all the possibilities?
     
     I have two ideas:
     
             1. Ignore the problem.  Only cache statistics on complete
                variables.

That's the easiest way.  But will it give sufficient optimisation?  It'll
speed up only the most trivial uses of PSPP.
     
             2. Try to handle some special cases as special cases.
                For example, if FILTER BY <VAR> is in effect, then we
                could cache those values as long as FILTER BY <VAR>
                remained in effect and <VAR> was unmodified.

I was thinking more about situations such as:

DATA LIST LIST /A * /B * .
BEGIN DATA.
3.4  1
4.3  2
.
.
END DATA.

T-TEST /GROUPS=B
       /VARIABLES A.

ONEWAY A BY B.

Here the ONEWAY procedure does all the same calculations as the T-TEST
(assuming B takes only 2 values).   But all the data that T-TEST has
calculated is freed when it exits.
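
To make the overlap concrete: both procedures need only the count, mean
and sum of squared deviations of A within each value of B.  Here is a
rough sketch, assuming Welford's running update for the moments.  All
of the names are hypothetical and nothing here is taken from group.c or
factor_stats.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical running moments for the cases in one group.
   A zero-initialised struct represents an empty group. */
struct group_moments
  {
    double n;           /* Number of cases seen so far. */
    double mean;        /* Running mean. */
    double m2;          /* Sum of squared deviations from MEAN. */
  };

/* Adds one observation X to group G using Welford's update. */
void
group_moments_add (struct group_moments *g, double x)
{
  double delta;

  g->n += 1.0;
  delta = x - g->mean;
  g->mean += delta / g->n;
  g->m2 += delta * (x - g->mean);
}

/* Within-groups sum of squares (SSE). */
double
ss_within (const struct group_moments *groups, size_t n_groups)
{
  double sse = 0.0;
  size_t i;

  for (i = 0; i < n_groups; i++)
    sse += groups[i].m2;
  return sse;
}

/* Between-groups sum of squares, measured around the grand mean. */
double
ss_between (const struct group_moments *groups, size_t n_groups)
{
  double total_n = 0.0, total_sum = 0.0, grand_mean, ssb = 0.0;
  size_t i;

  for (i = 0; i < n_groups; i++)
    {
      total_n += groups[i].n;
      total_sum += groups[i].n * groups[i].mean;
    }
  assert (total_n > 0.0);
  grand_mean = total_sum / total_n;

  for (i = 0; i < n_groups; i++)
    {
      double d = groups[i].mean - grand_mean;
      ssb += groups[i].n * d * d;
    }
  return ssb;
}
```

With two groups, ss_within / (N - 2) is both the pooled variance of the
equal-variances t-test and the mean square error of the one-way ANOVA,
so a cached array of these moments would serve both procedures.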

     > * Each statistic (eg: mean, stddev) will be different depending upon
     >   the specification of the procedure's /MISSING subcommand.
     
     The most common case is "itemwise" missing with user-missing
     values removed.  We can ignore other cases if we want to.  When
     you're caching, you want to save time in the most common cases.
     If you can save time in other cases, too, that's great, but it's
     not as valuable because they don't come up as much.

Sure, it can be done.  We just have to be very careful that we don't
end up using a cached value where it's not appropriate.
     
     > All these things complicate the implementation and would mean that the
     > potential cache space would be quite large.
     
     But you don't reserve space for all of them on each variable.
     You just allocate space as you need it.  Furthermore, because the
     cache is just an optimization, you can throw it, or part of it,
     away if it gets too large.
     
I wasn't thinking so much about the physical memory, but rather the
way in which we would address these cached values given that there are
a lot of parameters needed in order to correctly specify the required
cached datum. 
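
One way to picture the addressing problem is as a composite lookup key.
The sketch below assumes a key made of the statistic, the variable, an
optional factor variable and value, and the missing-value treatment.
Every name in it is hypothetical, and the linear search merely stands
in for whatever hash table a real implementation would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All of the following types are hypothetical. */
enum stat_kind { STAT_MEAN, STAT_STDDEV, STAT_SST, STAT_SSE };
enum missing_mode { MISSING_EXCLUDE_USER, MISSING_INCLUDE_USER };

#define NO_FACTOR (-1)
#define MAX_SLOTS 64

/* Everything needed to say exactly which cached datum is meant. */
struct stat_key
  {
    enum stat_kind kind;
    int var;                    /* Index of the dependent variable. */
    int factor;                 /* Index of the factor, or NO_FACTOR. */
    double factor_value;        /* Group value; unused with NO_FACTOR. */
    enum missing_mode missing;  /* Missing-value treatment in effect. */
  };

struct stat_slot
  {
    struct stat_key key;
    double value;
  };

static struct stat_slot stat_slots[MAX_SLOTS];
static size_t n_stat_slots;

/* Two keys match only if every relevant parameter matches. */
static bool
stat_key_equal (const struct stat_key *a, const struct stat_key *b)
{
  if (a->kind != b->kind || a->var != b->var
      || a->factor != b->factor || a->missing != b->missing)
    return false;
  return a->factor == NO_FACTOR || a->factor_value == b->factor_value;
}

/* Returns true and stores the datum in *VALUE on an exact key match. */
bool
keyed_cache_get (const struct stat_key *key, double *value)
{
  size_t i;

  for (i = 0; i < n_stat_slots; i++)
    if (stat_key_equal (&stat_slots[i].key, key))
      {
        *value = stat_slots[i].value;
        return true;
      }
  return false;
}

/* Stores VALUE under KEY, overwriting any existing entry.  When the
   table is full, everything is discarded first: the cache is only an
   optimisation, so throwing it away is always safe. */
void
keyed_cache_put (const struct stat_key *key, double value)
{
  size_t i;

  for (i = 0; i < n_stat_slots; i++)
    if (stat_key_equal (&stat_slots[i].key, key))
      {
        stat_slots[i].value = value;
        return;
      }
  if (n_stat_slots == MAX_SLOTS)
    n_stat_slots = 0;
  assert (n_stat_slots < MAX_SLOTS);
  stat_slots[n_stat_slots].key = *key;
  stat_slots[n_stat_slots].value = value;
  n_stat_slots++;
}
```

Because the missing-value treatment is part of the key, a value cached
under one /MISSING setting cannot be returned for another.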

     I think this came up before and I threw up some of these same
     objections.  They are problems, sure.  But they are problems we
     can deal with and I think we should, sometime post-0.4.0.

I'm not saying that these problems can't be overcome.  But they are
problems which need to be carefully considered.  And yes, if we're
going to do it, then it definitely should be after 0.4.0.

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.


Current Thread

Re: regression lib, Jason H. Stover, 2005/05/01
- Re: regression lib, John Darrington, 2005/05/02
  - Re: regression lib, Ben Pfaff, 2005/05/02
    - Re: regression lib, John Darrington <=
