Re: casefile.c revision

pspp-dev

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: casefile.c revision

From:	John Darrington
Subject:	Re: casefile.c revision
Date:	Fri, 3 Jun 2005 08:33:27 +0800
User-agent:	Mutt/1.5.4i

On Wed, Jun 01, 2005 at 09:29:26PM -0700, Ben Pfaff wrote:

     > 2. Rank &c requires a unique sort --- but it requires a unique sort with
     >    some kind of consolidation function.  By implication, we'd need to
     >    store extra data to "unconsolidate" .  By way of illustration:
     >
     > DATA LIST LIST /x * .
     > BEGIN DATA
     > 3
     > 2
     > 4
     > 4
     > 6
     > END DATA.
     >
     > RANK X 
     >   /TIES=CONSOLIDATE.
     >
     > The first sort needs to produce the sequence 2,3,4,6  and the second
     > must recover the original 3,2,4,4,6 (without loosing any other variables
     > either).

     I don't think it's a big deal. [..snip..]

I chose a bad example.  I think it's a bigger deal than it appears at
first glance.  Hopefully this example illustrates my point better:

Let's say we're doing  RANK X /TIES=MEAN.

and the X data contains '1', '2', '4' and N cases of '3' where N is some
unknown but very large integer.

1. We do our first sort and get:

 X
 -

 1
 2
 3
 .
 .
 . 
 3
 4

2. The ranks are :

 X  R_i
 -  ----
 1  1
 2  2
 3  2 + (N+1)/2
 .  2 + (N+1)/2
 .  2 + (N+1)/2
 .  2 + (N+1)/2
 3  2 + (N+1)/2
 4  3 + N

 but the problem is, that we don't know the value of N until we've
 read ALL the '3's.  Thus, if  N happens to have the value 100,000,000
 then we cannot  write the R_i value without resorting  to a random access.

 I ran into this problem before in the EXAMINE command.  I solved it
 in a rather inefficient way by using a hash table, so EXAMINE will
 probably fail (or be very slow) under this scenario.  I'd like to
 have a re-usable function which solves this in a better way;  It'll
 be needed quite a lot in the NPAR command.

 The only way I can see to solve this problem (given a ban on random
 access) requires the ability to  iterate a casefile in reverse order
 (or else we have to pay the  penalty of another sort).  My algorithm
 would go like this:

 Iterate the casefile in the forward direction, adding the C_i
 variable. C_i is what I call the "aggregated case weight".  Note
 however that for the '3's we cannot know the value of C_i until we
 see the first non '3' (or end of file).  

 So we allocate C_i as follows.  Here - means SYSMIS.  Note that each
 C_i value can only be written *after* we have read the (i+1)th case
 (or seen end-of-file).

 X  C_i
 -  ---
 1  1
 2  1
 3  -
 .  -
 .  -
 .  -
 3  N
 4  1

3.  Now we can iterate the casefile in reverse (or do another sort to
    reverse it).

 X  C_i
 -  ---
 1  1
 2  1
 3  N
 .  N
 .  N
 .  N
 3  N
 4  1

4.  (I think this step can be done in the same iteration as 3 above).
    Next we have to allocate CC_i, the cumulative aggregated weights.

 X  C_i   CC_i
 -  ---   ----
 1  1     1
 2  1     2
 3  N     2+N
 .  N     2+N
 .  N     2+N
 .  N     2+N
 3  N     2+N
 4  1     3+N 

5.  R_i is now calculated as described in SPSS Statisical Algorithms.
    In the case of /TIES=MEAN it's CC_{i - 1} + (C_i + 1)/2

 X  C_i   CC_i   R_i 
 -  ---   ----   ---
 1  1     1      1
 2  1     2      2
 3  N     2+N    2 + (N+1)/2
 .  N     2+N    2 + (N+1)/2
 .  N     2+N    2 + (N+1)/2
 .  N     2+N    2 + (N+1)/2
 3  N     2+N    2 + (N+1)/2
 4  1     3+N    3+N

6.  Finally we have to apply the last sort to restore the original
    order.

So in total we have to do :

   2 sorts
   2 (maybe 3) forward iterations
   1 reverse iteration (or another sort followed by another forward
     iteration).

The questions that I have in the back of my mind are:

a.  What are the implications of having to write a case after having
    read the following case?  If the casefile is on disk and there
    happens to be a page boundary between the two cases of interest,
    then it'll mean unloading the current page and loading up the
    previous one and then going back to the current one .....

b.  Are there any problems with extending case_reader so that it can
    iterate a casefile in reverse order?

c.  Is there an easier way to do this which I've missed completely?

Hope this all makes sense.

John

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

pgplsKuVGwWMg.pgp
Description: PGP signature

[Prev in Thread]

Current Thread

[Next in Thread]

casefile.c revision, John Darrington, 2005/06/01
- Message not available
  - Message not available
    - Re: casefile.c revision, Ben Pfaff, 2005/06/02
    - Re: casefile.c revision, John Darrington <=
    - Re: casefile.c revision, Ben Pfaff, 2005/06/03
    - Re: casefile.c revision, John Darrington, 2005/06/03
    - Re: casefile.c revision, Ben Pfaff, 2005/06/04
    - Re: casefile.c revision, John Darrington, 2005/06/04
    - Re: casefile.c revision, Ben Pfaff, 2005/06/24
    - Re: casefile.c revision, John Darrington, 2005/06/25

Prev by Date: Re: casefile.c revision
Next by Date: Re: casefile.c revision
Previous by thread: Re: casefile.c revision
Next by thread: Re: casefile.c revision
Index(es):
- Date
- Thread