[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: casefile.c revision
From: 
John Darrington 
Subject: 
Re: casefile.c revision 
Date: 
Fri, 3 Jun 2005 08:33:27 +0800 
Useragent: 
Mutt/1.5.4i 
On Wed, Jun 01, 2005 at 09:29:26PM 0700, Ben Pfaff wrote:
> 2. Rank &c requires a unique sort  but it requires a unique sort with
> some kind of consolidation function. By implication, we'd need to
> store extra data to "unconsolidate" . By way of illustration:
>
> DATA LIST LIST /x * .
> BEGIN DATA
> 3
> 2
> 4
> 4
> 6
> END DATA.
>
> RANK X
> /TIES=CONSOLIDATE.
>
> The first sort needs to produce the sequence 2,3,4,6 and the second
> must recover the original 3,2,4,4,6 (without loosing any other variables
> either).
I don't think it's a big deal. [..snip..]
I chose a bad example. I think it's a bigger deal than it appears at
first glance. Hopefully this example illustrates my point better:
Let's say we're doing RANK X /TIES=MEAN.
and the X data contains '1', '2', '4' and N cases of '3' where N is some
unknown but very large integer.
1. We do our first sort and get:
X

1
2
3
.
.
.
3
4
2. The ranks are :
X R_i
 
1 1
2 2
3 2 + (N+1)/2
. 2 + (N+1)/2
. 2 + (N+1)/2
. 2 + (N+1)/2
3 2 + (N+1)/2
4 3 + N
but the problem is, that we don't know the value of N until we've
read ALL the '3's. Thus, if N happens to have the value 100,000,000
then we cannot write the R_i value without resorting to a random access.
I ran into this problem before in the EXAMINE command. I solved it
in a rather inefficient way by using a hash table, so EXAMINE will
probably fail (or be very slow) under this scenario. I'd like to
have a reusable function which solves this in a better way; It'll
be needed quite a lot in the NPAR command.
The only way I can see to solve this problem (given a ban on random
access) requires the ability to iterate a casefile in reverse order
(or else we have to pay the penalty of another sort). My algorithm
would go like this:
Iterate the casefile in the forward direction, adding the C_i
variable. C_i is what I call the "aggregated case weight". Note
however that for the '3's we cannot know the value of C_i until we
see the first non '3' (or end of file).
So we allocate C_i as follows. Here  means SYSMIS. Note that each
C_i value can only be written *after* we have read the (i+1)th case
(or seen endoffile).
X C_i
 
1 1
2 1
3 
. 
. 
. 
3 N
4 1
3. Now we can iterate the casefile in reverse (or do another sort to
reverse it).
X C_i
 
1 1
2 1
3 N
. N
. N
. N
3 N
4 1
4. (I think this step can be done in the same iteration as 3 above).
Next we have to allocate CC_i, the cumulative aggregated weights.
X C_i CC_i
  
1 1 1
2 1 2
3 N 2+N
. N 2+N
. N 2+N
. N 2+N
3 N 2+N
4 1 3+N
5. R_i is now calculated as described in SPSS Statisical Algorithms.
In the case of /TIES=MEAN it's CC_{i  1} + (C_i + 1)/2
X C_i CC_i R_i
   
1 1 1 1
2 1 2 2
3 N 2+N 2 + (N+1)/2
. N 2+N 2 + (N+1)/2
. N 2+N 2 + (N+1)/2
. N 2+N 2 + (N+1)/2
3 N 2+N 2 + (N+1)/2
4 1 3+N 3+N
6. Finally we have to apply the last sort to restore the original
order.
So in total we have to do :
2 sorts
2 (maybe 3) forward iterations
1 reverse iteration (or another sort followed by another forward
iteration).
The questions that I have in the back of my mind are:
a. What are the implications of having to write a case after having
read the following case? If the casefile is on disk and there
happens to be a page boundary between the two cases of interest,
then it'll mean unloading the current page and loading up the
previous one and then going back to the current one .....
b. Are there any problems with extending case_reader so that it can
iterate a casefile in reverse order?
c. Is there an easier way to do this which I've missed completely?
Hope this all makes sense.
John

PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
pgplsKuVGwWMg.pgp
Description: PGP signature
 casefile.c revision, John Darrington, 2005/06/01
 Message not available
 Message not available
 Re: casefile.c revision, Ben Pfaff, 2005/06/02
 Re: casefile.c revision,
John Darrington <=
 Re: casefile.c revision, Ben Pfaff, 2005/06/03
 Re: casefile.c revision, John Darrington, 2005/06/03
 Re: casefile.c revision, Ben Pfaff, 2005/06/04
 Re: casefile.c revision, John Darrington, 2005/06/04
 Re: casefile.c revision, Ben Pfaff, 2005/06/24
 Re: casefile.c revision, John Darrington, 2005/06/25