pspp-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: casefile.c revision


From: Ben Pfaff
Subject: Re: casefile.c revision
Date: Thu, 02 Jun 2005 21:59:16 -0700
User-agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> Let's say we're doing  RANK X /TIES=MEAN.
>
> and the X data contains '1', '2', '4' and N cases of '3' where N is some
> unknown but very large integer.

[...]

>  but the problem is, that we don't know the value of N until we've
>  read ALL the '3's.  Thus, if  N happens to have the value 100,000,000
>  then we cannot  write the R_i value without resorting  to a random access.

Ah.  I see.  To my mind, this is a little different from random
access.  It's more like a "bookmark", in effect "Here, keep my
place for me while I peek ahead a little bit".

I can see a bunch of ways this might be implemented.  First, we
could literally implement something like a bookmark.  The
following two ways are equivalent, I think, but they are
conceptually a bit different:

        A. Add a casereader_clone() that makes a new copy of a
           casereader, so that we can read ahead in one
           casereader and then make another pass across that same
           data in anther one.

        B. Create a new "class" called a casefile_position (or
           maybe call it a "casemark"?).  Then add a
           casereader_tell() to save a position and
           casereader_rewind() to go back to a position.  We'd
           also want a casefile_position_destroy() (see below).

Second, we could use an intermediate casefile:

        C. Copy each case into the intermediate casefile as we
           go.  When we know what the rank of a set of cases will
           be, copy the intermediate casefile into the final
           casefile, changing the ranks as we go.  Typically the
           intermediate casefile would only have 1 case in it (as
           for 1, 2, and 4 in your example above) but it could
           end up with 100,000,000 cases (as for 3 in your
           example above).

I like A and C the best.  I don't think C would even need any
change to the casefile code, although some optimization might be
helpful.

So... why the heck do I think that this is better than just a
"random seek" operation?  Well, mostly because I like to think of
casefiles as something that you usually stream from one place to
another.  That is, I like to think of them as analogous to pipes,
not to files.

In particular, I want casefiles to be able to support
"destructive readers", which are readers where once you've read a
case, it's gone--deleted, destroyed, etc.  A destructive reader
can be useful because, when the casefile data is in-memory, the
reader doesn't have to make a copy of data if it wants to modify
it; instead it can just modify the copy that the casefile had.
The copy-on-write case implementation in case.[ch] supports this
out-of-the-box.

Unfortunately, supporting random access means that this useful
optimization isn't possible, because at any time we could seek
backward to the first case (and expect to find it in its
unmodified form).  On the other hand, if we have to indicate how
many records back we can go (as in A or B) we can still discard
anything that lies before any marker.  (This is why we'd want a
casefile_position_destroy() in B: so that we know when markers
are gone and can thus discard anything before them.)

Am I making sense?

>  The only way I can see to solve this problem (given a ban on random
>  access) requires the ability to  iterate a casefile in reverse order
>  (or else we have to pay the  penalty of another sort).  

Being able to go backward has the same caveat as random access
from the viewpoint of being able to discard old cases.

> The questions that I have in the back of my mind are:

[...]

> c.  Is there an easier way to do this which I've missed completely?

What do you think of my suggestions?
-- 
"There's only one thing that will make them stop hating you.
 And that's being so good at what you do that they can't ignore you.
 I told them you were the best.  Now you damn well better be."
--Orson Scott Card, _Ender's Game_




reply via email to

[Prev in Thread] Current Thread [Next in Thread]