Re: [patch #5690] Clean up case code

pspp-dev
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [patch #5690] Clean up case code

From:	Ben Pfaff
Subject:	Re: [patch #5690] Clean up case code
Date:	Tue, 30 Jan 2007 22:30:43 -0800
User-agent:	Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)
John Darrington <address@hidden> writes:

> On Tue, Jan 30, 2007 at 06:34:09AM -0800, Ben Pfaff wrote:
>         Here's the current header file.  It uses a casereader instead
>         of a casefile because casefiles have gone away in my source
>         tree.  Currently I'm calling it a "datasheet" instead of a
>         flexifile because that seems to be a better name given that my
>         reworked source tree has no need for casefiles or multiple
>         implementations of casefiles or casefile factories.  
>       
> Does this mean that random case access is now as efficient as
> sequential access?  Based upon previous discussions, that would seem
> to be a rather fundamental axiom.  

Random access will never be as efficient as sequential access.
As long as we continue to use spinning metal discs for storage,
seeks are going to continue to be expensive.  Even if we switch
to, say, bulk nonvolatile RAM for storage, there's still going to
be reduced latency for workloads that have predictable access
patterns (e.g. benefit from readahead).  So I'm not proposing to
encourage use of random access where it's not necessary.

But no one wants to listen to me preach.  The status quo in PSPP
is that we have casefiles that you can write to until you want to
read from them.  Thereafter you can read from them, starting at
the beginning and proceeding forward case-by-case.  You can also
clone a casereader.  There's also a whole pantheon of entities
from which cases may be read (proc_open/proc_read/proc_close,
case_source, sfm_reader, pfm_reader, any_reader, scratch_reader,
storage_source, and others) and a separate pantheon to which
cases may be written (case_sink, sfm_writer, pfm_writer,
any_write, scratch_writer, sort, storage_sink, and others).  None
of these things has the same interface as any one of the others,
even though every reader of cases and every writer of cases
really has the same requirements.

In my working tree, I simplified things by defining two general
"classes": casereader and casewriter.  There are many
implementations of each class--most of the entities above becomes
an implementation of one or the other--but the interface is
simple and quite narrow.  A casereader only needs to implement
two functions, which are accessed this way:
    bool casereader_read (struct casereader *, struct ccase *);
    bool casereader_destroy (struct casereader *);
Once you read a case, it's gone forever from the point of view of
that casereader.  Therefore, there's another commonly useful
function:
    struct casereader *casereader_clone (const struct casereader *);
Some casereader implementations can efficiently implement cloned
casereaders on their own.  For those that don't, there's a
default implementation of clone that queues up a backlog of cases
that have been read by some clones but not by others.  If the
backlog gets too big, it's written to disk; otherwise, it's
maintained in memory (quite efficiently in the common case
because case_clone() just increments a reference count).  There
are also a few "convenience" functions such as
    bool casereader_peek (struct casereader *, casenumber, struct ccase *);
which makes a clone of a casereader, reads ahead the given number
of cases, and returns that case after destroying the clone of the
casereader.  (Actually, casereader implementations that support
efficient random access are allowed to override this, too.
Currently in my tree the most common use of casereader_peek() is
just to look ahead a single case.)

A casewriter is a similar class for writing cases.  Casewriters
need only support two operations:
    void casewriter_write (struct casewriter *, struct ccase *);
    bool casewriter_destroy (struct casewriter *);

In my working tree, there is no direct equivalent to a casefile.
Instead, you create a casewriter, write some cases to it, and
then call a function supported by the most interesting types of
casewriters:
    struct casereader *casewriter_make_reader (struct casewriter *);
This function returns a casereader that can then be used to read
back what you wrote to the casewriter.

The really cool thing about this scheme is that you can stack
casereaders on top of one another.  For example, consider the
problem of filtering out cases that have a missing value in one
or more of their values.  At the moment we have a separate entity
(a casefilter) for this, which must be invoked explicitly, and
before casefilters it was even cruder.  In my working tree,
though, you just create a new casereader that reads cases from
the one you already have and drops the cases you don't want,
e.g. here's an example from the RANK procedure:
  input = casereader_create_filter_missing (input, &rank_var, 1,
                                            exclude_values, output);
Cases read from "input" thereafter that have a missing value for
rank_var will be written directly to "output" without further
processing.  (If "output" were NULL then they'd just get dropped,
but RANK is a funny case.)

Probably the most powerful thing to stack on top of a casereader
is what I'm tentatively calling a "casegrouper".  A casegrouper
takes a casereader and a function that classifies consecutive
pairs of cases as in the same group or different groups.  It then
hands you a sequence of casereaders, one by one, each of which
contains a single group.  This is invaluable for SPLIT FILE, for
break groups on AGGREGATE or RANK or SORT CASES, and so on.

All in all, this reduces the meat of RANK from something almost
incomprehensible to just the following, which should be easier to
understand even without thorough knowledge of what each of the
functions does:

static void
rank_sorted_file (struct casereader *input, 
                  struct casewriter *output,
                  const struct dictionary *dict,
                  const struct rank_spec *rs, 
                  int n_rank_specs, 
                  int dest_idx, 
                  struct variable *rank_var)
{
  struct casereader *pass1, *pass2, *pass2_1;
  struct casegrouper *tie_grouper;
  struct ccase c;
  double w = 0.0;
  double cc = 0.0;
  int tie_group = 1;


  input = casereader_create_filter_missing (input, &rank_var, 1,
                                            exclude_values, output);
  input = casereader_create_filter_weight (input, dict, NULL, output);

  casereader_split (input, &pass1, &pass2);

  /* Pass 1: Get total group weight. */
  for (; casereader_read (pass1, &c); case_destroy (&c)) 
    w += dict_get_case_weight (dict, &c, NULL);
  casereader_destroy (pass1);

  /* Pass 2: Do ranking. */
  tie_grouper = casegrouper_create_vars (pass2, &rank_var, 1);
  while (casegrouper_get_next_group (tie_grouper, &pass2_1)) 
    {
      struct casereader *pass2_2 = casereader_clone (pass2_1);
      double cc_1 = cc;
      double tw = 0.0;
      int i;

      /* Pass 2.1: Sum up weight for tied cases. */
      for (; casereader_read (pass2_1, &c); case_destroy (&c)) 
        tw += dict_get_case_weight (dict, &c, NULL);
      cc += tw;
      casereader_destroy (pass2_1);

      /* Pass 2.2: Rank tied cases. */
      while (casereader_read (pass2_2, &c)) 
        {
          for (i = 0; i < n_rank_specs; ++i)
            {
              const struct variable *dst_var = rs[i].destvars[dest_idx];
              double *dst_value = &case_data_rw (&c, dst_var)->f;
              *dst_value = rank_func[rs[i].rfunc] (tw, cc, cc_1, tie_group, w);
            }
          casewriter_write (output, &c);
        }
      casereader_destroy (pass2_2);
          
      tie_group++;
    }
  casegrouper_destroy (tie_grouper);

  /* FIXME: needs to detect errors. */
}

The effect on other procedures is less pronounced, but still
positive.

I wasn't planning to start evangelizing this until I was closer
to finishing up my source patch.  What I've done already:

        * Implemented everything described above, and more,
          except that the datasheet is incomplete.

        * Make "make check" run clean, without any valgrind
          complaints, and without any memory leaks.  Oh, except
          that rank.sh is still giving some complaints; I thought
          I'd fixed them but I guess not.

What I have left:

        * Make the GUI compile and work again.  Currently it does
          neither.  As part of that, finish and test the
          datasheet implementation.

          I might need help or advice with some of the GUI stuff,
          but I don't know yet.

        * Add comments and change logs.  Currently my new code
          has hardly any.

        * Write an extensive section for the manual describing
          best practices for data processing under PSPP.  I'm
          confident that, with this set of changes, PSPP data
          processing will be mature enough that we can provide
          good guidance for future developers this way.

          I might break this into a separate developers' guide,
          along with the existing chapter on q2c.  What do you
          think?

        * Do some code profiling and performance comparisons
          against current CVS to makes sure that this new method
          isn't slower.

I'm really excited about this set of changes.  It feels to me
like one-third of the important PSPP implementation (the data
processing) is finally coming together.  The other two-thirds are
syntax parsing and output formatting (see the end of the PSPP
README), and I finally have ideas for those that I think will
really work.

This is going to be a big patch, by the way.  I'm going to do my
best to break out the most interesting parts of the change set
into separate patches, but there's going to be a lot to look at.
Fortunately, it's actually a *reduction* of the source base:
diffstat for my current working directory reports "115 files
changed, 4141 insertions(+), 6329 deletions(-)".  I *like* it
when improvements delete 2000 lines of code!  Of course, when I
add comments some of the deletions will disappear.

Now, what were we talking about originally?  Oh yeah, sequential
vs. random access.  If you mean, will random access in a
datasheet be efficient?  Yes, it should be just fine in the
common case.

I'm talking too much.  I'll shut up now.

>         Maybe "datasheet" is too close to a name used in the GUI
>         code.  In that case I'm open to other names; I could even go
>         back to flexifile if you prefer.
>
> Maybe data_model would be better --- in the MVC paradigm the datasheet
> is the thing which views the data whereas the data-model is the
> description of how the data presents itself.

data_model is a really really generic name.  It could be a name
for the model for any kind of data.  The name datasheet calls to
my mind a spreadsheet, which more specifically describes what the
datasheet actually implements.  So I'm not 100% happy with the
suggestion data_model.

Here's my full tree's diffstat, for what it's worth:
 configure.ac                               |    1 
 src/data/ChangeLog                         |   43 +
 src/data/any-reader.c                      |  113 ---
 src/data/any-reader.h                      |    6 
 src/data/any-writer.c                      |  158 -----
 src/data/any-writer.h                      |   13 
 src/data/automake.mk                       |   30 -
 src/data/case-sink.c                       |   65 --
 src/data/case-sink.h                       |   64 --
 src/data/case-source.c                     |   62 --
 src/data/case-source.h                     |   61 --
 src/data/case.c                            |    4 
 src/data/case.h                            |   11 
 src/data/casebuffer.c                      |  201 ++++++
 src/data/casebuffer.h                      |   39 +
 src/data/casefile-private.h                |  102 ---
 src/data/casefile.c                        |  344 -----------
 src/data/casefile.h                        |   71 --
 src/data/casefilter.c                      |  119 ----
 src/data/casefilter.h                      |   55 -
 src/data/casegrouper.c                     |  184 ++++++
 src/data/casegrouper.h                     |   48 +
 src/data/caseinit.c                        |  229 +++++++
 src/data/caseinit.h                        |   36 +
 src/data/casereader-private.h              |   46 +
 src/data/casereader.c                      |  559 +++++++++++++++++++
 src/data/casereader.h                      |   71 ++
 src/data/casewindow.c                      |  326 +++++++++++
 src/data/casewindow.h                      |   37 +
 src/data/casewriter-private.h              |   38 +
 src/data/casewriter.c                      |  285 +++++++++
 src/data/casewriter.h                      |   49 +
 src/data/dictionary.c                      |    3 
 src/data/fastfile.c                        |  788 ---------------------------
 src/data/fastfile.h                        |   26 
 src/data/por-file-reader.c                 |   83 +-
 src/data/por-file-reader.h                 |    5 
 src/data/por-file-writer.c                 |   90 +--
 src/data/por-file-writer.h                 |    6 
 src/data/procedure.c                       |  734 +++++++------------------
 src/data/procedure.h                       |   71 --
 src/data/scratch-handle.c                  |    8 
 src/data/scratch-handle.h                  |    2 
 src/data/scratch-reader.c                  |   58 -
 src/data/scratch-reader.h                  |    7 
 src/data/scratch-writer.c                  |   81 +-
 src/data/scratch-writer.h                  |    7 
 src/data/settings.c                        |   10 
 src/data/settings.h                        |    1 
 src/data/storage-stream.c                  |  204 -------
 src/data/storage-stream.h                  |   32 -
 src/data/sys-file-reader.c                 |  157 +++--
 src/data/sys-file-reader.h                 |    5 
 src/data/sys-file-writer.c                 |   78 +-
 src/data/sys-file-writer.h                 |    6 
 src/data/transformations.h                 |    4 
 src/language/command.c                     |    9 
 src/language/command.def                   |    1 
 src/language/control/do-if.c               |    1 
 src/language/data-io/data-list.c           |   54 +
 src/language/data-io/data-reader.c         |    5 
 src/language/data-io/get.c                 |  340 ++++-------
 src/language/data-io/inpt-pgm.c            |  141 +---
 src/language/data-io/list.q                |   66 +-
 src/language/dictionary/apply-dictionary.c |    7 
 src/language/dictionary/delete-variables.c |   10 
 src/language/dictionary/modify-variables.c |    3 
 src/language/dictionary/sys-file-info.c    |    5 
 src/language/expressions/evaluate.c        |    2 
 src/language/lexer/variable-parser.c       |    6 
 src/language/stats/ChangeLog               |   17 
 src/language/stats/aggregate.c             |  226 ++-----
 src/language/stats/autorecode.c            |   15 
 src/language/stats/binomial.c              |   35 -
 src/language/stats/binomial.h              |    6 
 src/language/stats/chisquare.c             |   64 --
 src/language/stats/chisquare.h             |    8 
 src/language/stats/crosstabs.q             |   73 +-
 src/language/stats/descriptives.c          |   76 +-
 src/language/stats/examine.q               |   51 -
 src/language/stats/flip.c                  |  210 ++-----
 src/language/stats/frequencies.q           |   57 +
 src/language/stats/npar-summary.c          |   36 -
 src/language/stats/npar-summary.h          |    9 
 src/language/stats/npar.h                  |    4 
 src/language/stats/npar.q                  |   52 -
 src/language/stats/oneway.q                |   83 +-
 src/language/stats/rank.q                  |  390 ++++---------
 src/language/stats/regression.q            |  290 +++------
 src/language/stats/sort-cases.c            |   27 
 src/language/stats/sort-criteria.c         |  126 +---
 src/language/stats/sort-criteria.h         |   13 
 src/language/stats/t-test.q                |  135 ++--
 src/language/tests/automake.mk             |    1 
 tests/command/rank.sh                      |   90 ---
 tests/libpspp/heap-test.c                  |    2 
 tests/libpspp/ll-test.c                    |    2 
 tests/libpspp/llx-test.c                   |    2 
 tests/xforms/casefile.sh                   |   72 --
 115 files changed, 4141 insertions(+), 6329 deletions(-)

-- 
"Because computer source code is an expressive means for the exchange
 of information and ideas about computer programming, we hold that it
 is protected by the First Amendment."
--Hon. Boyce F. Martin, Jr., for the 6th Circuit Court, Junger vs. Daley
[Prev in Thread]
Current Thread
[Next in Thread]
[patch #5690] Clean up case code, Ben Pfaff, 2007/01/15
- [patch #5690] Clean up case code, John Darrington, 2007/01/15
  - [patch #5690] Clean up case code, Ben Pfaff, 2007/01/15
    - [patch #5690] Clean up case code, John Darrington, 2007/01/16
    - [patch #5690] Clean up case code, Ben Pfaff, 2007/01/16
- Re: [patch #5690] Clean up case code, John Darrington, 2007/01/30
  - Re: [patch #5690] Clean up case code, Ben Pfaff, 2007/01/30
    - Re: [patch #5690] Clean up case code, Ben Pfaff, 2007/01/30
    - Re: [patch #5690] Clean up case code, John Darrington, 2007/01/30
    - Re: [patch #5690] Clean up case code, Ben Pfaff <=
    - Re: [patch #5690] Clean up case code, Ben Pfaff, 2007/01/31
    - Re: [patch #5690] Clean up case code, John Darrington, 2007/01/31
    - Re: [patch #5690] Clean up case code, Ben Pfaff, 2007/01/31
    - Sequential vs. Random access, John Darrington, 2007/01/31
Prev by Date: Re: [patch #5690] Clean up case code
Next by Date: Re: [patch #5690] Clean up case code
Previous by thread: Re: [patch #5690] Clean up case code
Next by thread: Re: [patch #5690] Clean up case code
Index(es):
- Date
- Thread