documentation for ADD FILES and UPDATE

pspp-dev
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
documentation for ADD FILES and UPDATE

From:	Ben Pfaff
Subject:	documentation for ADD FILES and UPDATE
Date:	Sun, 26 Oct 2008 20:33:13 -0700
User-agent:	Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)
[sending this to pspp-users too in the hope that some users will
comment on this documentation]

I'm working on the ADD FILES and UPDATE commands.  I'm hoping for
some pre-review of the documentation for them.  I'm having
trouble telling whether it is sufficiently clear.  Here is the
patch that I have at the moment.

Probably, at the very least, it needs some examples.

commit 6d65ac6f271cd1cec70f949ec17c91e5c63c333d
Author: Ben Pfaff <address@hidden>
Date:   Sun Oct 26 20:32:11 2008 -0700

    Document ADD FILES and UPDATE, and update docs for MATCH FILES.

diff --git a/doc/automake.mk b/doc/automake.mk
index 3445379..14359e6 100644
--- a/doc/automake.mk
+++ b/doc/automake.mk
@@ -11,6 +11,7 @@ doc_pspp_TEXINFOS = doc/version.texi \
        doc/data-selection.texi \
        doc/expressions.texi \
        doc/files.texi \
+       doc/combining.texi \
        doc/flow-control.texi \
        doc/function-index.texi \
        doc/installing.texi \
diff --git a/doc/combining.texi b/doc/combining.texi
new file mode 100644
index 0000000..086c162
--- /dev/null
+++ b/doc/combining.texi
@@ -0,0 +1,336 @@
address@hidden Combining Data Files
address@hidden Combining Data Files
+
+This chapter describes commands that allow data from system files,
+portable file, scratch files, and the active file to be combined to
+form a new active file.  These commands can combine data files in the
+following ways:
+
address@hidden
address@hidden
address@hidden FILES} interleaves or appends the cases from each input file.
+It is used with input files that have variables in common, but
+distinct sets of cases.
+
address@hidden
address@hidden FILES} adds the data together in cases that match across
+multiple input files.  It is used with input files that have cases in
+common, but different information about each case.
+
address@hidden
address@hidden updates a master data file from data in a set of
+transaction files.  Each case in a transaction data file modifies a
+matching case in the primary data file, or it adds a new case if no
+matching case can be found.
address@hidden itemize
+
+These commands share a great deal of syntax, described in the
+following section, which is followed by one section for each
+particular command.
+
address@hidden
+* Combining Files Common Syntax::
+* ADD FILES::                   Interleave cases from multiple files.
+* MATCH FILES::                 Merge cases from multiple files.
+* UPDATE::                      Update cases using transactional data.
address@hidden menu
+
address@hidden Combining Files Common Syntax
address@hidden Common Syntax
+
address@hidden
+Per input file:
+        /address@hidden,'file-name'@}
+        [/RENAME=(src_names=target_names)@dots{}]
+        [/IN=var_name]
+        [/SORT]
+
+Once per command:
+        /BY var_list[(@{D|address@hidden)] [var_list[(@{D|address@hidden@dots{}
+        [/DROP=var_list]
+        [/KEEP=var_list]
+        [/FIRST=var_name]
+        [/LAST=var_name]
+        [/MAP]
address@hidden display
+
+This section describes the syntactical features in common among the
address@hidden FILES}, @cmd{MATCH FILES}, and @cmd{UPDATE} commands.  The
+following sections describe details specific to each command.
+
+Each of these commands reads two or more input files and combines
+them.  The command's output becomes the new active file.  The input
+files are not changed on disk.
+
+The syntax of each command begins with a specification of the files to
+be read as input.  For each input file, specify FILE with a system,
+portable, or scratch file's name as a string or a file handle
+(@pxref{File Handles}), or specify an asterisk (@samp{*}) to use the
+active file as input.  Use of portable or scratch files on FILE is a
+PSPP extension.
+
+At least two FILE subcommands must be specified.  If the active file
+is used as an input source, then @cmd{TEMPORARY} must not be in
+effect.
+
+Each FILE subcommand may be followed by any number of RENAME
+subcommands that specify a parenthesized group or groups of variable
+names as they appear in the input file, followed by those variables'
+new names, separated by an equals sign (@samp{=}),
+e.g. @samp{/RENAME=(OLD1=NEW1)(OLD2=NEW2)}.  To rename a single
+variable, the parentheses may be omitted: @samp{/RENAME=OLD=NEW}.
+Within a parenthesized group, variables are renamed simultaneously, so
+that @samp{/RENAME=(A B=B A)} exchanges the names of variables A and
+B.  Otherwise, renaming occurs in left-to-right order.
+
+Each FILE subcommand may optionally be followed by a single IN
+subcommand, which creates a numeric variable with the specified name
+and format F1.0.  The IN variable takes value 1 in an output case if
+the given input file contributed to that output case, and 0 otherwise.
+The DROP, KEEP, and RENAME subcommands have no effect on IN variables.
+
+If BY is used (see below), the SORT keyword must be specified after a
+FILE if that input file is not already sorted on the BY variables.
+When SORT is specified, PSPP sorts the input file's data on the BY
+variables before it applies it to the command.  When SORT is used, BY
+is required.  SORT is a PSPP extension.
+
+PSPP merges the dictionaries of all of the input files to form the new
+active file dictionary, like so:
+
address@hidden @bullet
address@hidden
+The new active file's variables are the union of all the input files'
+variables, matched based on their name.  When a single input file
+contains a variable with a given name, the output file will contain
+exactly that variable.  When more than one input file contains a
+variable with a given name, those variables must all have the same
+type (numeric or string) and, for string variables, the same
+width.Variables are matched after renaming with the RENAME subcommand.
+Thus, RENAME can be used to resolve conflicts.
+
address@hidden
+The variable label for each output variable is taken from the first
+specified input file that has a variable label for that variable, and
+similarly for value labels and missing values.
+
address@hidden
+The new active file's file label (@pxref{FILE LABEL}) is that of the
+first specified FILE that has a file label.
+
address@hidden
+The new active file's documents (@pxref{DOCUMENT}) are the
+concatenation of all the input files' documents, in the order in which
+the FILE subcommands are specified.
+
address@hidden
+If all of the input files are weighted on the same variable, then the
+new active file is weighted on that variable.  Otherwise, the new
+active file is not weighted.
address@hidden itemize
+
+The remaining subcommands apply to the output file as a whole, rather
+than to individual input files.  They must be specified at the end of
+the command specification, following all of the FILE and related
+subcommands.  The most important of these subcommands is BY, which
+specifies a set of one or more variables that may be used to find
+corresponding cases in each of the input files.  The variables
+specified on BY must be present in all of the input files.
+Furthermore, if any of the input files are not sorted on the BY
+variables, then SORT must be specified for those input files.
+
+The variables listed on BY may include (A) or (D) annotations to
+specify ascending or descending sort order.  @xref{SORT CASES}, for
+more details on this notation.  Adding (A) or (D) to the BY subcommand
+specification is a PSPP extension.
+
+The DROP subcommand can be used to specify a list of variables to
+exclude from the output.  By contrast, the KEEP subcommand can be used
+to specify variables to include in the output; all variables not
+listed are dropped.  DROP and KEEP are executed in left-to-right order
+and may be repeated any number of times.  DROP and KEEP do not affect
+variables created by the IN, FIRST, and LAST subcommands, which are
+always included in the new active file, but they can be used to drop
+BY variables.
+
+The FIRST and LAST subcommands are optional.  They may only be
+specified on @cmd{MATCH FILES} and @cmd{ADD FILES}, and only when BY
+is used.  FIRST and LIST each adds a numeric variable to the new
+active file, with the name given as the subcommand's argument and F1.0
+print and write formats.  The value of the FIRST variable is 1 in the
+first output case with a given set of values for the BY variables, and
+0 in other cases.  Similarly, the LAST variable is 1 in the last case
+with a given of BY values, and 0 in other cases.
+
+When any of these commands creates an output case, variables that are
+only in files that are not present for the current case are set to the
+system-missing value for numeric variables or spaces for string
+variables.
+
address@hidden ADD FILES
address@hidden ADD FILES
address@hidden ADD FILES
+
address@hidden
+ADD FILES
+
+Per input file:
+        /address@hidden,'file-name'@}
+        [/RENAME=(src_names=target_names)@dots{}]
+        [/IN=var_name]
+        [/SORT]
+
+Once per command:
+        [/BY var_list[(@{D|address@hidden)] 
[var_list[(@{D|address@hidden)address@hidden
+        [/DROP=var_list]
+        [/KEEP=var_list]
+        [/FIRST=var_name]
+        [/LAST=var_name]
+        [/MAP]
address@hidden display
+
address@hidden FILES} adds cases from multiple input files.  The output,
+which replaces the active file, consists all of the cases in all of
+the input files.
+
+ADD FILES shares the bulk of its syntax with other PSPP commands for
+combining multiple data files.  @xref{Combining Files Common Syntax},
+above, for an explanation of this common syntax.
+
+When BY is not used, the output of ADD FILES consists of all the cases
+from the first input file specified, followed by all the cases from
+the second file specified, and so on.  When BY is used, the output is
+additionally sorted on the BY variables.
+
+When ADD FILES creates an output case, variables that are not part of
+the input file from which the case was drawn are set to the
+system-missing value for numeric variables or spaces for string
+variables.
+
address@hidden MATCH FILES
address@hidden MATCH FILES
address@hidden MATCH FILES
+
address@hidden
+MATCH FILES
+
+Per input file:
+        /@{FILE,address@hidden@{*,'file-name'@}
+        [/RENAME=(src_names=target_names)@dots{}]
+        [/IN=var_name]
+        [/SORT]
+
+Once per command:
+        /BY var_list[(@{D|address@hidden 
[var_list[(@{D|address@hidden)address@hidden
+        [/DROP=var_list]
+        [/KEEP=var_list]
+        [/FIRST=var_name]
+        [/LAST=var_name]
+        [/MAP]
address@hidden display
+
address@hidden FILES} merges sets of corresponding cases in multiple
+input files into single cases in the output, combining their data.
+
+MATCH FILES shares the bulk of its syntax with other PSPP commands for
+combining multiple data files.  @xref{Combining Files Common Syntax},
+above, for an explanation of this common syntax.
+
+How MATCH FILES matches up cases from the input files depends on
+whether BY is specified:
+
address@hidden @bullet
address@hidden
+If BY is not used, MATCH FILES combines the first case from each input
+file to produce the first output case, then the second case from each
+input file for the second output case, and so on.  If some input files
+have fewer cases than others, then the shorter files do not contribute
+to cases output after their input has been exhausted.
+
address@hidden
+If BY is used, MATCH FILES combines cases from each input file that
+have identical values for the BY variables.
+
+When BY is used, TABLE subcommands may be used to introduce @dfn{table
+lookup file}.  TABLE has same syntax as FILE, and the RENAME, IN, and
+SORT subcommands may follow a TABLE in the same way as a FILE.
+Regardless of the number of TABLEs, at least one FILE must specified.
+Table lookup files are treated in the same way as other input files
+for most purposes and, in particular, table lookup files must be
+sorted on the BY variables or the SORT subcommand must be specified
+for that TABLE.
+
+Cases in table lookup files are not consumed after they have been used
+once.  This means that data in table lookup files can correspond to
+any number of cases in FILE input files.  Table lookup files are
+analogous to lookup tables in traditional relational database systems.
+
+If a table lookup file contains more than one case with a given set of
+BY variables, only the first case is used.
address@hidden itemize
+
+When MATCH FILES creates an output case, variables that are only in
+files that are not present for the current case are set to the
+system-missing value for numeric variables or spaces for string
+variables.
+
address@hidden UPDATE
address@hidden UPDATE
address@hidden UPDATE
+
address@hidden
+UPDATE
+
+Per input file:
+        /address@hidden,'file-name'@}
+        [/RENAME=(src_names=target_names)@dots{}]
+        [/IN=var_name]
+        [/SORT]
+
+Once per command:
+        /BY var_list[(@{D|address@hidden)] 
[var_list[(@{D|address@hidden)address@hidden
+        [/DROP=var_list]
+        [/KEEP=var_list]
+        [/MAP]
address@hidden display
+
address@hidden updates a @dfn{master file} by applying modifications
+from one or more @dfn{transaction files}.  
+
+UPDATE shares the bulk of its syntax with other PSPP commands for
+combining multiple data files.  @xref{Combining Files Common Syntax},
+above, for an explanation of this common syntax.
+
+At least two FILE subcommands must be specified.  The first FILE
+subcommand names the master file, and the rest name transaction files.
+Every input file must either be sorted on the variables named on the
+BY subcommand, or the SORT subcommand must be used just after the FILE
+subcommand for that input file.
+
+UPDATE uses the variables specified on the BY subcommand, which is
+required, to attempt to match each case in a transaction file with a
+case in the master file:
+
address@hidden @bullet
address@hidden
+When a match is found, then the values of the variables present in the
+transaction file replace those variable's values in the new active
+file.  If there are matching cases in more than more transaction file,
+PSPP applies the replacements from the first transaction file, then
+from the second transaction file, and so on.  Similarly, if a single
+transaction file has cases with duplicate BY values, then those are
+applied in order to the master file.
+
+When a variable in a transaction file has a missing value or a string
+variable's value is all blanks, that value is never used to update the
+master file.
+
address@hidden
+If a case in the master file has no matching case in any transaction
+file, then it is copied unchanged to the output.
+
address@hidden
+If a case in a transaction file has no matching case in the master
+file, then it causes a new case to be added to the output, initialized
+from the values in the transaction file.
address@hidden itemize
diff --git a/doc/files.texi b/doc/files.texi
index ae10ec7..2fd9892 100644
--- a/doc/files.texi
+++ b/doc/files.texi
@@ -1,5 +1,5 @@
address@hidden System and Portable Files
address@hidden System Files and Portable Files
address@hidden System and Portable File IO
address@hidden System and Portable File I/O
 
 The commands in this chapter read, write, and examine system files and
 portable files.
@@ -10,7 +10,6 @@ portable files.
 * GET::                         Read from a system file.
 * GET DATA::                    Read from foreign files.
 * IMPORT::                      Read from a portable file.
-* MATCH FILES::                 Merge system files.
 * SAVE::                        Write to a system file.
 * SYSFILE INFO::                Display system file dictionary.
 * XEXPORT::                     Write to a portable file, as a transformation.
@@ -651,99 +650,6 @@ data is read later, when a procedure is executed.
 Use of @cmd{IMPORT} to read a system file or scratch file is a PSPP
 extension.
 
address@hidden MATCH FILES
address@hidden MATCH FILES
address@hidden MATCH FILES
-
address@hidden
-MATCH FILES
-        /@{FILE,address@hidden@{*,'file-name'@}
-        /RENAME=(src_names=target_names)@dots{}
-        /IN=var_name
-
-        /BY=var_list
-        /DROP=var_list
-        /KEEP=var_list
-        /FIRST=var_name
-        /LAST=var_name
-        /MAP
address@hidden display
-
address@hidden FILES} merges one or more system, portable, or scratch files,
-optionally
-including the active file.  Cases with the same values for BY
-variables are combined into a single case.  Cases with different
-values are output in order.  Thus, multiple sorted files are
-combined into a single sorted file based on the value of the BY
-variables.  The results of the merge become the new active file.
-
-Specify FILE with a system, portable, or scratch file as a file name
-string or file handle
-(@pxref{File Handles}), or with an asterisk (@samp{*}) to
-indicate the current active file.  The files specified on FILE are
-merged together based on the BY variables, or combined case-by-case if
-BY is not specified.
-
-Specify TABLE with a file to use it as a @dfn{table
-lookup file}.  Cases in table lookup files are not used up after
-they've been used once.  This means that data in table lookup files can
-correspond to any number of cases in FILE files.  Table lookup files
-correspond to lookup tables in traditional relational database systems.
-If a table lookup file contains more than one case with a given set of
-BY variables, only the first case is used.
-
-Any number of FILE and TABLE subcommands may be specified.
-Ordinarily, at least two FILE subcommands, or one FILE and at least
-one TABLE, should be specified.  Each instance of FILE or TABLE can be
-followed by any sequence of RENAME subcommands.  These have the same
-form and meaning as the corresponding subcommands of @cmd{GET}
-(@pxref{GET}), but apply only to variables in the given file.
-
-Each FILE or TABLE may optionally be followed by an IN subcommand,
-which creates a numeric variable with the specified name and format
-F1.0.  The IN variable takes value 1 in a case if the given file
-contributed a row to the merged file, 0 otherwise.  The DROP, KEEP,
-and RENAME subcommands do not affect IN variables.
-
-When more than one FILE or TABLE contains a variable with a given
-name, those variables must all have the same type (numeric or string)
-and, for string variables, the same width.  This rules applies to
-variable names after renaming with RENAME; thus, RENAME can be used to
-resolve conflicts.
-
-FILE and TABLE must be specified at the beginning of the command, with
-any RENAME or IN specifications immediately after the corresponding
-FILE or TABLE.  These subcommands are followed by BY, DROP, KEEP,
-FIRST, LAST, and MAP.
-
-The BY subcommand specifies a list of variables that are used to match
-cases from each of the files.  When TABLE or IN is used, BY is
-required; otherwise, it is optional.  When BY is specified, all the
-files named on FILE and TABLE subcommands must be sorted in ascending
-order of the BY variables.  Variables belonging to files that are not
-present for the current case are set to the system-missing value for
-numeric variables or spaces for string variables.
-
-The DROP and KEEP subcommands allow variables to be dropped from or
-reordered within the new active file.  These subcommands have the same
-form and meaning as the corresponding subcommands of @cmd{GET}
-(@pxref{GET}).  They apply to the new active file as a whole, not to
-individual input files.  The variable names specified on DROP and KEEP
-are those after any renaming with RENAME.
-
-The optional FIRST and LAST subcommands name variables that @cmd{MATCH
-FILES} adds to the active file.  The new variables are numeric with
-print and write format F1.0.  The value of the FIRST variable is 1 in
-the first case with a given set of values for the BY variables, and 0
-in other cases.  Similarly, the LAST variable is 1 in the last case
-with a given of BY values, and 0 in other cases.
-
address@hidden FILES} may not be specified following @cmd{TEMPORARY}
-(@pxref{TEMPORARY}) if the active file is used as an input source.
-
-Use of portable or scratch files on @cmd{MATCH FILES} is a PSPP
-extension.
-
 @node SAVE
 @section SAVE
 @vindex SAVE
diff --git a/doc/pspp.texinfo b/doc/pspp.texinfo
index 6171df6..975ee66 100644
--- a/doc/pspp.texinfo
+++ b/doc/pspp.texinfo
@@ -70,7 +70,8 @@ modify this GNU manual.''
 * Expressions::                 Numeric and string expression syntax.
 
 * Data Input and Output::       Reading data from user files.
-* System and Portable Files::   Dealing with system & portable files.
+* System and Portable File IO:: Reading and writing system & portable files.
+* Combining Data Files::        Combining data from multiple files.
 * Variable Attributes::         Adjusting and examining variables.
 * Data Manipulation::           Simple operations on data.
 * Data Selection::              Select certain cases for analysis.
@@ -98,6 +99,7 @@ modify this GNU manual.''
 @include expressions.texi
 @include data-io.texi
 @include files.texi
address@hidden combining.texi
 @include variables.texi
 @include transformation.texi
 @include data-selection.texi

-- 
"To the engineer, the world is a toy box full of sub-optimized and
 feature-poor toys."
--Scott Adams
[Prev in Thread]
Current Thread
[Next in Thread]
documentation for ADD FILES and UPDATE, Ben Pfaff <=
- Re: documentation for ADD FILES and UPDATE, John Darrington, 2008/10/27
  - Re: documentation for ADD FILES and UPDATE, Ben Pfaff, 2008/10/27
Prev by Date: Re: criteria for version 1.0
Next by Date: Re: documentation for ADD FILES and UPDATE
Previous by thread: good sample data sets for use in documentation
Next by thread: Re: documentation for ADD FILES and UPDATE
Index(es):
- Date
- Thread