From: Mohammad Akhlaghi
Subject: [gnuastro-commits] master 0fc33aa: Match: now possible to append non-matching rows of second table
Date: Sat, 26 Jun 2021 23:06:04 -0400 (EDT)

branch: master
commit 0fc33aa1938bd5eab16b2ebd4151f81d4947d30f
Author: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Commit: Mohammad Akhlaghi <mohammad@akhlaghi.org>

    Match: now possible to append non-matching rows of second table
    
    Until now, when the input tables may have had repeated objects, the
    process of merging two catalogs first required calling Match with
    '--notmatched' to separate the objects of the second catalog that don't
    match, then appending the non-matching rows to the rows of the first. But
    this extra step was annoying/buggy! It would have been much easier to do
    this internally and in a single call to Match.
    
    With this commit, we use the existing '--outcols' option and allow it to be
    called with '--notmatched'. When they are called together like this, all
    the first input's rows will be included in the output, but the non-matching
    rows of the second input will also be added (for the desired columns).
    
    To allow this, it is necessary that both inputs have the columns given to
    '--outcols' and that each column has the same numeric data type as the
    corresponding column in the other catalog.
---
 NEWS              |  13 +++++
 bin/match/match.c | 162 +++++++++++++++++++++++++++++++++++++-----------------
 bin/match/ui.c    | 127 ++++++++++++++++++++++++------------------
 doc/gnuastro.texi |  43 +++++++++++++--
 4 files changed, 236 insertions(+), 109 deletions(-)
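
The row-wise merge described in the commit message can be illustrated with a
minimal, standalone sketch. It does not use the Gnuastro library: the
'struct column' type and the 'column_append_rows' function are hypothetical
names invented for this note, and the element width stands in for the
column's numeric data type. It only shows the idea (check that the types
agree, allocate room for both sets of rows, copy the first input and then the
second); it is not the code of this commit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one table column: 'size' elements of
   'width' bytes each ('width' plays the role of the numeric type). */
struct column
{
  void   *array;
  size_t  size;
  size_t  width;
};

/* Return a newly allocated column holding all the rows of 'a' followed
   by all the rows of 'b'. Returns NULL when the element widths differ
   (the analogue of a data-type mismatch) or when allocation fails. The
   caller frees 'out->array' and then 'out'. */
struct column *
column_append_rows(const struct column *a, const struct column *b)
{
  struct column *out;

  assert(a!=NULL && b!=NULL);

  /* Both columns must have the same element type. */
  if(a->width != b->width) return NULL;

  /* Allocate the output with room for the rows of both inputs. */
  out=malloc(sizeof *out);
  if(out==NULL) return NULL;
  out->width = a->width;
  out->size  = a->size + b->size;
  out->array = malloc(out->size ? out->size*out->width : 1);
  if(out->array==NULL) { free(out); return NULL; }

  /* Rows of the first input come first, then those of the second. */
  if(a->size)
    memcpy(out->array, a->array, a->size*a->width);
  if(b->size)
    memcpy((char *)out->array + a->size*a->width, b->array,
           b->size*b->width);

  return out;
}
```

In this sketch, 'b' is assumed to already contain only the non-matching rows
of the second input, so the output has 'a->size + b->size' rows.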

diff --git a/NEWS b/NEWS
index c076db2..185ec35 100644
--- a/NEWS
+++ b/NEWS
@@ -11,6 +11,13 @@ See the end of the file for license conditions.
    - New operands (also available in Table's column arithmetic):
      - box-around-ellipse: width and height of the box covering an ellipse.
 
+  Match:
+   - When called with '--notmatched --outcols=AAA,BBB', Match will now
+     append non-matching rows of second table into first table's rows (for
+     columns 'AAA' and 'BBB' in example above). This allows easy/clean
+     merging of two catalogs that may have some repetition (the output will
+     only contain unique rows). See description of '--outcols' for more.
+
   Library:
    - Arithmetic macros:
      - GAL_ARITHMETIC_OP_BOX_AROUND_ELLIPSE
@@ -19,6 +26,12 @@ See the end of the file for license conditions.
 
 ** Changed features
 
+  Match:
+   - The '--notmatched' and '--outcols' options can now be called together
+     (to create a single catalog that appends the non-matching rows of the
+     second table to the rows of the first). Until this version, this would
+     cause an error.
+
 ** Bugs fixed
   bug #60725: MakeCatalog doesn't put comment on --halfsumsb column.
   bug #60776: Radial profile script not using standard deviation image,
diff --git a/bin/match/match.c b/bin/match/match.c
index ec23d1c..b0a45fd 100644
--- a/bin/match/match.c
+++ b/bin/match/match.c
@@ -156,10 +156,8 @@ match_catalog_read_write_all(struct matchparams *p, size_t *permutation,
                              size_t **numcolmatch)
 {
   int hasall=0;
-  size_t origsize;
   gal_data_t *tmp, *cat;
   gal_list_str_t *cols, *tcol;
-  gal_list_void_t *arrays=NULL;
 
   char *hdu              = (f1s2==1) ? p->cp.hdu     : p->hdu2;
   gal_list_str_t *incols = (f1s2==1) ? p->acols      : p->bcols;
@@ -190,7 +188,6 @@ match_catalog_read_write_all(struct matchparams *p, size_t *permutation,
       else
         cols=incols;
 
-
       /* When the output contains columns from both inputs, we need to keep
          the number of columns matched against each column identifier. */
       *numcolmatch=gal_pointer_allocate(GAL_TYPE_SIZE_T,
@@ -208,36 +205,40 @@ match_catalog_read_write_all(struct matchparams *p, size_t *permutation,
                        p->cp.quietmmap, *numcolmatch);
   else
     cat=match_cat_from_coord(p, cols, *numcolmatch);
-  origsize = cat ? cat->size : 0;
-
 
   /* Go over each column and permute its contents. */
   if(permutation)
-    for(tmp=cat; tmp!=NULL; tmp=tmp->next)
-      {
-        /* Do the permutation. */
-        gal_permutation_apply(tmp, permutation);
-
-        /* Correct the size of the array so only the matching columns are
-           saved as output. This is only Gnuastro's convention, it has no
-           effect on later freeing of the array in the memory. */
-        if(p->notmatched)
+    {
+      /* When we are in no-match AND outcols mode, we don't need to touch
+         the rows of the first input catalog (we want all of them) */
+      if( (p->notmatched && p->outcols && f1s2==1) == 0 )
+        for(tmp=cat; tmp!=NULL; tmp=tmp->next)
           {
-            /* Add the original array pointer to a list (we need to reset it
-               later). */
-            gal_list_void_add(&arrays, tmp->array);
-
-            /* Reset the data structure's array element to start where the
-               non-matched elements start. */
-            tmp->array=gal_pointer_increment(tmp->array, nummatched,
-                                             tmp->type);
-
-            /* Correct the size of the tile. */
-            tmp->size = tmp->dsize[0] = tmp->size - nummatched;
+            /* Do the permutation. */
+            gal_permutation_apply(tmp, permutation);
+
+            /* Correct the size of the array so only the
+               matching/no-matching columns are saved as output. Note that
+               the 'size' element is only for Gnuastro, it has no effect on
+               later freeing of the array in the memory (we are not
+               'realloc'ing). */
+            if(p->notmatched)
+              {
+                /* Move the non-matched rows after permutation to the top
+                   set of rows. */
+                memmove(tmp->array,
+                        gal_pointer_increment(tmp->array, nummatched,
+                                              tmp->type),
+                        (tmp->size-nummatched)*gal_type_sizeof(tmp->type));
+
+                /* Correct the size of the tile. */
+                tmp->size = tmp->dsize[0] = tmp->size - nummatched;
+              }
+            else
+              tmp->size = tmp->dsize[0] = nummatched;
           }
-        else
-          tmp->size=tmp->dsize[0]=nummatched;
-      }
+    }
+
   /* If no match was found ('permutation==NULL'), and the matched columns
      are requested, empty all the columns that are to be written (only
      keeping the meta-data). */
@@ -262,30 +263,81 @@ match_catalog_read_write_all(struct matchparams *p, size_t *permutation,
       gal_table_write(cat, NULL, NULL, p->cp.tableformat, outname,
                       extname, 0);
 
-      /* Correct arrays and sizes (when 'notmatched' was called). The
-         'array' element has to be corrected for later freeing.
+      /* Clean up. */
+      gal_list_data_free(cat);
+    }
 
-         IMPORTANT: '--notmatched' cannot be called with '--outcols'. So
-         you don't have to worry about the checks here being done later. */
-      if(p->notmatched)
-        {
-          /* Reverse the list of array pointers to write them back in. */
-          gal_list_void_reverse(&arrays);
+  return NULL;
+}
 
-          /* Correct the array and size pointers. */
-          for(tmp=cat; tmp!=NULL; tmp=tmp->next)
-            {
-              tmp->array=gal_list_void_pop(&arrays);
-              tmp->size=tmp->dsize[0]=origsize;
-              tmp->block=NULL;
-            }
+
+
+
+
+/* When merging is to be done by rows (the non-matched rows of the second
+   catalog get merged into the first for the same columns). */
+static void
+match_catalog_write_one_row(struct matchparams *p, gal_data_t *a,
+                            gal_data_t *b)
+{
+  size_t dsize=a->size+b->size;
+  gal_data_t *ta, *tb, *cat=NULL;
+
+  /* A small sanity check. */
+  if( gal_list_data_number(a) != gal_list_data_number(b) )
+    error(EXIT_FAILURE, 0, "%s: a bug! Please contact us at '%s' to "
+          "fix it. The number of columns in the two catalogs are not "
+          "equal (%zu and %zu respectively)", __func__,
+          PACKAGE_BUGREPORT, gal_list_data_number(a),
+          gal_list_data_number(b));
+
+  /* Check if there is actually any row to add? */
+  if(b->size>0)
+    {
+      /* Go over the columns of the first and make the final output columns
+         with new sizes, but same types and metadata as the first input.*/
+      tb=b;
+      for(ta=a; ta!=NULL; ta=ta->next)
+        {
+          /* Make sure both have the same type. */
+          if(ta->type!=tb->type)
+            error(EXIT_FAILURE, 0, "when '--notmatched' and '--outcols' "
+                  "are used together, each column given to '--outcols' "
+                  "must have the same datatype in both tables. However, "
+                  "the first input has a type of '%s' for one of the "
+                  "columns, while the second has a type of '%s'",
+                  gal_type_name(ta->type, 1), gal_type_name(tb->type, 1));
+
+          /* Allocate the necessary space. */
+          gal_list_data_add_alloc(&cat, NULL, ta->type, ta->ndim,
+                                  &dsize, NULL, 0, p->cp.minmapsize,
+                                  p->cp.quietmmap, ta->name, ta->unit,
+                                  ta->comment);
+
+          /* Copy the data of the first input in output. */
+          memcpy(cat->array, ta->array,
+                 ta->size*gal_type_sizeof(ta->type));
+
+          /* Copy the data of the second input in output. */
+          memcpy(gal_pointer_increment(cat->array, ta->size, cat->type),
+                 tb->array, tb->size*gal_type_sizeof(tb->type));
+
+          /* Increment 'tb'. */
+          tb=tb->next;
         }
 
-      /* Clean up. */
+      /* Reverse the table and write it out. */
+      gal_list_data_reverse(&cat);
+      gal_table_write(cat, NULL, NULL, p->cp.tableformat, p->out1name,
+                      "MATCHED", 0);
       gal_list_data_free(cat);
     }
 
-  return NULL;
+  /* There wasn't any row to add, just write the 'a' columns and don't free
+     it ('a' will be freed in the higher-level function). */
+  else
+    gal_table_write(a, NULL, NULL, p->cp.tableformat, p->out1name,
+                    "MATCHED", 0);
 }
 
 
@@ -295,12 +347,13 @@ match_catalog_read_write_all(struct matchparams *p, size_t *permutation,
 /* When specific columns from both inputs are requested, this function
    will write them out into a single table. */
 static void
-match_catalog_write_one(struct matchparams *p, gal_data_t *a, gal_data_t *b,
-                        size_t *acolmatch, size_t *bcolmatch)
+match_catalog_write_one_col(struct matchparams *p, gal_data_t *a,
+                            gal_data_t *b, size_t *acolmatch,
+                            size_t *bcolmatch)
 {
   gal_data_t *cat=NULL;
-  size_t i, j, k, ac=0, bc=0, npop;
   char **strarr=p->outcols->array;
+  size_t i, j, k, ac=0, bc=0, npop;
 
   /* Go over the initial list of strings. */
   for(i=0; i<p->outcols->size; ++i)
@@ -343,6 +396,7 @@ match_catalog_write_one(struct matchparams *p, gal_data_t *a, gal_data_t *b,
   gal_list_data_reverse(&cat);
   gal_table_write(cat, NULL, NULL, p->cp.tableformat, p->out1name,
                   "MATCHED", 0);
+  gal_list_data_free(cat);
 }
 
 
@@ -382,12 +436,22 @@ match_catalog(struct matchparams *p)
       if(p->outcols)
         {
           /* Arrange the columns and write the output. */
-          match_catalog_write_one(p, a, b, acolmatch, bcolmatch);
+          if(p->notmatched)
+            match_catalog_write_one_row(p, a, b);
+          else
+            {
+              match_catalog_write_one_col(p, a, b, acolmatch, bcolmatch);
+              a=b=NULL; /*They are freed in function above. */
+            }
 
           /* Clean up. */
           if(acolmatch) free(acolmatch);
           if(bcolmatch) free(bcolmatch);
         }
+
+      /* Clean up. */
+      if(a) gal_list_data_free(a);
+      if(b) gal_list_data_free(b);
     }
 
   /* Write the raw information in a log file if necessary.  */
diff --git a/bin/match/ui.c b/bin/match/ui.c
index 01b6ca4..ce74de2 100644
--- a/bin/match/ui.c
+++ b/bin/match/ui.c
@@ -212,16 +212,13 @@ parse_opt(int key, char *arg, struct argp_state *state)
 /***************       Sanity Check         *******************/
 /**************************************************************/
 /* Read and check ONLY the options. When arguments are involved, do the
-   check in 'ui_check_options_and_arguments'. */
+   check in 'ui_check_options_and_arguments'.
 static void
 ui_read_check_only_options(struct matchparams *p)
 {
-  if(p->outcols && p->notmatched)
-    error(EXIT_FAILURE, 0, "'--outcols' and '--notmatched' cannot be called "
-          "at the same time. The former is only for cases when the matches "
-          "are required");
-}
 
+}
+*/
 
 
 
@@ -794,56 +791,71 @@ ui_preparations_out_cols(struct matchparams *p)
      proper list. */
   for(i=0;i<p->outcols->size;++i)
     {
+      /* For easy reading. */
       col=strarr[i];
-      switch(col[0])
+
+      /* In no-match mode, the same column will be used from both
+         catalogs so things are easier. */
+      if(p->notmatched)
         {
-        case 'a': gal_list_str_add(&p->acols, col+1, 0); break;
-        case 'b':
-          /* With '--coord', only numbers that are smaller than the number
-             of the dimensions are acceptable. */
-          if(p->coord)
-            {
-              goodvalue=0;
-              rptr=gal_type_string_to_number(col+1, &readtype);
-              if(rptr)
-                {
-                  read=gal_data_alloc(rptr, readtype, 1, &one, NULL, 0, -1,
-                                      1, NULL, NULL, NULL);
-                  if(gal_type_is_int(readtype))
-                    {
-                      read=gal_data_copy_to_new_type_free(read,GAL_TYPE_LONG);
-                      if( *((long *)(read->array)) <= ndim )
-                        goodvalue=1;
-                    }
-                  gal_data_free(read);
-                }
-              if(goodvalue==0)
-                error(EXIT_FAILURE, 0, "bad value to second catalog "
-                      "column (%s) of '--outcols'.\n\n"
-                      "With the '--coord' option, the second catalog is "
-                      "assumed to have a single row and the given number "
-                      "of columns. Therefore when using '--outcols', only "
-                      "integers that are less than the number of "
-                      "dimensions (%zu in this case) are acceptable", col+1,
-                      ndim);
-            }
-          gal_list_str_add(&p->bcols, col+1, 0);
-          break;
-        default:
-          error(EXIT_FAILURE, 0, "'%s' is not a valid value for "
-                "'--outcols'.\n\n"
-                "The first character of each value to this option must be "
-                "either 'a' or 'b'. The former specifies a column from the "
-                "first input and the latter a column from the second. The "
-                "characters after them can be any column identifier (number, "
-                "name, or regular expression). For more on column selection, "
-                "please run this command:\n\n"
-                "    $ info gnuastro \"Selecting table columns\"\n",
-                col);
+          gal_list_str_add(&p->acols, col, 0);
+          gal_list_str_add(&p->bcols, col, 0);
         }
+
+      /* In match mode, we need to know which column should come from which
+         catalog. */
+      else
+        switch(col[0])
+          {
+          case 'a': gal_list_str_add(&p->acols, col+1, 0); break;
+          case 'b':
+            /* With '--coord', only numbers that are smaller than the
+               number of the dimensions are acceptable. */
+            if(p->coord)
+              {
+                goodvalue=0;
+                rptr=gal_type_string_to_number(col+1, &readtype);
+                if(rptr)
+                  {
+                    read=gal_data_alloc(rptr, readtype, 1, &one, NULL, 0,
+                                        -1, 1, NULL, NULL, NULL);
+                    if(gal_type_is_int(readtype))
+                      {
+                        read=gal_data_copy_to_new_type_free(read,
+                                                            GAL_TYPE_LONG);
+                        if( *((long *)(read->array)) <= ndim )
+                          goodvalue=1;
+                      }
+                    gal_data_free(read);
+                  }
+                if(goodvalue==0)
+                  error(EXIT_FAILURE, 0, "bad value to second catalog "
+                        "column (%s) of '--outcols'.\n\n"
+                        "With the '--coord' option, the second catalog "
+                        "is assumed to have a single row and the given "
+                        "number of columns. Therefore when using "
+                        "'--outcols', only integers that are less than "
+                        "the number of dimensions (%zu in this case) "
+                        "are acceptable", col+1, ndim);
+              }
+            gal_list_str_add(&p->bcols, col+1, 0);
+            break;
+          default:
+            error(EXIT_FAILURE, 0, "'%s' is not a valid value for "
+                  "'--outcols'.\n\n"
+                  "The first character of each value to this option "
+                  "must be either 'a' or 'b'. The former specifies a "
+                  "column from the first input and the latter a "
+                  "column from the second. The characters after them "
+                  "can be any column identifier (number, name, or "
+                  "regular expression). For more on column selection, "
+                  "please run this command:\n\n"
+                  "    $ info gnuastro \"Selecting table columns\"\n",
+                  col);
+          }
     }
 
-  /* Revere the lists so they correspond to the input order. */
+  /* Reverse the lists so they correspond to the input order. */
   gal_list_str_reverse(&p->acols);
   gal_list_str_reverse(&p->bcols);
 }
@@ -969,12 +981,13 @@ ui_preparations(struct matchparams *p)
 
   /* Currently Match only works on catalogs. */
   if(p->mode==MATCH_MODE_WCS)
-    error(EXIT_FAILURE, 0, "currently Match only works on catalogs, we will "
-          "implement the WCS matching routines later");
+    error(EXIT_FAILURE, 0, "currently Match only works on catalogs, "
+          "we will implement the WCS matching routines later");
   else
     {
       ui_read_columns(p);
-      if(p->outcols) ui_preparations_out_cols(p);
+      if(p->outcols)
+        ui_preparations_out_cols(p);
     }
 
   /* Set the output filename. */
@@ -1034,8 +1047,9 @@ ui_read_check_inputs_setup(int argc, char *argv[], struct matchparams *p)
 
 
   /* Read the options into the program's structure, and check them and
-     their relations prior to printing. */
+     their relations prior to printing.
   ui_read_check_only_options(p);
+  */
 
 
   /* Print the option values if asked. Note that this needs to be done
@@ -1093,8 +1107,11 @@ ui_free_report(struct matchparams *p, struct timeval *t1)
   free(p->cp.output);
   gal_data_free(p->ccol1);
   gal_data_free(p->ccol2);
+  gal_data_free(p->outcols);
   gal_list_data_free(p->cols1);
   gal_list_data_free(p->cols2);
+  gal_list_str_free(p->acols, 0);
+  gal_list_str_free(p->bcols, 0);
   gal_list_str_free(p->stdinlines, 1);
 
   /* Print the final message.
diff --git a/doc/gnuastro.texi b/doc/gnuastro.texi
index 1cfa4c5..a97e146 100644
--- a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ -18848,6 +18848,14 @@ $ astmatch --aperture=2 input1.txt input2.fits
 $ astmatch --aperture=2 input1.txt input2.fits                   \
            --outcols=a1,aRA,aDEC,b/^MAG/,bBRG,a10
 
+## Assuming both inputs have the same column metadata (same name
+## and numeric type), the output will contain all the rows of the
+## first input, appended with the non-matching rows of the second
+## input (good when you need to merge multiple catalogs that
+## may have matching items, which you don't want to repeat).
+$ astmatch input1.fits input2.fits --ccol1=RA,DEC --ccol2=RA,DEC \
+           --aperture=1/3600 --notmatched --outcols=_all
+
 ## Match the two catalogs within an elliptical aperture of 1 and 2
 ## arc-seconds along RA and Dec respectively.
 $ astmatch --aperture=1/3600,2/3600 in1.fits in2.txt
@@ -18906,9 +18914,15 @@ The extension/HDU of the second input if it is a FITS file.
 When it isn't a FITS file, this option's value is ignored.
 For the first input, the common option @option{--hdu} must be used.
 
-@item --outcols=STR
+@item --outcols=STR[,STR,[...]]
 Columns (from both inputs) to write into a single matched table output.
-The value to @code{--outcols} must be a comma-separated list of strings.
+The value to @code{--outcols} must be a comma-separated list of column identifiers (number or name, see @ref{Selecting table columns}).
+The expected format depends on @option{--notmatched} and is explained below.
+By default (when @option{--notmatched} is not called), the number of rows in the output will be equal to the number of matches.
+However, when @option{--notmatched} is called, all the rows (from the requested columns) of the first input are placed in the output, and the not-matched rows of the second input are inserted afterwards (useful when you want to merge unique entries of multiple catalogs into one).
+
+@table @asis
+@item Default (only matching rows)
 The first character of each string specifies the input catalog: @option{a} for the first and @option{b} for the second.
 The rest of the characters of the string will be directly used to identify the proper column(s) in the respective table.
 See @ref{Selecting table columns} for how columns can be specified in Gnuastro.
@@ -18935,6 +18949,25 @@ So column names aren't defined and you can only request integer column numbers t
 For example if you want to find the row matching RA of 1.2345 and Dec of 6.7890, then you should use @option{--coord=1.2345,6.7890}.
 But when using @option{--outcols}, you can't give @code{bRA}, or @code{b25}.
 
+@item With @option{--notmatched}
+Only the column names/numbers should be given (for example @option{--outcols=RA,DEC,MAGNITUDE}).
+It is assumed that both input tables have the requested column(s) and that the numerical data type of each column in one input is the same as that of the corresponding column (with the same name) in the other.
+Therefore if one input has a @code{MAGNITUDE} column with a 32-bit floating point type, but the @code{MAGNITUDE} column of the other is 64-bit floating point, Match will abort with an error.
+The metadata of the columns will come from the first input.
+
+As an example, let's assume @file{input1.txt} and @file{input2.fits} each have a different number of columns and rows.
+However, they both have the @code{RA} (64-bit floating point), @code{DEC} (64-bit floating point) and @code{MAGNITUDE} (32-bit floating point) columns.
+If @file{input1.txt} has 100 rows and @file{input2.fits} has 300 rows (such that 50 of them match within 1 arcsec of the first), then the output of the command below will have @mymath{100+(300-50)=350} rows and only three columns.
+Other columns in each catalog, which may be different, are ignored.
+
+@example
+$ astmatch input1.txt  --ccol1=RA,DEC \
+           input2.fits --ccol2=RA,DEC \
+           --aperture=1/3600 \
+           --notmatched --outcols=RA,DEC,MAGNITUDE
+@end example
+@end table
+
 @item -l
 @itemx --logasoutput
 The output file will have the contents of the log file: indexes in the two catalogs that match with each other along with their distance, see description of the log file above.
@@ -18943,9 +18976,9 @@ When this option is called, a separate log file will not be created and the outp
 
 @item --notmatched
 Write the non-matching rows into the outputs, not the matched ones.
-Note that with this option, the two output tables will not necessarily have the same number of rows.
-Therefore, this option cannot be called with @option{--outcols}.
-@option{--outcols} prints mixed columns from both inputs, so they must all have the same number of elements and must correspond to each other.
+By default, this will produce two output tables that will not necessarily have the same number of rows.
+However, when called with @option{--outcols}, it is possible to import non-matching rows of the second into the first.
+See the description of @option{--outcols} for more.
 
 @item -c INT/STR[,INT/STR]
 @itemx --ccol1=INT/STR[,INT/STR]
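
The '--notmatched' handling in 'bin/match/match.c' above moves the
non-matched rows of a permuted column to the top of its array, and the
documentation above counts the rows of the merged output as 100+(300-50)=350.
Both steps can be sketched in plain C, without any Gnuastro calls. The
function names 'keep_notmatched' and 'merged_rows' are hypothetical, and this
sketch makes its own choices: it uses 'memmove' because the source and
destination regions can overlap, and it moves 'total-nummatched' elements
(the rows that remain after the matched ones).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 'array' holds 'total' elements of 'width' bytes each, already ordered
   so that the first 'nummatched' elements are the matched rows. Move
   the remaining (non-matched) elements to the start of the array and
   return how many there are. The allocation itself is not changed. */
size_t
keep_notmatched(void *array, size_t total, size_t nummatched, size_t width)
{
  size_t nleft;

  assert(array!=NULL && nummatched<=total);
  nleft = total - nummatched;

  /* The two regions overlap when 'nleft > nummatched', so 'memmove'
     is used instead of 'memcpy'. */
  memmove(array, (char *)array + nummatched*width, nleft*width);

  return nleft;
}

/* Number of rows in the merged output: every row of the first input,
   plus the rows of the second input that did not match. */
size_t
merged_rows(size_t nrows1, size_t nrows2, size_t nummatched)
{
  assert(nummatched<=nrows2);
  return nrows1 + (nrows2 - nummatched);
}
```

With the numbers of the documentation example, 'merged_rows(100, 300, 50)'
gives 350.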