gnuastro-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[gnuastro-commits] master 48d760d 12/19: Library (match.h): k-d tree met


From: Mohammad Akhlaghi
Subject: [gnuastro-commits] master 48d760d 12/19: Library (match.h): k-d tree method discards regions with no overlap
Date: Sun, 14 Nov 2021 20:40:59 -0500 (EST)

branch: master
commit 48d760dd6fa3b335bf19d0296a0917c04ca7fe8e
Author: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Commit: Mohammad Akhlaghi <mohammad@akhlaghi.org>

    Library (match.h): k-d tree method discards regions with no overlap
    
    Until now, the k-d method would attempt to parse the full k-d of the first
    catalog for each item of the second catalog. When the two catalogs
    spatially overlap significantly this is fine (because the k-d tree will
    find the nearest neighbor pretty fast and we can reject it if it is out of
    the user's given aperture).
    
    However, when the two datasets don't significantly overlap (and catalog B
    covers a much larger spatial area than catalog A), this can be very
    inefficient. Because the k-d tree parsing algorithm will be forced to parse
    the whole tree for every point! I encountered such a case while testing the
    Mach program on some large tables (390165 and 5773552 rows respectively)
    provided by Andres Del Pino Molina.
    
    With this commit, a solution to this problem has been implemented: before
    attempting the parallelization of the B catalog rows, we will first obtain
    A's histograms in each dimention. Later, when parsing the rows of catalog
    B, we will first check if the point in B actually falls in a non-zero bin
    of the N-dimensional histogram. If it doesn't we won't even bother calling
    the k-d tree algorithm. The discussion on the matching algorithms in the
    book now also mentions this new aspect of the k-d tree matching.
    
    This resulted in a MAJOR speed improvement for the catalogs given by
    Andres: from about +20 minutes (we stopped it manually!), to less than 20
    seconds!
---
 THANKS                       |   1 +
 bin/match/ui.c               |  13 ++--
 doc/announce-acknowledge.txt |   1 +
 doc/gnuastro.texi            |  82 ++++++++++++++--------
 lib/match.c                  | 164 ++++++++++++++++++++++++++++++++++---------
 5 files changed, 193 insertions(+), 68 deletions(-)

diff --git a/THANKS b/THANKS
index 5220023..1d27c4c 100644
--- a/THANKS
+++ b/THANKS
@@ -36,6 +36,7 @@ support in Gnuastro. The list is ordered alphabetically (by 
family name).
     Nushkia Chamba                       chamba@iac.es
     Benjamin Clement                     benjamin.clement@univ-lyon1.fr
     Nima Dehdilani                       nimadehdilani@gmail.com
+    Andres Del Pino Molina               adelpino@cefca.es
     Antonio Diaz Diaz                    antonio@gnu.org
     Alexey Dokuchaev                     danfe@freebsd.org
     Pierre-Alain Duc                     pierre-alain.duc@astro.unistra.fr
diff --git a/bin/match/ui.c b/bin/match/ui.c
index f260dba..69d5004 100644
--- a/bin/match/ui.c
+++ b/bin/match/ui.c
@@ -273,7 +273,7 @@ ui_check_options_and_arguments(struct matchparams *p)
       /* Make sure no second argument is given. */
       if(p->input2name)
         error(EXIT_FAILURE, 0, "only one argument can be given with the "
-              "'--coord' of '--kdtree=build' options");
+              "'--coord' or '--kdtree=build' options");
 
       /* In case '--ccol2' is given. */
       if(p->ccol2 && p->cp.quiet==0)
@@ -884,6 +884,7 @@ ui_read_columns(struct matchparams *p)
      than the first, print a warning to let the user know that the speed
      can be greatly improved if they swap the two. */
   if( !p->cp.quiet
+      && p->kdtreemode!=MATCH_KDTREE_BUILD
       && p->kdtreemode!=MATCH_KDTREE_DISABLE
       && p->cols1->size > p->cols2->size )
     error(EXIT_SUCCESS, 0, "TIP: the matching speed will GREATLY IMPROVE "
@@ -1220,15 +1221,17 @@ ui_read_check_inputs_setup(int argc, char *argv[], 
struct matchparams *p)
                ? " (sort-based match only uses a single thread)" : ""));
       printf("  - Match algorithm: %s\n",
              p->kdtree ? "k-d tree" : "sort-based");
-      printf("  - Input-1: %s\n",
-             gal_fits_name_save_as_string(p->input1name, p->cp.hdu));
+      printf("  - Input-1: %s; %zu rows\n",
+             gal_fits_name_save_as_string(p->input1name, p->cp.hdu),
+             p->cols1->size);
       if(p->kdtreemode==MATCH_KDTREE_FILE)
         printf("  - Input-1 k-d tree: %s\n",
                gal_fits_name_save_as_string(p->kdtree, p->kdtreehdu));
       if(p->kdtreemode!=MATCH_KDTREE_BUILD)
-        printf("  - Input-2: %s\n",
+        printf("  - Input-2: %s; %zu rows\n",
                p->coord ? "from --coord"
-               : gal_fits_name_save_as_string(p->input2name, p->hdu2));
+               : gal_fits_name_save_as_string(p->input2name, p->hdu2),
+               p->cols2->size);
     }
 }
 
diff --git a/doc/announce-acknowledge.txt b/doc/announce-acknowledge.txt
index 0f53877..1878711 100644
--- a/doc/announce-acknowledge.txt
+++ b/doc/announce-acknowledge.txt
@@ -1,5 +1,6 @@
 Alphabetically ordered list to acknowledge in the next release.
 
+Andres Del Pino Molina
 Sepideh Eskandarlou
 Zahra Hosseini
 Raúl Infante-Sainz
diff --git a/doc/gnuastro.texi b/doc/gnuastro.texi
index 566de0e..89c917a 100644
--- a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ -20002,13 +20002,37 @@ The k-d tree concept is much more abstract, but 
powerful (for example enabling p
 In short a k-d tree is a partitioning of a k-dimentional space.
 For more on the k-d tree and Gnuastro's implementation of it, please see 
@ref{K-d tree}.
 
-To avoid getting into too much theory, let's look at it from a user's 
perspective: Match will internally construct a k-d tree for catalog A (the 
first catalog given to it).
-Optionally, you can also build the k-d tree of A with a separate call to Match 
(with @option{--kdtree=build} and @option{--output=mykdtree.fits}).
-This external k-d tree can be fed to Match later, when doing your matches (to 
avoid having to reconstruct it every time you want to match with the same first 
catalog; using @option{--kdtree=mykdtree.fits}).
-
-Each thread on your CPU will then be associated a list of rows from catalog B 
and the k-d tree is parsed independently for those to find the nearest element 
of A to that row in B.
+To avoid going too deep into theory, let's look at it from a user's 
perspective: the k-d tree of table A is another table with the same number of 
rows, but only two integer columns (the integers contain the row indexs of the 
left and right branch of that row).
+Match will internally construct a k-d tree for catalog A (the first catalog 
given to it) and use it for matching with catalog B.
+However, optionally, you can also build the k-d tree of A and save it into a 
file, with a separate call to Match (with @option{--kdtree=build} and 
@option{--output=mykdtree.fits}).
+This external k-d tree can be fed to Match later (to avoid having to 
reconstruct it every time you want to match with the same first catalog; using 
@option{--kdtree=mykdtree.fits}).
+
+For each point in catalog B, parsing the k-d tree will enable discovery of the 
nearest point in catalog A.
+But the advantage (compared to parsing from the first element of A to the 
last, for the nearest point) is that the k-d tree will find the nearest item 
with much fewer (statistically) checks.
+
+Therefore, the rows of catalog B will be distributed within the T threads of 
your CPU.
+The k-d tree is parsed independently for each row of B in each thread, to find 
the nearest element of A to that row in B.
+Once all threads are finished the inverse checking process mentioned above 
will be implemented to find the best matchs.
+
+There is just one technical issue however: when there is no neighbor within 
the acceptable distance of the k-d tree, it is forced to parse all elements.
+Therefore if one catalog only overlaps with the larger catalog in a small 
area, the algorithm will be forced to parse the full k-d tree for teh majority 
of points!
+This will dramatically decrease the running speed of Match.
+Therefore, Match first divides the range of the first input in all its 
dimensions into 1000 bins (similar to a histogram), and will only do the k-d 
tree based search when the point in catalog B actually falls within that bin.
 @end table
 
+@noindent
+In summary, here are the take-home messages.
+@itemize
+@item
+For larger datasets, the k-d tree based method (when running on all threads 
possible) is much more faster than the classical sort-based method.
+@item
+The k-d tree is constructed for the first input table and the multi-threading 
is done on the rows of the second input table.
+The construction of a larger dataset's k-d tree will take longer, but 
multi-threading will work better when you have more rows.
+As a result, the optimal way to run match is to give the smaller table as the 
first argument (so its k-d tree is constructed), and the larger table as the 
second argument (so its rows are checked in parallel).
+@item
+If you always need to match against one catalog (that is large!), the k-d tree 
construction itself can take a significant fraction of the running time.
+Therefore you can save its k-d tree into a file and simply give it to later 
calls.
+@end itemize
 
 @node Invoking astmatch,  , Matching algorithms, Match
 @subsection Invoking Match
@@ -29696,66 +29720,62 @@ dataset. The flags have to be set after this function 
any way.
 @end deftypefun
 
 @deftypefun {gal_data_t *} gal_statistics_regular_bins (gal_data_t 
@code{*input}, gal_data_t @code{*inrange}, size_t @code{numbins}, double 
@code{onebinstart})
-Generate an array of regularly spaced elements as a 1D array (column) of
-type @code{double} (i.e., @code{float64}, it has to be double to account
-for small differences on the bin edges). The input arguments are described
-below
+Generate an array of regularly spaced elements as a 1D array (column) of type 
@code{double} (i.e., @code{float64}, it has to be double to account for small 
differences on the bin edges).
+The input arguments are described below
 
 @table @code
 @item input
-The dataset you want to apply the bins to. This is only necessary if the
-range argument is not complete, see below. If @code{inrange} has all the
-necessary information, you can pass a @code{NULL} pointer for this.
+The dataset you want to apply the bins to.
+This is only necessary if the range argument is not complete, see below.
+If @code{inrange} has all the necessary information, you can pass a 
@code{NULL} pointer for this.
 
 @item inrange
-This dataset keeps the desired range along each dimension of the input data
-structure, it has to be in @code{float} (i.e., @code{float32}) type.
+This dataset keeps the desired range along each dimension of the input data 
structure, it has to be in @code{float} (i.e., @code{float32}) type.
 
 @itemize
 @item
-If you want the full range of the dataset (in any dimensions, then just set
-@code{inrange} to @code{NULL} and the range will be specified from the
-minimum and maximum value of the dataset (@code{input} cannot be
-@code{NULL} in this case).
+If you want the full range of the dataset (in any dimensions, then just set 
@code{inrange} to @code{NULL} and the range will be specified from the minimum 
and maximum value of the dataset (@code{input} cannot be @code{NULL} in this 
case).
 
 @item
-If there is one element for each dimension in range, then it is viewed as a
-quantile (Q), and the range will be: `Q to 1-Q'.
+If there is one element for each dimension in range, then it is viewed as a 
quantile (Q), and the range will be: `Q to 1-Q'.
 
 @item
-If there are two elements for each dimension in range, then they are
-assumed to be your desired minimum and maximum values. When either of the
-two are NaN, the minimum and maximum will be calculated for it.
+If there are two elements for each dimension in range, then they are assumed 
to be your desired minimum and maximum values.
+When either of the two are NaN, the minimum and maximum will be calculated for 
it.
 @end itemize
 
 @item numbins
 The number of bins: must be larger than 0.
 
 @item onebinstart
-A desired value for onebinstart. Note that with this option, the bins won't
-start and end exactly on the given range values, it will be slightly
-shifted to accommodate this request.
+A desired value for onebinstart.
+Note that with this option, the bins won't start and end exactly on the given 
range values, it will be slightly shifted to accommodate this request.
+If you don't have any preference on where to start a bin, set this to NAN.
 @end table
 @end deftypefun
 
 
 @deftypefun {gal_data_t *} gal_statistics_histogram (gal_data_t @code{*input}, 
gal_data_t @code{*bins}, int @code{normalize}, int @code{maxone})
 @cindex Histogram
-Make a histogram of all the elements in the given dataset with bin values that 
are defined in the @code{inbins} structure (see 
@code{gal_statistics_regular_bins}, they currently have to be equally spaced).
-@code{inbins} is not mandatory, if you pass a @code{NULL} pointer, the bins 
structure will be built within this function based on the @code{numbins} input.
-As a result, when you have already defined the bins, @code{numbins} is not 
used.
+Make a histogram of all the elements in the given dataset with bin values that 
are defined in the @code{bins} structure (see 
@code{gal_statistics_regular_bins}, they currently have to be equally spaced).
+The retured histogram is a 1-D @code{gal_data_t} of type 
@code{GAL_TYPE_FLOAT32}, with the same number of elements as @code{bins}.
+For each bin, it will contain the number of input elements that fell inside of 
that bin.
 
 Let's write the center of the @mymath{i}th element of the bin array as 
@mymath{b_i}, and the fixed half-bin width as @mymath{h}.
 Then element @mymath{j} of the input array (@mymath{in_j}) will be counted in 
@mymath{b_i} if @mymath{(b_i-h) \le in_j < (b_i+h)}.
 However, if @mymath{in_j} is somewhere in the last bin, the condition changes 
to @mymath{(b_i-h) \le in_j \le (b_i+h)}.
 
-If @code{normalize!=0}, the histogram will be ``normalized'' such that the sum 
of the counts column will be one. In other words, all the counts in every bin 
will be divided by the total number of counts. If @code{maxone!=0}, the 
histogram's maximum count will be 1. In other words, the counts in every bin 
will be divided by the value of the maximum.
+If @code{normalize!=0}, the histogram will be ``normalized'' such that the sum 
of the counts column will be one.
+In other words, all the counts in every bin will be divided by the total 
number of counts.
+If @code{maxone!=0}, the histogram's maximum count will be 1.
+In other words, the counts in every bin will be divided by the value of the 
maximum.
+In both of these cases, the output dataset will have a @code{GAL_DATA_FLOAT32} 
datatype.
 @end deftypefun
 
 @deftypefun {gal_data_t *} gal_statistics_histogram2d (gal_data_t 
@code{*input}, gal_data_t @code{*bins})
 @cindex Histogram, 2D
 @cindex 2D histogram
-This function is very similar to @code{gal_statistics_histogram}, but will 
build a 2D histogram (count how many of the elements of @code{input} have a 
within a 2D box.
+This function is very similar to @code{gal_statistics_histogram}, but will 
build a 2D histogram (count how many of the elements of @code{input} are a 
within a 2D box.
 The bins comprising the first dimension of the 2D box are defined by 
@code{bins}.
 The bins of the second dimension are defined by @code{bins->next} (@code{bins} 
is a @ref{List of gal_data_t}).
 Both the @code{bin} and @code{bin->next} can be created with 
@code{gal_statistics_regular_bins}.
diff --git a/lib/match.c b/lib/match.c
index dbea82d..bc6bf40 100644
--- a/lib/match.c
+++ b/lib/match.c
@@ -37,6 +37,7 @@ along with Gnuastro. If not, see 
<http://www.gnu.org/licenses/>.
 #include <gnuastro/kdtree.h>
 #include <gnuastro/pointer.h>
 #include <gnuastro/threads.h>
+#include <gnuastro/statistics.h>
 #include <gnuastro/permutation.h>
 
 
@@ -986,7 +987,12 @@ struct match_kdtree_params
   /* Internal items. */
   double              *a[3];  /* Direct pointers to column arrays.    */
   double              *b[3];  /* Direct pointers to column arrays.    */
-  struct match_sfll  **bina; /* Second cat. items in first. */
+  struct match_sfll  **bina;  /* Second cat. items in first.          */
+  gal_data_t         *Abins;  /* Bins along each dimension of A.      */
+  gal_data_t        *Aexist;  /* Bins along each dimension of A.      */
+  double         *Abinwidth;  /* Width of bins along each dimension.  */
+  double              *Amin;  /* Minimum value of A along each dim.   */
+  double              *Amax;  /* Maximum value of A along each dim.   */
 };
 
 
@@ -996,7 +1002,9 @@ struct match_kdtree_params
 static void
 match_kdtree_sanity_check(struct match_kdtree_params *p)
 {
-  gal_data_t *tmp;
+  double *d;
+  gal_data_t *tmp, *hist, *bins;
+  size_t *s, *sf, dim, numbins=1000;
 
   /* Make sure all coordinates and the k-d tree have the same number of
      rows. */
@@ -1051,6 +1059,58 @@ match_kdtree_sanity_check(struct match_kdtree_params *p)
       p->b[1]=p->B->next->array;
       if( p->B->next->next ) p->b[2]=p->B->next->next->array;
     }
+
+  /* Allocate the space to keep the range of first input dimensions. */
+  p->Amin=gal_pointer_allocate(GAL_TYPE_FLOAT64, p->ndim, 0,
+                               __func__, "p->Amin");
+  p->Amax=gal_pointer_allocate(GAL_TYPE_FLOAT64, p->ndim, 0,
+                               __func__, "p->Amax");
+  p->Abinwidth=gal_pointer_allocate(GAL_TYPE_FLOAT64, p->ndim, 0,
+                                    __func__, "p->Abinwidth");
+
+  /* Find the bins of the first input along all its dimensions and select
+     those that contain data. This is very important in optimal k-d tree
+     based matching because confirming a non-match in a k-d tree is very
+     computationally expensive. */
+  dim=0;
+  p->Abins=p->Aexist=NULL;
+  for(tmp=p->A; tmp!=NULL; tmp=tmp->next)
+    {
+      /* Generate the histogram of elements in this dimension. */
+      bins=gal_statistics_regular_bins(tmp, NULL, numbins, NAN);
+      hist=gal_statistics_histogram(tmp, bins, 0, 0);
+
+      /* Set all histograms with atleast one element to 1 and convert it to
+         8-bit unsigned integer. */
+      sf = (s=hist->array) + hist->size; do *s=*s>0; while(++s<sf);
+      hist=gal_data_copy_to_new_type_free(hist, GAL_TYPE_UINT8);
+
+      /* Set the general bin properties along this dimension. */
+      d=bins->array;
+      p->Abinwidth[dim] = d[1]-d[0];
+      p->Amin[dim]      = d[ 0            ] - p->Abinwidth[dim]/2;
+      p->Amax[dim]      = d[ bins->size-1 ] + p->Abinwidth[dim]/2;
+
+      /* For a check:
+      size_t i;
+      double *d=bins->array;
+      uint8_t *u=hist->array;
+      for(i=0;i<bins->size;++i)
+        printf("%-15.8f%-15.8f%u\n", d[i]-p->Abinwidth[dim]/2,
+               d[i]+p->Abinwidth[dim]/2, u[i]);
+      printf("\n"); */
+
+      /* Add the hist and bins to the list for later. */
+      gal_list_data_add(&p->Abins, bins);
+      gal_list_data_add(&p->Aexist, hist);
+
+      /* Increment the dimensionality. */
+      ++dim;
+    }
+
+  /* Reverse the list to be in the proper dimensional order. */
+  gal_list_data_reverse(&p->Abins);
+  gal_list_data_reverse(&p->Aexist);
 }
 
 
@@ -1066,12 +1126,14 @@ match_kdtree_worker(void *in_prm)
   struct match_kdtree_params *p=(struct match_kdtree_params *)tprm->params;
 
   /* High level definitions. */
-  gal_data_t *ccol;
+  int inrange;
+  uint8_t *existA;
   double r, delta[3];
-  size_t i, j, ai, bi;
-  double *point=NULL, least_dist;
+  size_t i, j, ai, bi, h_i;
+  gal_data_t *ccol, *Aexist;
+  double po, *point=NULL, least_dist;
 
-  /* Allqocate space for all the matching points (based on the number of
+  /* Allocate space for all the matching points (based on the number of
      dimensions). */
   point=gal_pointer_allocate(GAL_TYPE_FLOAT64, p->ndim, 0, __func__,
                              "point");
@@ -1080,35 +1142,68 @@ match_kdtree_worker(void *in_prm)
      thread. */
   for(i=0; tprm->indexs[i] != GAL_BLANK_SIZE_T; ++i)
     {
-      /* Fill the 'point' for this thread (note that this is the index in
-         the second catalog, hence 'bi'). */
-      j=0;
+      /* Set the easy-to-read indexs: note that this is the index in the
+         second catalog, hence 'bi'. */
       bi = tprm->indexs[i];
+
+      /* Fill the 'point' for this thread, and check if each of its
+         dimensions fall within the coverage of A. */
+      j=0;
+      inrange=1;
+      Aexist=p->Aexist;
       for(ccol=p->B; ccol!=NULL; ccol=ccol->next)
-        point[ j++ ] = ((double *)(ccol->array))[ bi ];
-
-      /* Find the index of the nearest neighbor in the first catalog to
-         this point in the second catalog. */
-      ai = gal_kdtree_nearest_neighbour(p->A, p->A_kdtree,
-                                        p->kdtree_root, point,
-                                        &least_dist);
-
-      /* Make sure the matched point is within the given aperture (which
-         may be elliptical). If it isn't, then the reported index for this
-         element will be 'GAL_BLANK_SIZE_T'. */
-      for(j=0;j<p->ndim;++j)
-        delta[j]=p->b[j][bi] - p->a[j][ai];
-      r=match_distance(delta, p->iscircle, p->ndim, p->aperture,
-                       p->c, p->s);
-
-      /* If the radial distance is smaller than the radial measure, then
-         add this item to a match with 'ai'. */
-      if(r<p->aperture[0])
-        match_add_to_sfll(&p->bina[ai], bi, r);
+        {
+          if(inrange)
+            {
+              /* Fill the point location in this dimension and set the
+                 pointer. */
+              existA=Aexist->array;
+              po = point[j] = ((double *)(ccol->array))[ bi ];
+
+              /* Make sure it covers the range of A (following the same set
+                 of tests as in 'gal_statistics_histogram'). */
+              if( (    po >= p->Amin[j] - p->aperture[0] )
+                  && ( po <= p->Amax[j] + p->aperture[0] ) )
+                {
+                  h_i=(po-p->Amin[j])/p->Abinwidth[j];
+                  if( existA[ h_i - (h_i==p->Aexist->size ? 1 : 0) ] == 0 )
+                    inrange=0;
+                }
+              else
+                inrange=0;
+            }
 
-      /* For a check:
-      printf("first[%zu] matched with second[%zu]\n", ai, bi);
-      */
+          /* Increment the dimensionality counter. */
+          ++j;
+          Aexist=Aexist->next;
+        }
+
+      /* Continue with the match if the point is in-range. */
+      if(inrange)
+        {
+          /* Find the index of the nearest neighbor in the first catalog to
+             this point in the second catalog. */
+          ai = gal_kdtree_nearest_neighbour(p->A, p->A_kdtree,
+                                            p->kdtree_root, point,
+                                            &least_dist);
+
+          /* Make sure the matched point is within the given aperture
+             (which may be elliptical). If it isn't, then the reported
+             index for this element will be 'GAL_BLANK_SIZE_T'. */
+          for(j=0;j<p->ndim;++j)
+            delta[j]=p->b[j][bi] - p->a[j][ai];
+          r=match_distance(delta, p->iscircle, p->ndim, p->aperture,
+                           p->c, p->s);
+
+          /* If the radial distance is smaller than the radial measure,
+             then add this item to a match with 'ai'. */
+          if(r<p->aperture[0])
+            match_add_to_sfll(&p->bina[ai], bi, r);
+
+          /* For a check:
+             printf("first[%zu] matched with second[%zu]\n", ai, bi);
+          */
+        }
     }
 
   /* Clean up. */
@@ -1178,5 +1273,10 @@ gal_match_kdtree(gal_data_t *coord1, gal_data_t *coord2,
 
   /* Clean up and return. */
   free(p.bina);
+  free(p.Amin);
+  free(p.Amax);
+  free(p.Abinwidth);
+  gal_list_data_free(p.Abins);
+  gal_list_data_free(p.Aexist);
   return out;
 }



reply via email to

[Prev in Thread] Current Thread [Next in Thread]