gnuastro-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[gnuastro-commits] master 1bac6c4: Segment: will not consume excessive R


From: Mohammad Akhlaghi
Subject: [gnuastro-commits] master 1bac6c4: Segment: will not consume excessive RAM when there are many clumps
Date: Wed, 16 Dec 2020 00:03:44 -0500 (EST)

branch: master
commit 1bac6c43b30e1f3d5ab605b839cfd63d5013f44e
Author: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Commit: Mohammad Akhlaghi <mohammad@akhlaghi.org>

    Segment: will not consume excessive RAM when there are many clumps
    
    Until now, when the input detection map had many (on the scale of hundreds
    of thousands) clumps within one detection, it would consume an unreasonable
    amount of RAM (which could exceed even the nonvolatile memory on a system!
    on the order of 1.5 Terrabytes in one of our examples!).
    
    This happened when working on a wide field of view image near the disk of
    the milky way (which has MANY stars), we noticed that one detection
    included +600000 clumps and during its processing, Segment needed more than
    1Tb of space which wasn't available in the RAM. It then tried to make a
    memory-mapped file, but the storage wasn't sufficient to host that either
    and Segment crashed.
    
    The main problem was in the adjacency matrix step of Segment: until now if
    there were N clumps in a detection, it would construct a NxN adjacency
    matrix to find the connected clumps (which is why the memory usage went so
    high).
    
    With this commit, the new 'gal_binary_connected_adjacency_list' has been
    added to avoid this problem. This new function will produce an identical
    output to 'gal_binary_connected_adjacency_matrix', but it won't use a
    matrix, it will use a list of indexs. It can be slightly slower than using
    the adjacency matrix but consumes FAR LESS memory when large numbers are
    involved.
    
    This bug was found with the help of Samane Raji.
---
 NEWS                  |   3 +
 bin/segment/segment.c | 487 ++++++++++++++++++++++++++++++++++++++------------
 doc/gnuastro.texi     |  58 ++++--
 lib/binary.c          |  79 +++++++-
 lib/gnuastro/binary.h |   4 +
 5 files changed, 500 insertions(+), 131 deletions(-)

diff --git a/NEWS b/NEWS
index 1b81997..445aaf0 100644
--- a/NEWS
+++ b/NEWS
@@ -107,6 +107,8 @@ See the end of the file for license conditions.
 
   Library:
    - GAL_ARITHMETIC_OP_MAKENEW: new 'makenew' operator.
+   - gal_binary_connected_adjacency_list: finding connected components
+     without a square adjacency matrix (which can consume major RAM).
    - gal_blank_flag_remove: Remove all flagged elements in a dataset.
    - gal_blank_remove_rows: remove all rows that have at least one blank.
    - gal_fits_hdu_datasum: calculate DATASUM of given HDU in given FITS file.
@@ -176,6 +178,7 @@ See the end of the file for license conditions.
   bug #59400: CosmicCalculator fails --printparams when redshift isn't given.
   bug #59459: Unclear WCS when both PC and CD exist in input, but conflict.
   bug #59625: MakeProfiles uses last --kernel, if it is called more than once.
+  bug #59700: Segment's excessive RAM usage when many clumps over a detection.
 
 
 
diff --git a/bin/segment/segment.c b/bin/segment/segment.c
index 8f88554..851e8da 100644
--- a/bin/segment/segment.c
+++ b/bin/segment/segment.c
@@ -215,7 +215,7 @@ segment_relab_noseg(struct clumps_thread_params *cltprm)
    a*numobjs+b or b*numobjs+a and get the answer. Since the number of
    objects in a given region will not be too high, this is efficient. */
 static void
-segment_relab_to_objects(struct clumps_thread_params *cltprm)
+segment_relab_to_objects_array(struct clumps_thread_params *cltprm)
 {
   size_t amwidth=cltprm->numtrueclumps+1;
   struct segmentparams *p=cltprm->clprm->p;
@@ -231,11 +231,12 @@ segment_relab_to_objects(struct clumps_thread_params 
*cltprm)
   gal_data_t *adjacency_d=gal_data_alloc(NULL, GAL_TYPE_UINT8, 2, mdsize,
                                          NULL, 1, p->cp.minmapsize,
                                          p->cp.quietmmap, NULL, NULL, NULL);
+
   float *imgss=p->input->array;
+  int32_t *olabel=p->olabel->array;
   double var=cltprm->std*cltprm->std;
   uint8_t *adjacency=adjacency_d->array;
   size_t nngb=gal_dimension_num_neighbors(ndim);
-  int32_t *clumptoobj, *olabel=p->olabel->array;
   size_t *dinc=gal_dimension_increment(ndim, dsize);
   size_t *s, *sf, i, j, ii, rpnum, *nums=nums_d->array;
   double ave, rpsum, c=sqrt(1/p->cpscorr), *sums=sums_d->array;
@@ -249,130 +250,398 @@ segment_relab_to_objects(struct clumps_thread_params 
*cltprm)
      matrix. Note that at this point, the rivers are also part of the
      "diffuse" regions. So we don't need to go over all the indexs of this
      object, only its diffuse indexs. */
-  if(cltprm->diffuseindexs->size)
-    {
-      sf=(s=cltprm->diffuseindexs->array)+cltprm->diffuseindexs->size;
-      do
-        /* We only want to work on pixels that have already been identified
-           as touching more than one label: river pixels. */
-        if( olabel[ *s ]==GAL_LABEL_RIVER )
-          {
-            /* Initialize the values. */
-            i=ii=0;
-            rpnum=1;              /* River-pixel number of points used. */
-            rpsum=imgss[*s];      /* River-pixel sum of values used.    */
-            memset(ngblabs, 0, nngb*sizeof *ngblabs);
-
-            /* Check all the fully-connected neighbors of this pixel and
-               see if it touches a label or not */
-            GAL_DIMENSION_NEIGHBOR_OP(*s, ndim, dsize, ndim, dinc, {
-                if( olabel[nind] > 0 )
-                  {
-                    /* Add this neighbor's value and increment the number. */
-                    if( !isnan(imgss[nind]) ) { ++rpnum; rpsum+=imgss[nind]; }
+  sf=(s=cltprm->diffuseindexs->array)+cltprm->diffuseindexs->size;
+  do
+    /* We only want to work on pixels that have already been identified as
+       touching more than one label: river pixels. */
+    if( olabel[ *s ]==GAL_LABEL_RIVER )
+      {
+        /* Initialize the values. */
+        i=ii=0;
+        rpnum=1;              /* River-pixel number of points used. */
+        rpsum=imgss[*s];      /* River-pixel sum of values used.    */
+        memset(ngblabs, 0, nngb*sizeof *ngblabs);
+
+        /* Check all the fully-connected neighbors of this pixel and
+           see if it touches a label or not */
+        GAL_DIMENSION_NEIGHBOR_OP(*s, ndim, dsize, ndim, dinc, {
+            if( olabel[nind] > 0 )
+              {
+                /* Add this neighbor's value and increment the number. */
+                if( !isnan(imgss[nind]) ) { ++rpnum; rpsum+=imgss[nind]; }
 
-                    /* Go over the already found neighbors and see if this
-                       grown clump has already been considered or not. */
-                    for(i=0;i<ii;++i) if(ngblabs[i]==olabel[nind]) break;
+                /* Go over the already found neighbors and see if this
+                   grown clump has already been considered or not. */
+                for(i=0;i<ii;++i) if(ngblabs[i]==olabel[nind]) break;
 
-                    /* This is the first time we are getting to this
-                       neighbor: */
-                    if(i==ii) ngblabs[ ii++ ] = olabel[nind];
-                  }
-              } );
+                /* This is the first time we are getting to this
+                   neighbor: */
+                if(i==ii) ngblabs[ ii++ ] = olabel[nind];
+              }
+          } );
+
+        /* For a check:
+        if(ii>0)
+          {
+            printf("%zu, %zu:\n", *s%dsize[1]+1, *s/dsize[1]+1);
+            for(i=0;i<ii;++i) printf("\t%u\n", ngblabs[i]);
+          }
+        */
+
+        /* If more than one neighboring label was found, fill in the
+           'sums' and 'nums' adjacency matrixs with the values for this
+           pixel. Recall that ii is the number of neighboring labels to
+           this river pixel. */
+        if(ii>i)
+          for(i=0;i<ii;++i)
+            for(j=0;j<ii;++j)
+              if(i!=j)
+                {
+                  /* For safety, we will fill both sides of the
+                     diagonal. */
+                  ++nums[ ngblabs[i] * amwidth + ngblabs[j] ];
+                  ++nums[ ngblabs[j] * amwidth + ngblabs[i] ];
+                  sums[ ngblabs[i] * amwidth + ngblabs[j] ] +=
+                    rpsum/rpnum;
+                  sums[ ngblabs[j] * amwidth + ngblabs[i] ] +=
+                    rpsum/rpnum;
+                }
+      }
+  while(++s<sf);
 
-            /* For a check:
-            if(ii>0)
+  /* We now have the average values and number of all rivers between
+     the grown clumps. We now want to finalize their connection (given
+     the user's criteria). */
+  for(i=1;i<amwidth;++i)
+    for(j=1;j<i;++j)
+      {
+        ii = i * amwidth + j;
+        if(nums[ii]>p->minriverlength)       /* There is a connection. */
+          {
+            /* For easy reading. */
+            ave=sums[ii]/nums[ii];
+
+            /* In case the average is negative (only possible if 'sums'
+               is negative), don't change the adjacency: it is already
+               initialized to zero. Note that even an area of 1 is
+               acceptable, and we put no area criteria here, because
+               the fact that a river exists between two clumps is
+               important. */
+            if( ave>0.0f && ( c * ave / sqrt(ave+var) ) > p->objbordersn )
               {
-                printf("%zu, %zu:\n", *s%dsize[1]+1, *s/dsize[1]+1);
-                for(i=0;i<ii;++i) printf("\t%u\n", ngblabs[i]);
+                adjacency[ii]=1;   /* We want to set both sides of the */
+                adjacency[ j * amwidth + i ] = 1; /* Symmetric matrix. */
               }
-            */
-
-            /* If more than one neighboring label was found, fill in the
-               'sums' and 'nums' adjacency matrixs with the values for this
-               pixel. Recall that ii is the number of neighboring labels to
-               this river pixel. */
-            if(ii>i)
-              for(i=0;i<ii;++i)
-                for(j=0;j<ii;++j)
-                  if(i!=j)
-                    {
-                      /* For safety, we will fill both sides of the
-                         diagonal. */
-                      ++nums[ ngblabs[i] * amwidth + ngblabs[j] ];
-                      ++nums[ ngblabs[j] * amwidth + ngblabs[i] ];
-                      sums[ ngblabs[i] * amwidth + ngblabs[j] ] +=
-                        rpsum/rpnum;
-                      sums[ ngblabs[j] * amwidth + ngblabs[i] ] +=
-                        rpsum/rpnum;
-                    }
           }
-      while(++s<sf);
+      }
 
-
-      /* We now have the average values and number of all rivers between
-         the grown clumps. We now want to finalize their connection (given
-         the user's criteria). */
+  /* For a check:
+  if(cltprm->id==XXX)
+    {
+      printf("=====================\n");
+      printf("%zu:\n--------\n", cltprm->id);
       for(i=1;i<amwidth;++i)
-        for(j=1;j<i;++j)
+        {
+          printf(" %zu...\n", i);
+          for(j=1;j<amwidth;++j)
+            {
+              ii=i*amwidth+j;
+              if(nums[ii])
+                {
+                  ave=sums[ii]/nums[ii];
+                  printf("    ...%zu: N:%-4zu S:%-10.2f S/N: %-10.2f "
+                         "--> %u\n", j, nums[ii], sums[ii],
+                         c*ave/sqrt(ave+var), adjacency[ii]);
+                }
+            }
+          printf("\n");
+        }
+    }
+  */
+
+  /* Calculate the new labels for each grown clump. */
+  cltprm->clumptoobj = gal_binary_connected_adjacency_matrix(adjacency_d,
+                                                     &cltprm->numobjects);
+
+  /* Clean up and return. */
+  free(dinc);
+  free(ngblabs);
+  gal_data_free(nums_d);
+  gal_data_free(sums_d);
+  gal_data_free(adjacency_d);
+}
+
+
+
+
+
+/* For a large number of clumps, 'segment_relab_to_objects_array' will
+   consume too much memory and can completely fill the memory. So we need
+   to use a list-based adjacency solution. */
+typedef struct segment_relab_list_t
+{
+  size_t num;
+  double sum;
+  size_t ngbid;
+  struct segment_relab_list_t *next;
+} segment_relab_list_t;
+
+
+
+
+
+static void
+segment_relab_list_add(struct segment_relab_list_t **list, size_t ngbid,
+                       double value)
+{
+  int done=0;
+  struct segment_relab_list_t *tmp=NULL;
+
+  /* Check if the desired index has already been found or not. Note that if
+     'list' is empty, then it will never enter the loop.*/
+  for(tmp=*list; tmp!=NULL; tmp=tmp->next)
+    if( tmp->ngbid == ngbid )
+      {
+        tmp->sum += value;
+        ++tmp->num;
+        done=1;
+        break;
+      }
+
+  /* Either the list was empty, or the desired index didn't exist in it. So
+     we need to allocate a new node and add it to the list. */
+  if(done==0)
+    {
+      /* Allocate a new node. */
+      errno=0;
+      tmp=malloc(sizeof *tmp);
+      if(tmp==NULL)
+        error(EXIT_FAILURE, errno, "%s: couldn't allocate %zu bytes "
+              "for 'tmp'", __func__, sizeof tmp);
+
+      /* Fill in the node. */
+      tmp->num=1;
+      tmp->sum=value;
+      tmp->ngbid=ngbid;
+
+      /* Put the node at the top of the list. */
+      tmp->next = *list ? *list : NULL;
+      *list=tmp;
+    }
+}
+
+
+
+
+
+static void
+segment_relab_to_objects_list(struct clumps_thread_params *cltprm)
+{
+  size_t amwidth=cltprm->numtrueclumps+1;
+  struct segmentparams *p=cltprm->clprm->p;
+  size_t ndim=p->input->ndim, *dsize=p->input->dsize;
+
+  int addadj;
+  float *imgss=p->input->array;
+  size_t *s, *sf, i, j, ii, rpnum;
+  int32_t *olabel=p->olabel->array;
+  double var=cltprm->std*cltprm->std;
+  gal_list_sizet_t **adjacency, *atmp;
+  segment_relab_list_t **rlist, *rtmp;
+  double ave, rpsum, c=sqrt(1/p->cpscorr);
+  size_t nngb=gal_dimension_num_neighbors(ndim);
+  size_t *dinc=gal_dimension_increment(ndim, dsize);
+  int32_t *ngblabs=gal_pointer_allocate(GAL_TYPE_UINT32, nngb, 0, __func__,
+                                         "ngblabs");
+
+  /* Allocate the two lists to keep the number and sum, as well as the
+     final adjacency matrix. */
+  errno=0;
+  rlist=calloc(amwidth, sizeof *rlist);
+  if(rlist==NULL)
+    error(EXIT_FAILURE, errno, "%s: couldn't allocate %zu bytes for 'rlist'",
+          __func__, amwidth * (sizeof *rlist));
+  errno=0;
+  adjacency=calloc(amwidth, sizeof *adjacency);
+  if(adjacency==NULL)
+    error(EXIT_FAILURE, errno, "%s: couldn't allocate %zu bytes for 
'adjacency'",
+          __func__, amwidth * (sizeof *adjacency));
+
+  /* Go over all the still-unlabeled pixels (if they exist) and see which
+     labels they touch. In the process, get the average value of the
+     river-pixel values and put them in the respective adjacency
+     matrix. Note that at this point, the rivers are also part of the
+     "diffuse" regions. So we don't need to go over all the indexs of this
+     object, only its diffuse indexs. */
+  sf=(s=cltprm->diffuseindexs->array)+cltprm->diffuseindexs->size;
+  do
+    /* We only want to work on pixels that have already been identified as
+       touching more than one label: river pixels. */
+    if( olabel[ *s ]==GAL_LABEL_RIVER )
+      {
+        /* Initialize the values. */
+        i=ii=0;
+        rpnum=1;              /* River-pixel number of points used. */
+        rpsum=imgss[*s];      /* River-pixel sum of values used.    */
+        memset(ngblabs, 0, nngb*sizeof *ngblabs);
+
+        /* Check all the fully-connected neighbors of this pixel and
+           see if it touches a label or not */
+        GAL_DIMENSION_NEIGHBOR_OP(*s, ndim, dsize, ndim, dinc, {
+            if( olabel[nind] > 0 )
+              {
+                /* Add this neighbor's value and increment the number. */
+                if( !isnan(imgss[nind]) ) { ++rpnum; rpsum+=imgss[nind]; }
+
+                /* Go over the already found neighbors and see if this
+                   grown clump has already been considered or not. */
+                for(i=0;i<ii;++i) if(ngblabs[i]==olabel[nind]) break;
+
+                /* This is the first time we are getting to this
+                   neighbor: */
+                if(i==ii) ngblabs[ ii++ ] = olabel[nind];
+              }
+          } );
+
+        /* For a check:
+        if(ii>0)
+          {
+            printf("%zu, %zu:\n", *s%dsize[1]+1, *s/dsize[1]+1);
+            for(i=0;i<ii;++i) printf("\t%u\n", ngblabs[i]);
+          }
+        */
+
+        /* If more than one neighboring label was found, fill in the
+           'sums' and 'nums' adjacency matrixs with the values for this
+           pixel. Recall that ii is the number of neighboring labels to
+           this river pixel. */
+        if(ii>i)
+          for(i=0;i<ii;++i)
+            for(j=0;j<ii;++j)
+              if(i!=j)
+                {
+                  /* For safety and ease of processing, we will fill
+                     both sides of the diagonal. */
+                  segment_relab_list_add(&rlist[ ngblabs[i] ], ngblabs[j],
+                                         rpsum/rpnum);
+                  segment_relab_list_add(&rlist[ ngblabs[j] ], ngblabs[i],
+                                         rpsum/rpnum);
+                }
+      }
+  while(++s<sf);
+
+  /* We now have the average values and number of all rivers between
+     the grown clumps. We now want to finalize their connection (given
+     the user's criteria). */
+  for(i=1; i<amwidth; ++i)
+    for(rtmp=rlist[i]; rtmp!=NULL; rtmp=rtmp->next)
+      {
+        if(rtmp->num > p->minriverlength)  /* There is a connection. */
           {
-            ii = i * amwidth + j;
-            if(nums[ii]>p->minriverlength)       /* There is a connection. */
+            /* For easy reading. */
+            ave = rtmp->sum / rtmp->num;
+
+            /* In case the average is negative (only possible if 'sums'
+               is negative), don't change the adjacency: it is already
+               initialized to zero. Note that even an area of 1 is
+               acceptable, and we put no area criteria here, because
+               the fact that a river exists between two clumps is
+               important. */
+            if( ave>0.0f && ( c * ave / sqrt(ave+var) ) > p->objbordersn )
               {
-                /* For easy reading. */
-                ave=sums[ii]/nums[ii];
-
-                /* In case the average is negative (only possible if 'sums'
-                   is negative), don't change the adjacency: it is already
-                   initialized to zero. Note that even an area of 1 is
-                   acceptable, and we put no area criteria here, because
-                   the fact that a river exists between two clumps is
-                   important. */
-                if( ave>0.0f && ( c * ave / sqrt(ave+var) ) > p->objbordersn )
+                addadj=1;
+                for(atmp=adjacency[i]; atmp!=NULL; atmp=atmp->next)
+                  if(atmp->v==rtmp->ngbid)
+                    { addadj=0; break; }
+                if(addadj)
                   {
-                    adjacency[ii]=1;   /* We want to set both sides of the */
-                    adjacency[ j * amwidth + i ] = 1; /* Symmetric matrix. */
+                    gal_list_sizet_add(&adjacency[i], rtmp->ngbid);
+                    gal_list_sizet_add(&adjacency[rtmp->ngbid], i);
                   }
               }
           }
+      }
 
-
-      /* For a check:
-      if(cltprm->id==XXX)
+  /* For a check:
+  if(cltprm->id==2)
+    {
+      printf("=====================\n");
+      printf("%zu:\n--------\n", cltprm->id);
+      for(i=1; i<amwidth; ++i)
         {
-          printf("=====================\n");
-          printf("%zu:\n--------\n", cltprm->id);
-          for(i=1;i<amwidth;++i)
+          printf(" %zu...\n", i);
+          for(rtmp=rlist[i]; rtmp!=NULL; rtmp=rtmp->next)
             {
-              printf(" %zu...\n", i);
-              for(j=1;j<amwidth;++j)
+              if(rtmp->num)
                 {
-                  ii=i*amwidth+j;
-                  if(nums[ii])
-                    {
-                      ave=sums[ii]/nums[ii];
-                      printf("    ...%zu: N:%-4zu S:%-10.2f S/N: %-10.2f "
-                             "--> %u\n", j, nums[ii], sums[ii],
-                             c*ave/sqrt(ave+var), adjacency[ii]);
-                    }
+                  ave=rtmp->sum/rtmp->num;
+                  printf("    ...%zu: N:%-4zu S:%-10.2f S/N: %-10.2f\n",
+                         rtmp->ngbid, rtmp->num, rtmp->sum,
+                         c*ave/sqrt(ave+var));
                 }
-              printf("\n");
             }
+          printf("FINAL: ");
+          for(atmp=adjacency[i]; atmp!=NULL; atmp=atmp->next)
+            printf("%zu ", atmp->v);
+          printf("\n\n");
         }
-      */
+      exit(0);
+    }
+  */
+
+  /* Calculate the new labels for each grown clump. */
+  cltprm->clumptoobj=gal_binary_connected_adjacency_list(adjacency,
+                                               amwidth, p->cp.minmapsize,
+                                               p->cp.quietmmap,
+                                               &cltprm->numobjects);
+
+  /* Clean up. */
+  for(i=1; i<amwidth; ++i)
+    while(rlist[i]!=NULL)
+      { rtmp=rlist[i]->next; free(rlist[i]); rlist[i]=rtmp; }
+  for(i=1; i<amwidth; ++i) gal_list_sizet_free(adjacency[i]);
+  free(adjacency);
+  free(ngblabs);
+  free(rlist);
+  free(dinc);
+}
+
+
+
+
+
+/* Relabel objects. */
+static void
+segment_relab_to_objects(struct clumps_thread_params *cltprm)
+{
+  struct segmentparams *p=cltprm->clprm->p;
 
+  size_t *s, *sf;
+  size_t i, amwidth=cltprm->numtrueclumps+1;
+  int32_t *clumptoobj, *olabel=p->olabel->array;
 
-      /* Calculate the new labels for each grown clump. */
-      cltprm->clumptoobj = gal_binary_connected_adjacency_matrix(adjacency_d,
-                                                         &cltprm->numobjects);
+  /* Find the final object IDs if there is any list of diffuse pixels. It
+     can happen that we don't have a list of diffuse pixels when the user
+     sets a very high 'gthresh' threshold and wants to make sure that each
+     clump is a separate object. So we need to define the number of objects
+     and 'clumptoobj' manually.*/
+  if(cltprm->diffuseindexs->size)
+    {
+      /* See if we should use a matrix-based adjacent finding (good for
+         small numbers) or a list-based method (necessary for large
+         numbers). Here we'll set the limit to 1000, because of this: the
+         adjacency array will be 1e6 pixels in three types (size_t, double
+         and uint8_t), so it will consume (8+8+1)*1e6 bytes which is 17
+         megabytes and reasonable. But the same argument for a 10000 limit
+         would be 17*e8 bytes or 1.7GB which is not reasonable. Note that
+         cases with +600000 have also been encountered (wide images of
+         dense fields near the Milky way disk). */
+      if( amwidth>1000 )
+        segment_relab_to_objects_list(cltprm);
+      else
+        segment_relab_to_objects_array(cltprm);
       clumptoobj = cltprm->clumptoobj->array;
     }
-
-  /* There was no list of diffuse pixels, this happens when the user sets a
-     very high 'gthresh' threshold and wants to make sure that each clump
-     is a separate object. So we need to define the number of objects and
-     'clumptoobj' manually. */
   else
     {
       /* Allocate the 'clumptoobj' array. */
@@ -388,7 +657,6 @@ segment_relab_to_objects(struct clumps_thread_params 
*cltprm)
       cltprm->numobjects = cltprm->numtrueclumps;
     }
 
-
   /* For a check
   if(cltprm->id==XXXX)
     {
@@ -400,26 +668,18 @@ segment_relab_to_objects(struct clumps_thread_params 
*cltprm)
     }
   */
 
-
   /* Correct all the labels. */
   sf=(s=cltprm->indexs->array)+cltprm->indexs->size;
   do
     if( olabel[*s] > 0 )
       olabel[*s] = clumptoobj[ olabel[*s] ];
   while(++s<sf);
-
-
-  /* Clean up and return. */
-  free(dinc);
-  free(ngblabs);
-  gal_data_free(nums_d);
-  gal_data_free(sums_d);
-  gal_data_free(adjacency_d);
 }
 
 
 
 
+
 /* The correspondance between the clumps and objects has been found. With
    this function, we want to correct the clump labels so the clump IDs in
    each object start from 1 and are contiguous. */
@@ -678,7 +938,10 @@ segment_on_threads(void *in_prm)
 
               /* Identify the objects in this detection using the grown
                  clumps and correct the grown clump labels into new object
-                 labels. */
+                 labels. When the number of clumps are large the
+                 array-based adjacency finding will consume too much
+                 memory. So we should switch to a list-based adjacency
+                 process instead. */
               segment_relab_to_objects(&cltprm);
               if(clprm->step==4)
                 {
diff --git a/doc/gnuastro.texi b/doc/gnuastro.texi
index 7428db7..b4964ea 100644
--- a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ -26415,22 +26415,16 @@ Note that the indexs will only be calculated on the 
pixels with a value of 1 and
 @deftypefun {gal_data_t *} gal_binary_connected_adjacency_matrix (gal_data_t 
@code{*adjacency}, size_t @code{*numconnected})
 @cindex Adjacency matrix
 @cindex Matrix, adjacency
-Find the number of connected labels and new labels based on an adjacency
-matrix, which must be a square binary array (type
-@code{GAL_TYPE_UINT8}). The returned dataset is a list of new labels for
-each old label. In other words, this function will find the objects that
-are connected (possibly through a third object) and in the output array,
-the respective elements for all input labels is going to have the same
-value. The total number of connected labels is put into the space that
-@code{numconnected} points to.
-
-An adjacency matrix defines connection between two labels. For example,
-let's assume we have 5 labels and we know that labels 1 and 5 are connected
-to label 3, but are not connected with each other. Also, labels 2 and 4 are
-not touching any other label. So in total we have 3 final labels: one
-combined object (merged from labels 1, 3, and 5) and the initial labels 2
-and 4. The input adjacency matrix would look like this (note the extra row
-and column for a label 0 which is ignored):
+Find the number of connected labels and new labels based on an adjacency 
matrix, which must be a square binary array (type @code{GAL_TYPE_UINT8}).
+The returned dataset is a list of new labels for each old label.
+In other words, this function will find the objects that are connected 
(possibly through a third object) and in the output array, the respective 
elements for all input labels is going to have the same value.
+The total number of connected labels is put into the space that 
@code{numconnected} points to.
+
+An adjacency matrix defines connection between two labels.
+For example, let's assume we have 5 labels and we know that labels 1 and 5 are 
connected to label 3, but are not connected with each other.
+Also, labels 2 and 4 are not touching any other label.
+So in total we have 3 final labels: one combined object (merged from labels 1, 
3, and 5) and the initial labels 2 and 4.
+The input adjacency matrix would look like this (note the extra row and column 
for a label 0 which is ignored):
 
 @example
             INPUT                             OUTPUT
@@ -26445,8 +26439,36 @@ in_lab 4 -->   0  0  0  0  0  0   |   | 0 | 1 | 2 | 1 
| 3 | 1 |
 in_lab 5 -->   0  0  0  1  0  0   |
 @end example
 
-Although the adjacency matrix as used here is symmetric, currently this
-function assumes that it is filled on both sides of the diagonal.
+Although the adjacency matrix as used here is symmetric, currently this 
function assumes that it is filled on both sides of the diagonal.
+@end deftypefun
+
+@deftypefun {gal_data_t *} gal_binary_connected_adjacency_list 
(gal_list_sizet_t @code{**listarr}, size_t @code{number}, size_t 
@code{minmapsize}, int @code{quietmmap}, size_t @code{*numconnected})
+@cindex RAM
+Find the number of connected labels and new labels based on an adjacency list.
+The output of this function is identical to that of 
@code{gal_binary_connected_adjacency_matrix}.
+But the major difference is that it uses a list of connected labels to each 
label instead of a square adjacency matrix.
+This is done because when the number of labels becomes very large (for example 
on the scale of 100,000), the adjacency matrix can consume more than 10GB of 
RAM!
+
+The input list has the following format: it is an array of pointers to 
@code{gal_list_sizet_t *} (or @code{gal_list_sizet_t **}).
+The array has @code{number} elements and each @code{listarr[i]} is a linked 
list of @code{gal_list_sizet_t *}.
+As a demonstration, the input of the same example in 
@code{gal_binary_connected_adjacency_matrix} would look like below and the 
output of this function will be identical to there.
+
+@example
+listarr[0] = NULL
+listarr[1] = 3
+listarr[2] = NULL
+listarr[3] = 1 -> 5
+listarr[4] = NULL
+listarr[5] = 3
+@end example
+
+From this example, it is already clear that this method will consume far less 
memory.
+But because it needs to parse lists (and not easily jump between array 
elements), it can be slower.
+But in scenarios where there are too many objects (that may exceed the whole 
system's RAM+SWAP), this option is a good alternative and the drop in 
processing speed is worth getting the job done.
+
+Similar to @code{gal_binary_connected_adjacency_matrix}, this function will 
write the final number of connected labels in @code{numconnected}.
+But since it takes no @code{gal_data_t *} argument (where it can inherit the 
@code{minmapsize} and @code{quietmmap} parameters), it also needs these as 
input.
+For more on @code{minmapsize} and @code{quietmmap}, see @ref{Memory 
management}.
 @end deftypefun
 
 @deftypefun {gal_data_t *} gal_binary_holes_label (gal_data_t @code{*input}, 
int @code{connectivity}, size_t @code{*numholes})
diff --git a/lib/binary.c b/lib/binary.c
index 7ebfdf9..2dd2dcf 100644
--- a/lib/binary.c
+++ b/lib/binary.c
@@ -653,7 +653,6 @@ gal_binary_connected_adjacency_matrix(gal_data_t *adjacency,
         ++curlab;
       }
 
-
   /* For a check.
   printf("=== Old labels --> new labels ===\n");
   for(i=1;i<num;++i) printf("%zu: %u\n", i, newlabs[i]);
@@ -668,6 +667,84 @@ gal_binary_connected_adjacency_matrix(gal_data_t 
*adjacency,
 
 
 
+/* List-based adjacency matrix: when the number of points to find adjacent
+   elements is large, the array-based adjacency solution of
+   'gal_binary_connected_adjacency_matrix' will consume far too much memory
+   (since it is a square). In those cases, we need to work based on
+   lists.
+
+   The input is an array of 'size_t' list pointers. Each one points to the
+   labels that are touching it.
+
+                     indexs        b       d
+                                   ^       ^
+                     indexs        a   b   c   a
+                                   ^   ^   ^   ^
+                     listarr:    | * | * | * | * |
+*/
+gal_data_t *
+gal_binary_connected_adjacency_list(gal_list_sizet_t **listarr,
+                                    size_t number, size_t minmapsize,
+                                    int quietmmap, size_t *numconnected)
+{
+  size_t i, p;
+  gal_list_sizet_t *tmp;
+  gal_data_t *newlabs_d;
+  gal_list_sizet_t *Q=NULL;
+  int32_t *newlabs, curlab=1;
+
+  /* Allocate (and clear) the output datastructure. */
+  newlabs_d=gal_data_alloc(NULL, GAL_TYPE_INT32, 1, &number, NULL, 1,
+                           minmapsize, quietmmap, NULL, NULL, NULL);
+  newlabs=newlabs_d->array;
+
+  /* Go over the input matrix and apply the same principle as we used to
+     identify connected components in an image: through a queue, find those
+     elements that are connected. */
+  for(i=1;i<number;++i)
+    if(newlabs[i]==0)
+      {
+        /* Add this old label to the list that must be corrected. */
+        gal_list_sizet_add(&Q, i);
+
+        /* Continue while the list has elements. */
+        while(Q!=NULL)
+          {
+            /* Pop the top old-label from the list. */
+            p=gal_list_sizet_pop(&Q);
+
+            /* If it has already been labeled then ignore it. */
+            if( newlabs[p]!=curlab )
+              {
+                /* Give it the new label. */
+                newlabs[p]=curlab;
+
+                /* Go over the adjacency list for this touching object and
+                   see if there are any not-yet-labeled objects that are
+                   touching it. */
+                for(tmp=listarr[p]; tmp!=NULL; tmp=tmp->next)
+                  if( newlabs[tmp->v]==0 )
+                    gal_list_sizet_add(&Q, tmp->v);
+              }
+          }
+
+        /* Increment the current label. */
+        ++curlab;
+      }
+
+  /* For a check.
+  printf("=== Old labels --> new labels ===\n");
+  for(i=1;i<number;++i) printf("%zu: %u\n", i, newlabs[i]);
+  */
+
+  /* Return the output. */
+  *numconnected = curlab-1;
+  return newlabs_d;
+}
+
+
+
+
 
 
 
diff --git a/lib/gnuastro/binary.h b/lib/gnuastro/binary.h
index 34fd46e..7811971 100644
--- a/lib/gnuastro/binary.h
+++ b/lib/gnuastro/binary.h
@@ -96,6 +96,10 @@ gal_data_t *
 gal_binary_connected_adjacency_matrix(gal_data_t *adjacency,
                                       size_t *numnewlabs);
 
+gal_data_t *
+gal_binary_connected_adjacency_list(gal_list_sizet_t **listarr,
+                                    size_t number, size_t minmapsize,
+                                    int quietmmap, size_t *numconnected);
 
 /*********************************************************************/
 /*****************            Fill holes          ********************/



reply via email to

[Prev in Thread] Current Thread [Next in Thread]