[gnuastro-devel] [task #14319] Full reproduction pipeline in FITS extens


From:	Mohammad Akhlaghi
Subject:	[gnuastro-devel] [task #14319] Full reproduction pipeline in FITS extension
Date:	Mon, 23 Jan 2017 15:20:54 +0000 (UTC)
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0

URL:
  <http://savannah.gnu.org/task/?14319>

                 Summary: Full reproduction pipeline in FITS extension
                 Project: GNU Astronomy Utilities
            Submitted by: makhlaghi
            Submitted on: Tue 24 Jan 2017 12:20:52 AM JST
         Should Start On: Mon 23 Jan 2017 12:00:00 AM JST
   Should be Finished on: Mon 23 Jan 2017 12:00:00 AM JST
                Category: Table
                Priority: 5 - Normal
              Item Group: Enhancement
                  Status: Postponed
                 Privacy: Public
        Percent Complete: 0%
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
                  Effort: 0.00

    _______________________________________________________

Details:

The reproduction pipeline (for example see the reproduction pipeline for the
paper introducing NoiseChisel
<https://gitlab.com/makhlaghi/NoiseChisel-paper>) that produced a specific
research result or data-set (image/table) is vitally important for its
scientific integrity. 

Ultimately, as discussed in that reproduction example, if a result/dataset is
not reproducible (by an anonymous peer, with no personal communication with
the authors), the result/data cannot be given a "scientific" label.

Currently there is no special way to keep all this very valuable information
along with the data in a FITS file. As a consequence, what most surveys do is
to include a very large collection of header keywords from all the files that
were used to produce the result along with the configuration values they used
in their software (for example see any of the FITS files in the Hubble Space
Telescope GOODS-North processed data
<https://archive.stsci.edu/pub/hlsp/goods/v2/>). 

But merely keeping this information is not enough: the order of operations is
also vitally important, and it is very hard to keep/transmit this ordering
information along with the full set of configuration parameters through header
keywords. Another problem with this large collection of header keywords is
navigation by the user: it is really hard to find important information in all
these (mostly repetitive) keywords.

So here, I suggest using one of the great features of the FITS standard to
address this problem. We can add a new program to Gnuastro to write/read the
full reproduction pipeline (like the one above) as a variable-length array in
a FITS binary table (see section 7.3.5 of the FITS 3.0 definition paper
<http://www.aanda.org/articles/aa/abs/2010/16/aa15362-10/aa15362-10.html>). By
nature, such pipelines won't take more than a few hundred kilobytes (at most),
so keeping them along with ten-or-hundred-megabyte datasets is no significant
burden on the servers or on upload/download, while it allows the full procedure
that generated the data to be kept in the same file as the dataset. Also, when
the pipeline is heavily commented (like the example above), anyone can benefit
from it and understand it.
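
To make the storage mechanism concrete, here is a minimal sketch (in Python,
using only the standard library) of how a variable-length byte column is laid
out according to the FITS standard: each table row holds a `P' descriptor (two
32-bit big-endian integers: the number of elements and the byte offset into
the heap), and the actual bytes live in the heap that follows the main table.
The function names are hypothetical, and the sketch only builds the raw row
and heap bytes; it does not write headers or a complete FITS file.

```python
import struct

def build_vla_column(blobs):
    """Return (rows, heap) for one variable-length byte column.

    Hypothetical helper: only the descriptor/heap layout is shown,
    not the FITS header keywords or the 2880-byte block padding.
    """
    rows = bytearray()
    heap = bytearray()
    for blob in blobs:
        # 'P' descriptor: element count, then offset from the heap start.
        rows += struct.pack(">ii", len(blob), len(heap))
        heap += blob
    return bytes(rows), bytes(heap)

def read_vla_column(rows, heap):
    """Inverse of build_vla_column: recover the list of byte strings."""
    blobs = []
    for i in range(0, len(rows), 8):
        nelem, offset = struct.unpack(">ii", rows[i:i + 8])
        blobs.append(heap[offset:offset + nelem])
    return blobs

# Three rows of different lengths (an empty entry is also allowed).
rows, heap = build_vla_column([b"Makefile contents", b"", b"README"])
```

Since every descriptor has a fixed size (8 bytes here), the main table stays
regular even though each row points to a different amount of data.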

We can take one of the following two approaches:

* The full reproduction pipeline (all directories/subdirectories along with
their files) can be put into one `tar.lz' compressed file, which is then put
into the variable-length array. Note that Lzip
<http://www.nongnu.org/lzip/lzip.html>
offers much better archivability features and also compression ratios for
source code compared to other existing compressors like Gzip, Bzip2, or
`.xz'.

* Each reproduction pipeline filename (along with its directory information)
can be kept separately in the first column of the FITS binary table and its
contents can be compressed (with Lzip) and kept in the next column. 
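
The two layouts above can be sketched as follows. This is only an
illustration under stated assumptions: Python's standard library has no Lzip
module, so the `lzma' module (the `.xz' container) is used here purely as a
stand-in for Lzip, and the file names, contents and function names are all
hypothetical. Writing the resulting bytes into a FITS binary table is not
shown.

```python
import io
import lzma
import tarfile

def pack_as_single_archive(files):
    """First approach: one compressed tar archive of the whole pipeline."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, content in sorted(files.items()):
            info = tarfile.TarInfo(name=path)   # keeps directory information
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    # Stand-in for Lzip: the proposal itself calls for a `tar.lz' file.
    return lzma.compress(buf.getvalue())

def unpack_single_archive(blob):
    """Recover {path: contents} from the single-archive layout."""
    raw = io.BytesIO(lzma.decompress(blob))
    with tarfile.open(fileobj=raw, mode="r") as tar:
        return {m.name: tar.extractfile(m).read()
                for m in tar.getmembers() if m.isfile()}

def pack_as_rows(files):
    """Second approach: one (path, compressed contents) row per file."""
    return [(path, lzma.compress(content))
            for path, content in sorted(files.items())]

# Hypothetical two-file pipeline, held in memory for the example.
pipeline = {
    "Makefile": b"# top-level rules\n",
    "src/detect.mk": b"# detection step\n",
}
archive = pack_as_single_archive(pipeline)
table_rows = pack_as_rows(pipeline)
```

The first layout needs only a single row, while the second allows one file to
be read without decompressing the others.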

In both cases, the high-level command-line Gnuastro program will allow easy
manipulation of this reproducibility information, for example bringing out the
full/partial reproduction pipeline from the FITS binary table. If we define
special conventions for variable names in the pipeline, it would also be
possible, for example, to pull out only the desired variable value without
actually saving the full file.
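
As a rough sketch of that last idea, the following reads a variable's value
directly from an in-memory archive without writing anything to disk. The
convention assumed here (Make-style `name = value' lines), the file names and
the variable `kernel_fwhm' are all hypothetical and only meant to illustrate
the idea; decompression and reading from the FITS table are left out.

```python
import io
import re
import tarfile

def make_archive(files):
    """Build an uncompressed in-memory tar archive (for the example only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, content in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()

def find_variable(archive_bytes, name):
    """Return (file name, value) of the first `name = value' line, or None."""
    # Assumed convention: Make-style assignments (=, := or ?=).
    pattern = re.compile(r"^\s*" + re.escape(name) + r"\s*[:?]?=\s*(.*?)\s*$")
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            text = tar.extractfile(member).read().decode("utf-8", "replace")
            for line in text.splitlines():
                match = pattern.match(line)
                if match:
                    return member.name, match.group(1)
    return None

archive = make_archive({
    "Makefile": b"include config.mk\n",
    "config.mk": b"# Hypothetical setting\nkernel_fwhm = 2.0\n",
})
result = find_variable(archive, "kernel_fwhm")
```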

When the datasets are particularly large (for example >100MB), we can also add
the Lzip-compressed tarballs of Gnuastro and its dependencies (all together
less than 10 megabytes) as other rows of this binary table. This way, the data
will be truly and exactly reproducible, only needing very low-level things
like a C library and a compiler.