[gnuastro-devel] [task #14319] Full reproduction pipeline in FITS extension
From: Mohammad Akhlaghi
Subject: [gnuastro-devel] [task #14319] Full reproduction pipeline in FITS extension
Date: Mon, 23 Jan 2017 15:20:54 +0000 (UTC)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
URL:
<http://savannah.gnu.org/task/?14319>
Summary: Full reproduction pipeline in FITS extension
Project: GNU Astronomy Utilities
Submitted by: makhlaghi
Submitted on: Tue 24 Jan 2017 12:20:52 AM JST
Should Start On: Mon 23 Jan 2017 12:00:00 AM JST
Should be Finished on: Mon 23 Jan 2017 12:00:00 AM JST
Category: Table
Priority: 5 - Normal
Item Group: Enhancement
Status: Postponed
Privacy: Public
Percent Complete: 0%
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Effort: 0.00
_______________________________________________________
Details:
The reproduction pipeline (for example see the reproduction pipeline for the
paper introducing NoiseChisel
<https://gitlab.com/makhlaghi/NoiseChisel-paper>) that produced a specific
research result or data-set (image/table) is vitally important for its
scientific integrity.
Ultimately, as discussed in that reproduction example, if a result/dataset is
not reproducible (by an anonymous peer, with no personal communication with
the authors), the result/data cannot be given a "scientific" label.
Currently there is no special way to keep all this very valuable information
along with the data in a FITS file. As a consequence, what most surveys do is
to include a very large collection of header keywords from all the files that
were used to produce the result along with the configuration values they used
in their software (for example see any of the FITS files in the Hubble Space
Telescope GOODS-North processed data
<https://archive.stsci.edu/pub/hlsp/goods/v2/>).
But merely keeping this information is not enough: the order of operations is
also vitally important, and it is very hard to keep/transmit this ordering
information along with the full set of configuration parameters through header
keywords. Another problem with this large collection of header keywords is
navigation by the user: it is really hard to find important information among
all these (mostly repetitive) keywords.
So here, I suggest using one of the great features of the FITS standard to
address this problem. We can add a new program to Gnuastro to write/read the
full reproduction pipeline (like the one above) into/from a variable-length
array column of a FITS binary table (see section 7.3.5 of the FITS 3.0
definition paper
<http://www.aanda.org/articles/aa/abs/2010/16/aa15362-10/aa15362-10.html>). By
nature, such pipelines won't take more than a few hundred kilobytes (at most),
so keeping them along with datasets of tens or hundreds of megabytes is no
significant burden on the servers or on upload/download, while it allows the
full procedure that generated the data to be kept in the same file as the
dataset. Also, when the pipeline is heavily commented (like the example above),
anyone can benefit from it and understand it.
We can take either of the following two approaches:
* The full reproduction pipeline (all directories/subdirectories along with
their files) can be put into one `tar.lz' compressed file, which is then
stored in the variable-length array. Note that Lzip
<http://www.nongnu.org/lzip/lzip.html> offers much better archivability
features and also compression ratios for source code compared to other
existing compressors like Gzip, Bzip2, or `.xz'.
* Each reproduction pipeline filename (along with its directory information)
can be kept separately in the first column of the FITS binary table and its
contents can be compressed (with Lzip) and kept in the next column.
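
The two options above can be sketched in a few lines of Python. This is only
an illustration of the data layout, not a proposed implementation: the
function names are hypothetical, the pipeline is passed in as a dictionary
mapping each relative path to its contents, and since Lzip is not available in
the Python standard library, the `lzma' module (xz container) is used as a
stand-in compressor. Writing the resulting bytes into a FITS variable-length
array column is left out.

```python
import io
import lzma
import tarfile


def pack_single_blob(files):
    """Option 1: one compressed tar archive holding the whole tree.

    `files` maps a relative path (str) to the file contents (bytes).
    NOTE: lzma (xz) is only a stand-in for Lzip here.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path in sorted(files):
            info = tarfile.TarInfo(name=path)
            info.size = len(files[path])
            tar.addfile(info, io.BytesIO(files[path]))
    return lzma.compress(buf.getvalue())


def unpack_single_blob(blob):
    """Recover the {path: contents} mapping from the single blob."""
    out = {}
    raw = io.BytesIO(lzma.decompress(blob))
    with tarfile.open(fileobj=raw, mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                out[member.name] = tar.extractfile(member).read()
    return out


def pack_rows(files):
    """Option 2: one (path, compressed contents) row per file."""
    return [(path, lzma.compress(files[path])) for path in sorted(files)]


def unpack_rows(rows):
    """Recover the {path: contents} mapping from the table rows."""
    return {path: lzma.decompress(data) for path, data in rows}
```

With the first layout the table needs a single cell, but any access requires
decompressing the whole archive. With the second layout each file can be
listed and extracted on its own, at the cost of compressing many small
buffers separately.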
In both cases, the high-level command-line Gnuastro program will allow easy
manipulation of this reproducibility information, for example extracting the
full or partial reproduction pipeline from the FITS binary table. If we define
special conventions for variable names in the pipeline, it will also be
possible to pull out only the value of a desired variable, without actually
writing the full file to disk.
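
As a rough sketch of that last point, the snippet below looks up one variable
in one stored file entirely in memory. Everything specific in it is an
assumption made for illustration: the per-file row layout of
(path, compressed contents) pairs, the `name = value' one-per-line convention,
the function name, and the use of Python's `lzma' module as a stand-in for
Lzip.

```python
import lzma
import re


def read_variable(rows, path, name):
    """Return the value of `name` in the file stored under `path`.

    `rows` is a list of (path, compressed contents) pairs. The file is
    decompressed in memory only; nothing is written to disk. Returns None
    when the file or the variable is not found.
    ASSUMPTION: variables follow a hypothetical `name = value` convention.
    """
    pattern = (r"^[ \t]*" + re.escape(name)
               + r"[ \t]*=[ \t]*(.*?)[ \t]*$")
    for row_path, data in rows:
        if row_path == path:
            text = lzma.decompress(data).decode("utf-8")
            match = re.search(pattern, text, flags=re.MULTILINE)
            return match.group(1) if match else None
    return None
```

For example, with a row holding a file `config.mk' that contains the line
`kernel-fwhm = 2.5', calling read_variable(rows, "config.mk", "kernel-fwhm")
gives back the string "2.5".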
When the datasets are particularly large (for example >100MB), we can also add
the Lzip-compressed tarballs of Gnuastro and its dependencies (all together
less than 10 megabytes) as other rows of this binary table. This way, the data
will be truly and exactly reproducible, only needing very low-level things
like a C library and a compiler.
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/task/?14319>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/