RE: guix and mirroring dataset

guix-devel


From:	Cook, Malcolm
Subject:	RE: guix and mirroring dataset
Date:	Thu, 27 May 2021 04:37:48 +0000

>> Does the guix project and members suggest best guix-ish practices for
>> managing on premise mirrors of large file-based data-sets such as
>> appear in genomics HPC environments?
>
>From my understanding, it is still "unsolved" and there is no clear
>answer.
>
>Basically, the /gnu/store is not designed for managing large datasets and
>something is somehow missing. On the mailing list gwl-devel@gnu.org, we
>have already discussed that point although nothing came up, AFAIU.
>Recently, we discussed again, see the thread:
>
><https://yhetil.org/gwl/87r1k2ti7k.fsf@elephly.net/T/>

Nice - I missed that thread. It brings up good considerations:
 - "immutable" vs. "mutable" resources
 - IPFS as possible means of distribution
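
One concrete reading of "immutable" is that a resource's content is pinned by a hash, in the spirit of Guix's fixed-output derivations. A minimal sketch of such a check in shell, assuming GNU coreutils' `sha256sum` is installed; the function name `verify_pinned` is made up for illustration:

```sh
#!/bin/sh
# Sketch only: treat a mirrored file as "immutable" by pinning its SHA-256.
# verify_pinned is a hypothetical helper; sha256sum is from GNU coreutils.

# verify_pinned FILE EXPECTED_SHA256
# Succeeds (exit 0) only when FILE hashes to EXPECTED_SHA256.
verify_pinned() {
    # sha256sum prints "<hash>  <file>"; keep the first field only.
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}
```

A "mutable" resource, by contrast, would have no fixed hash to check against, only a recipe for recomputing it.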

>
>Your input is welcome. :-)

I was expecting to find workflows that have been developed for mirroring 
(downloading) genomic resources from sites such as Ensembl/NCBI/UCSC, etc, and 
then creating on-prem derived resources (e.g. blast indexes).  

I currently tend to do this with GNU Make and shell scripting.
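
A minimal sketch of that Make-style pattern in plain shell: rebuild a derived resource only when it is missing or older than the mirrored source. The function names `needs_rebuild` and `build_if_stale`, the file names, and the use of `cp` as a stand-in for a real indexer are all assumptions for illustration, not an existing recipe.

```sh
#!/bin/sh
# Sketch only: Make-like freshness check for a derived genomic resource.
# All names are hypothetical; a real setup would pass a site-specific
# indexer command instead of a stand-in such as cp.

# needs_rebuild SOURCE TARGET
# Succeeds (exit 0) when TARGET is missing or older than SOURCE.
# Note: -nt is not in POSIX, but bash, dash and ksh all support it.
needs_rebuild() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# build_if_stale SOURCE TARGET COMMAND...
# Runs "COMMAND SOURCE TARGET" only when the target is stale.
build_if_stale() {
    src=$1; tgt=$2; shift 2
    if needs_rebuild "$src" "$tgt"; then
        "$@" "$src" "$tgt" && echo "built $tgt"
    else
        echo "up to date: $tgt"
    fi
}
```

Called as, say, `build_if_stale genome.fa genome.idx cp`, this copies on the first run and reports "up to date" on the second, until the source is refreshed by the next mirror sync.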

I was not expecting to find guix efforts toward maintaining such pre-computed 
derived datasets in an upstream repository of any sort, though that would be 
valuable to some.  Illumina, for instance, keeps (or used to keep?) selected 
genome indices for use with their software.  But that is not what I seek, and I 
think much of the rest of your reply assumes it is.

>> Perhaps a guix-ish response to [Go Get Data \(GGD\) is a framework
>> that facilitates reproducible access to genomic
>> data](https://www.nature.com/articles/s41467-021-22381-z) 
>
>AFAIR, Ricardo pointed to this GoGetData. Personally, I have not yet looked
>at the details.

GoGetData does not seek to make upstream derived datasets available.  Rather, 
it is presented "as a fast, reproducible approach to installing standardized 
data recipes".  I assume GWL would be a good language in which to write such 
recipes, and that someone may already be doing so.

GoGetData recipes are just bash scripts, organized in a particular folder 
structure in a GitHub repo, that are expected to conform to a few conventions 
(e.g. variable names for genomes, species, etc.) and to carry metadata in a 
required YAML schema.  They do not have any advanced workflow capabilities such 
as GWL might provide.
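
To make "bash scripts plus required metadata in a fixed folder layout" concrete, here is a rough sketch of a convention check. The file names `recipe.sh` and `meta.yaml` and the required keys `species` and `genome_build` are assumptions chosen for illustration; they are not taken from the GoGetData schema, which should be consulted directly.

```sh
#!/bin/sh
# Sketch only: check that a data-recipe directory follows a simple convention.
# File names and required keys are hypothetical, not the real GoGetData schema.

# check_recipe DIR
# Prints one line per problem; returns non-zero if any problem was found.
check_recipe() {
    dir=$1
    status=0
    for f in recipe.sh meta.yaml; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $f"
            status=1
        fi
    done
    if [ -f "$dir/meta.yaml" ]; then
        for key in species genome_build; do
            # Naive check for a top-level "key:" line; this is not a YAML parser.
            if ! grep -q "^$key:" "$dir/meta.yaml"; then
                echo "missing metadata key: $key"
                status=1
            fi
        done
    fi
    return "$status"
}
```

A check like this covers only the conventions; anything resembling dependency tracking between recipes would still have to come from a workflow layer.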

>> That would build on GWL?
>
>From my understanding, something is missing between ’packages’,
>’process’ and ’workflow’, for instance ’data’. And speaking about
>genomics, there are two kinds of large data:
>
>- fixed output (immutable?): think FASTA and FASTQ
>- computed output (mutable?): think BAM and indexes
>
>and it is not clear how to deal with them. And once that is answered, how
>to share them (substitutes)? HTTP as all are doing, but we could also
>want IPFS or any other things which would avoid the mirroring/sync
>issues. 
>
>> Use cases would be, e.g. download/sync selected (versions of) genomes
>> from Ensembl/NCBI etc and index them for Blast, blat, bowtie{2}, bwa,
>> STAR, GMAP, HiSAT, IGV, BioConductor, etc... 
>>
>> I see much that addresses analysis workflows, such as
>> - [Reproducible genomics analysis pipelines with GNU 
>> Guix](https://www.biorxiv.org/content/10.1101/298653v2.full)
>> - [Scalable Workflows and Reproducible Data Analysis for 
>> Genomics](https://pubmed.ncbi.nlm.nih.gov/31278683/)
>> - [PiGx: reproducible genomics analysis pipelines with GNU 
>> Guix](https://academic.oup.com/gigascience/article/7/12/giy123/5114263)
>>
>> Am I missing similar efforts toward maintaining an up-to-date catalog
>> of the genomic resources that such workflows require? 
>
>For now, some are maintained as packages, for instance:
>
>$ guix search "^r-" hg19 | recsel -C -P name
>r-phastcons100way-ucsc-hg19
>r-bsgenome-hsapiens-ucsc-hg19-masked
>r-txdb-hsapiens-ucsc-hg19-knowngene
>r-bsgenome-hsapiens-ucsc-hg19
>r-snplocs-hsapiens-dbsnp144-grch37
>r-illuminahumanmethylation450kanno-ilmn12-hg19
>r-fdb-infiniummethylation-hg19
>r-copyhelper

Yes, thanks, I see that guix has versions of BioConductor data packages.  These 
are an interesting use case.

>
>which are relatively small, for another instance:
>
>--8<---------------cut here---------------start------------->8---
>r-txdb-hsapiens-ucsc-hg38-knowngene total: 91.8 MiB
>r-bsgenome-hsapiens-ucsc-hg38 total: 765.2 MiB
>r-copyhelper total: 42.9 MiB
>--8<---------------cut here---------------end--------------->8---
>
>
>Hope that helps,
>simon

Thanks Simon, I'm pleased to have your thoughts and pointers on this topic...

~Malcolm

Current Thread

guix and mirroring dataset, Cook, Malcolm, 2021/05/17
- Re: guix and mirroring dataset, zimoun, 2021/05/26
  - RE: guix and mirroring dataset, Cook, Malcolm <=
    - RE: guix and mirroring dataset, zimoun, 2021/05/27
