Re: [GWL] (random) next steps?


From: zimoun
Subject: Re: [GWL] (random) next steps?
Date: Mon, 17 Dec 2018 18:33:49 +0100

Dear all,

> I’m working on updating the GWL manual to show a simple wispy example.

Nice!
I am also taking a deeper look at the manual. :-)


> > Some GWL scripts are already there.
> I only know of Roel’s ATACseq workflow[1], but we could add a few more
> independent process definitions for simple tasks such as sequence
> alignment, trimming, etc.  This could be a git subtree that includes an
> independent repository.

Yes, it should be a git subtree.
One idea would be to collect examples and, at the same time, improve
some kind of test suite.
I mean, I have in mind collecting simple and minimal examples that
could also populate tests/.

As a starting point (and with minimal effort), I would like to rewrite
the minimal Snakemake examples, e.g.,
https://snakemake.readthedocs.io/en/stable/getting_started/examples.html

Once the wispy "syntax" is fixed, it will be a good exercise. ;-)
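
For instance, the first rule of the Snakemake tutorial ("bwa_map")
could read roughly like the process below.  The field names follow my
reading of the manual and may not be exact, and `bwa' and `samtools'
stand for the Guix package variables; so take it as a sketch only:

    ;; Sketch: Snakemake's "bwa_map" rule as a GWL process.
    (define-public bwa-map
      (process
       (name "bwa-map")
       (package-inputs (list bwa samtools))   ; Guix packages
       (data-inputs (list "data/genome.fa" "data/samples/A.fastq"))
       (outputs (list "mapped_reads/A.bam"))
       (procedure
        '(system (string-append
                  "bwa mem data/genome.fa data/samples/A.fastq"
                  " | samtools view -Sb - > mapped_reads/A.bam")))))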


> Scheme itself has (format #f "…" foo bar) for string interpolation.
> With a little macro we could generate the right “format” invocation, so
> that the user could do something similar to what you suggested:
>
>     (shell "gzip ${data-inputs} -c > ${outputs}")
>
>     –> (system (format #f "gzip ~a -c > ~a" data-inputs outputs))
>
> String concatenation is one possibility, but I hope we can do better
> than that.  scsh offers special process forms that would allow us to do
> things like this:
>
>     (shell (gzip ,data-inputs -c > ,outputs))
>
> or
>
>     (run (gzip ,data-inputs -c)
>          (> 1 ,outputs))
>
> Maybe we can take some inspiration from scsh.

I did not know about scsh. I am taking a look...

What I have in mind is to reduce the "gap" between the Lisp syntax and
more mainstream-ish syntaxes such as Snakemake or CWL.
The commas, as in (shell (gzip ,data-inputs -c > ,outputs)), are nice!
But they are less "natural" than simple string interpolation, at
least to people in my environment. ;-)
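
For what it is worth, here is a minimal sketch of such an
interpolation macro in Guile, just to see what it could look like.
This is not GWL code; the name `shell' is only borrowed from your
example:

    (use-modules (ice-9 regex))

    ;; Sketch: replace each ${name} in the template with "~a" at
    ;; expansion time and emit the corresponding 'format' call.
    (define-syntax shell
      (lambda (x)
        (syntax-case x ()
          ((k template)
           (string? (syntax->datum #'template))
           (let* ((str  (syntax->datum #'template))
                  (vars '())
                  (fmt  (regexp-substitute/global
                         #f "[$][{]([^}]+)[}]" str
                         'pre
                         (lambda (m)
                           (set! vars (cons (match:substring m 1) vars))
                           "~a")
                         'post)))
             (with-syntax ((fmt-string fmt)
                           ((var ...)
                            (map (lambda (name)
                                   (datum->syntax #'k (string->symbol name)))
                                 (reverse vars))))
               #'(system (format #f fmt-string var ...))))))))

    ;; (shell "gzip ${data-inputs} -c > ${outputs}")
    ;; ==> (system (format #f "gzip ~a -c > ~a" data-inputs outputs))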

What do you think?


> > 6.
> > The graph of dependencies between the processes/units/rules is written
> > by hand. What should be the best strategy to capture it ? By files "à
> > la" Snakemake ? Other ?
>
> The GWL currently does not use the input information provided by the
> user in the data-inputs field.  For the content-addressable store we
> will need to change this.  The GWL will then be able to determine that
> data-inputs are in fact the outputs of other processes.

Hum? Nice, but how?
I mean, the graph cannot be deduced; it needs to be written by hand,
somehow. Doesn't it?
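
Or do you mean matching the file names, something like the toy sketch
below?  Here `process-outputs' and `process-data-inputs' are
hypothetical accessors, not the actual GWL API:

    (use-modules (srfi srfi-1))

    ;; Toy sketch: an edge (q . p) means q must run before p, because
    ;; some data-input of p appears among the outputs of q.
    (define (dependency-edges processes)
      (append-map
       (lambda (p)
         (filter-map
          (lambda (q)
            (and (not (eq? p q))
                 (any (lambda (in)
                        (member in (process-outputs q)))
                      (process-data-inputs p))
                 (cons q p)))
          processes))
       processes))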



Last, just to give a concrete idea of the input/output sizes we are
talking about.
An aligner such as Bowtie2 or BWA uses as inputs:
 - a fixed dataset (the reference): approx. 25GB for the human species;
 - experimental data (a specific genome): approx. 10GB for some kinds
of sequencing, and a series is approx. 50 experiments or more (one
cohort), so you have to deal with 500GB for one analysis.
The output for each dataset is around 20GB.  This output is then used
by other tools to trim, filter out, compare, etc.
I mean, part of the time is spent moving data (reads/writes), contrary
to HPC simulations---another story, with other issues (MPI, etc.).

Strategies à la git-annex (Haskell, again! ;-) would be nice. But is
the history useful?


Thank you for any comments or ideas.

All the best,
simon


