Re: [gwl-devel] [GWL] (random) next steps?

From: zimoun
Subject: Re: [gwl-devel] [GWL] (random) next steps?
Date: Fri, 4 Jan 2019 18:48:34 +0100

Hi Ricardo,

Happy New Year!!

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
> With a content addressed store we would run processes in isolation and
> map the declared data inputs into the environment.  Instead of working
> on the global namespace of the shared file system we can learn from Guix
> and strictly control the execution environment.  After a process has run
> to completion, only files that were declared as outputs end up in the
> content addressed store.
> A process could declare outputs like this:
>     (define the-process
>       (process
>         (name 'foo)
>         (outputs
>          '((result "path/to/result.bam")
>            (meta   "path/to/meta.xml")))))
> Other processes can then access these files with:
>     (output the-process 'result)
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

OK, something in this spirit?

From my point of view, there are two different paths:
 1- the inputs-outputs are attached to the process/rule/unit;
 2- the processes/rules/units are pure functions, and the
`workflow' describes how to glue them together.

If I understand correctly, Snakemake takes path 1-: the graph is
deduced from the chain of inputs-outputs.
Attached is a dummy example with Snakemake where I reuse one `shell'
command between two different rules. It is ugly because it works with
strings. And the rule `filter' cannot be used without the rule
`move_1', since the two rules are explicitly connected by their
input-output.
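To make the hard-coded coupling concrete, here is a minimal sketch in
the same spirit (rule and file names are illustrative, not the
attached file itself):

```
rule move_1:
    input: "raw.txt"
    output: "moved.txt"
    shell: "mv {input} {output}"

rule filter:
    # Hard-coded: this string must match move_1's output exactly,
    # so `filter' cannot be reused without `move_1'.
    input: "moved.txt"
    output: "filtered.txt"
    shell: "sed '1d' {input} > {output}"
```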

The other approach is to define a function that returns a process.
Then one needs to specify the graph with the `restrictions', in other
words which function composes with which one. However, because we
also want to track the intermediate outputs, the inputs-outputs are
specified for each process; they should be optional, shouldn't they?
If I understand correctly, this is one possible approach to dynamic
workflows.

On the one hand, with path 1-, it is hard to reuse a process/rule
because the composition is hard-coded in the inputs-outputs
(duplication of the same process/rule with different inputs-outputs).
The graph is written by the user when they write the inputs-outputs.
On the other hand, with path 2-, it is difficult to provide both the
inputs-outputs to the function and the graph without duplicating some
code.

My thinking is not fully clear yet, and I have no idea how to achieve
the functional idea below.
The process/rule/unit is a function with free inputs-outputs
(arguments or variables), and it returns a process.
The workflow is a scope where these functions are combined through
some inputs-outputs.

For example, let us define two processes: move and filter.

(define* (move in out #:optional (opt ""))
  (process
    (name 'move)
    (packages `(("mv" ,mv)))
    (input in)
    (output out)
    (procedure
     `(system ,(string-append "mv " opt " " in " " out)))))

(define (filter in out)
  (process
    (name 'filter)
    (packages `(("sed" ,sed)))
    (input in)
    (output out)
    (procedure
     `(system ,(string-append "sed '1d' " in " > " out)))))

Then let us create the workflow that encodes the graph:

(define wkflow:move->filter->move
   (let ((tmpA (temp-file))
         (tmpB (temp-file)))
      `((,move "my-input" ,tmpA)
        (,filter ,tmpA ,tmpB)
        (,move ,tmpB "my-output" " -v "))))

From the `processes', it would be nice to deduce the graph.
I am not sure it is possible... For one thing, it is not clear which
one is the entry point. But that could be fixed by the `input' and
`output' fields of `workflow'.
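Just to make "deduce the graph" concrete, a hypothetical sketch
(assuming each workflow entry has the shape `(proc input output . rest)`
as in the list above; nothing here is existing GWL API):

```scheme
(use-modules (srfi srfi-1))

;; There is an edge A -> B whenever A's output string is B's input
;; string.  Returns a list of (A . B) pairs of workflow entries.
(define (deduce-edges steps)
  (append-map
   (lambda (a)
     (filter-map (lambda (b)
                   (and (equal? (third a) (second b))
                        (cons a b)))
                 steps))
   steps))
```

This only works when inputs and outputs are compared as plain strings,
which is exactly the fragility of path 1-; temp-file objects would
need an `equal?'-friendly identity.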

Since move and filter are just pure functions, one can easily reuse
them, e.g. apply them in a different order:

(define wkflow:filter->move
   (let ((tmp (temp-file)))
      `((,move ,tmp "one-output")
        (,filter "one-input" ,tmp))))

As you said, it should also be possible to write:

      `((,move ,(output filter) "one-output")
        (,filter "one-input" ,(temp-file #:hold #t)))

Do you think it is doable? How hard would it be?

> The question here is just how far we want to take the idea of “content
> addressed” – is it enough to take the hash of all inputs or do we need
> to compute the output hash, which could be much more expensive?

Yes, I agree.
Moreover, if the output is hashed, then the hash should depend on the
hash of the inputs and on the hash of the tools, shouldn't it?
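For the cheaper variant (keying on inputs and tools only, without
hashing the possibly large output), a minimal sketch, assuming some
`sha256-string' procedure which is not shown here (e.g. one built on
guile-gcrypt):

```scheme
;; Hypothetical sketch: a cache key derived from the hashes of the
;; inputs and of the tools.  If any input or tool changes, the key
;; changes, so the process must be re-run.
(define (process-key input-hashes tool-hashes)
  ;; Sort so the key does not depend on the order of declaration.
  (sha256-string
   (string-join (sort (append input-hashes tool-hashes) string<?) ":")))
```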

To me, once the workflow has run, one is happy with the results.
Then, after a couple of months or years, one still has a copy of the
working folder but is no longer able to find out how the results were
computed: which versions of the tools were used, the binaries no
longer work, etc.
Therefore, it should be easy to extract from the results how they
were computed: versions, etc.

Last, is it useful to write the intermediate files to disk if they are
not stored?
In the thread [0], we discussed the possibility of streaming through
pipes. Take the simple case:
   filter input > filtered
   quality filtered > output
The piped version is better if you do not care about the filtered file:
   filter input | quality > output

However, the classic pipe does not fit for this case:
   filter input_R1 > R1_filtered
   filter input_R2 > R2_filtered
   align R1_filtered R2_filtered > output_aligned
In general, one is not interested in keeping the files
R{1,2}_filtered. So why spend time writing them to disk and hashing
them?
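For what it is worth, this multi-input case can already be streamed in
the shell with named pipes (FIFOs); a self-contained sketch, with
`sed' standing in for `filter' and `paste' standing in for `align':

```shell
# Named pipes let both "filter" outputs stream into "align" without
# ever landing on disk as regular files.
printf 'header\na\nb\n' > input_R1
printf 'header\nx\ny\n' > input_R2
mkfifo R1_filtered R2_filtered
sed '1d' input_R1 > R1_filtered &   # each writer blocks until a reader opens
sed '1d' input_R2 > R2_filtered &
paste R1_filtered R2_filtered > output_aligned
wait
rm R1_filtered R2_filtered input_R1 input_R2
```

One caveat: a FIFO cannot be seeked, so a tool that reads its input
more than once (as some aligners do) cannot consume one.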

In other words, would it be doable to stream between `processes' at
the process level?

It is a different point of view, but it reaches the same aim, I guess.

Last, could we add a GWL session to the before-FOSDEM days?

What do you think?

Thank you.

All the best,


Attachment: func.smk
Description: Binary data
