Re: Workflow management with GNU Guix

From: Roel Janssen
Subject: Re: Workflow management with GNU Guix
Date: Tue, 14 Jun 2016 11:16:31 +0200
User-agent: mu4e 0.9.17; emacs 24.5.1

Hello all,

Thank you for your replies.  I will reply to Ricardo's response.

Ricardo Wurmus writes:

> (Resending this as it could not be delivered.)
> Ricardo Wurmus <address@hidden> writes:
>> Hi Roel,
>>> With GNU Guix we are able to install programs to our machines with an 
>>> amazing
>>> level of control over the dependency graph of the programs.  We can now know
>>> what code will run when we invoke a program.  We can now know what the 
>>> impact
>>> of an upgrade will be.  And we can now safely roll-back to previous states.
>>> What seems to be a common practice in research involving data analysis, is
>>> running multiple programs in a chain to transform data from raw to 
>>> specific. 
>>> This is often referred to as a "pipeline" or a "workflow".  Because data 
>>> sets
>>> can be quite large in comparison to the computing power of our laptops, the
>>> data analysis is performed on computing clusters instead of single machines.
>>> The usage of a pipeline/workflow is somewhat different from the package
>>> construction, because we want to run the sequence of commands on different 
>>> data
>>> sets (as opposed to running it on the same source code).  Plus, I would 
>>> like to
>>> integrate it with existing computing clusters that have a job scheduling 
>>> system
>>> in place.  
>>> The reason I think this should be possible with Guix is that it has
>>> everything in place to do software deployment and run-time isolation
>>> (containers).  From there it is a small step to executing programs in an
>>> automated way.
>>> So, I would like to propose a new Guix subcommand and an extension to
>>> the package management language to add workflow management features.
>> I probably don’t understand your idea well enough, but from what I
>> understand it doesn’t really have much to do with packages (other than
>> using them) and store manipulation per se (produced artifacts are not
>> added to the store).  Exactly what features of Guix do you want to build
>> on?

I would like to build on the language used to express packages.  What's
nice about package recipes is that they are understandable, they are
shareable (just copy and paste the recipe), and they produce
reproducible output.

A package recipe describes its entire dependency graph because the
symbols in its inputs resolve to specific versions of external
packages.  This is a very powerful feature for specifying exactly how
to run a program.

>> My perspective on pipelines is that they should be developed like any
>> other software package, treating individual tools as you would treat
>> libraries.  This means that a pipeline would have a configuration step
>> in which it checks for the paths of all tools it needs internally, and
>> then use the full paths rather than assume all tools to be in a
>> directory listed in the PATH variable.

If we used Guix package recipes to describe tools, we wouldn't need to
search for them.  We could simply set up a profile containing these
tools and set the environment variables suggested by Guix accordingly.
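As a concrete sketch, such a profile could be described declaratively
with a manifest (assuming the tools are packaged in Guix; `samtools'
and `gzip' stand in here for the actual tools of a pipeline):

```scheme
;; manifest.scm -- a minimal sketch of a tool profile for a pipeline.
;; Instantiate with: guix package --manifest=manifest.scm --profile=/path/to/profile
(use-modules (gnu packages))

;; Resolve package names to specific package objects; Guix pins the
;; exact versions and the full dependency graph behind them.
(specifications->manifest
 '("samtools" "gzip"))
```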

This way we can generate the exact dependency graph of a pipeline,
leaving no ambiguity to the run-time environment.

>> Distributing jobs to clusters would be the responsibility of the
>> pipeline, e.g. by using DRMMA, which supports several resource
>> management backends and has bindings for a wide range of programming
>> languages.

Wouldn't it be easier to write a pipeline in a language that has the
infrastructure to uniquely describe and deploy a program and its
dependencies?  You don't need to search for available tools; you can
just install them.  If they are already available, installing is merely
a matter of creating a couple of symbolic links.

Here is a translation of a "real-world" process definition from one of
the pipelines I studied into my <process> record type.  It isn't a
perfect example because it uses a package that isn't in Guix.  Anyway:

(define (rnaseq-fastq-quality-control in out)
  (process
   (name "rnaseq-fastq-quality-control")
   (version "1.0")
   (package-inputs `(("fastqc" ,fastqc-bin-0.11.4)))
   (input in)
   (output (string-append out "/" name))
   (interpreter 'guile)
   (source
    `(let ((sample-files (find-files ,in #:directories? #f)))
       ;; Create the output directories.
       (unless (access? ,out F_OK) (mkdir ,out))
       (unless (access? ,output F_OK) (mkdir ,output))
       ;; Perform the analysis step on each FastQ file.
       (for-each (lambda (file)
                   (when (string-suffix? ".fastq.gz" file)
                     (system* "fastqc" "-q" file "-o" ,output)))
                 sample-files)))
   (synopsis "Generate quality control reports for FastQ files")
   (description "This process generates a quality control report
for a single FastQ file.")))

The resulting expression in `source' can be executed with Guile in any
place on a computing cluster (as long as the files are accessible at the
same location on other machines).

This snippet can be copy-pasted elsewhere and included in another
pipeline without specifying which job distribution system should be
used.  We can deal with that at the "workflow" level instead of the
"process" level.
I left the option open to use other scripting languages, but the
definition could be made more compact if we only used Guile.

>>> Would this be a feature you are interested in adding to GNU Guix?
>> Even if it wasn’t part of Guix itself, you could develop it separately
>> and still add it as a Guix command, much like it is currently done for
>> “guix web” (which I think should eventually be part of Guix).

That may be a good idea.

>>> I'm currently working on a proof-of-concept implementation that has three
>>> record types/levels of abstraction:
>>> <workflow>:  Describes which <process>es should be run, and concerns itself 
>>> with
>>>              the order of execution.
>>> <process>:   Describes what packages are needed to run the programs 
>>> involved,
>>>              and its relationship to other processes.  Processes take input 
>>> and
>>>              generate output much like the package construction process.
>>> <script>:    Short and simple imperative instructions to perform a task. 
>>> They are
>>>              part of a <process>.  Currently, my implementation generates a 
>>> shell
>>>              script that can be either Guile, Sh, Perl or Python.
>> From that list it seems as if the only link to Guix is ensuring the
>> environment contains required programs.  This can be done right now with
>> the help of manifests and profiles.
>> I wonder if maybe we could add Guix as a package management backend to
>> existing workflow specification systems (instead of the curiously
>> popular and IMO barely adequate Conda, for example).

That is an option too.  The workflow specification systems overlap with
Guix in describing tools, though.  Take the Common Workflow Language
(CWL), for example: its `requirements' field is the equivalent of
`inputs' and `propagated-inputs' in Guix.

With Guix, we could describe a command-line tool by referring to the
package recipe, and then write the command to run.
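For instance, such a description might look like this (a hypothetical
sketch reusing the <process> record type from above; `samtools' stands
in for an arbitrary packaged tool, and the field names are assumptions):

```scheme
;; Hypothetical sketch: the tool is pinned by referring to a Guix
;; package object directly, so no `requirements' lookup is needed.
(define (samtools-index bam-file)
  (process
   (name "samtools-index")
   (version "1.0")
   (package-inputs `(("samtools" ,samtools)))  ; exact package, exact dependency graph
   (input bam-file)
   (interpreter 'guile)
   (source
    `(system* "samtools" "index" ,input))))
```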

>>> The subcommand I envision is:
>>>   guix workflow
>>> With primarily:
>>>   guix workflow --run=<name-of-workflow-definition>
>>> If you are interested in adding any form of workflow management to GNU 
>>> Guix, I
>>> can elaborate on my proof-of-concept implementation, so we can work from 
>>> there.
>>> (or throw everything out of the window and start from scratch ;-))
>> Could you show us an example workflow?

So, the <process>es look like the snippet provided above.  Then the
workflow itself looks like:

(define (rnaseq-pipeline in out)
  (workflow
   (name "rnaseq-pipeline")
   (version "1.0")
   (input in)
   (output (string-append
            out "/" name "-" (date->string (current-date) "~Y-~m-~d")))
   (restrictions
    `((,rnaseq-fastq-quality-control ,rnaseq-initialize)
      (,rnaseq-align ,rnaseq-initialize)
      (,rnaseq-add-read-groups ,rnaseq-align)
      (,rnaseq-index ,rnaseq-add-read-groups)
      (,rnaseq-collect-alignment-metrics ,rnaseq-index)
      (,rnaseq-feature-readcount ,rnaseq-index)
      (,rnaseq-merge-read-features ,rnaseq-feature-readcount)
      (,rnaseq-compute-rpkm-values ,rnaseq-merge-read-features)
      (,rnaseq-normalize-read-counts ,rnaseq-merge-read-features)
      (,rnaseq-differential-expression ,rnaseq-merge-read-features)))
   (synopsis "RNA sequencing pipeline used at the UMCU")
   (description "The RNAseq pipeline can do quality control on FastQ and BAM
files; align reads against a reference genome; count reads in features;
normalize read counts; calculate RPKM values and perform differential
expression analysis.")))

The `restrictions' field contains dependency pairs (A B), where process
A depends on the successful completion of process B.  From these pairs,
the execution order can be derived.
Thank you all for your time.

Kind regards,
Roel Janssen
