Re: leaky pipelines and Guix
From: myglc2
Subject: Re: leaky pipelines and Guix
Date: Fri, 04 Mar 2016 18:29:20 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)
address@hidden (Ludovic Courtès) writes:
> Ricardo Wurmus <address@hidden> skribis:
>> [...]
>> So, how could I package something like that? Is packaging the wrong
>> approach here and should I really just be using “guix environment” to
>> prepare a suitable environment, run the pipeline, and then exit?
> Maybe packages are the wrong abstraction here?
>
> IIUC, a pipeline is really a function that takes inputs and produces
> output(s). So it can definitely be modeled as a derivation.
I have built and run reproducible pipelines on HPC clusters for the last
5 years. IMO the derivation model fits (disclaimer: I am still trying to
figure Guix out ;)
I think of a generic pipeline as a series of job steps (series of
derivations). Job steps must be configured at a meta level in terms of
parameters, dependencies, inputs and outputs. I found Grid Engine qmake
(which is GNU Make integrated with the Grid Engine scheduler) extremely
useful for this. I used it to configure & manage the pipeline, express
dependencies, partition & manage parallel tasks, deal with error
conditions, and manage starts and re-starts. Such pipeline jobs ran for
weeks without incident.
I dealt with the input/output problem using a recursive sub-make
architecture in which data flowed up the analysis (make) directory
tree. I dealt with modularity by using git submodules. I checked results
into git for provenance. The only real fly in the ointment was that make
uses time stamps. What you really want is a hash or a git SHA. Of course
there were also hideous problems w/software configuration, but I expect
guix will solve those :=)
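To make the timestamp point concrete, here is a minimal shell sketch of a content-hash check that a job step could use in place of make's mtime comparison. It is not something I ran in those pipelines; the function names and the stamp-file convention are made up for illustration, and it assumes sha256sum from GNU coreutils is on the PATH.

```shell
#!/bin/sh
# Hypothetical helpers: decide whether a step is stale by comparing a
# digest of its input files, not their modification times.

# Print one digest summarizing all the given files. sha256sum prints
# "hash  name" per file, so renaming an input also changes the digest.
inputs_digest() {
    sha256sum -- "$@" | sha256sum | cut -d' ' -f1
}

# needs_rebuild STAMP FILE...
# Succeeds when STAMP is missing or records a different digest.
needs_rebuild() {
    stamp=$1; shift
    [ ! -f "$stamp" ] || [ "$(cat "$stamp")" != "$(inputs_digest "$@")" ]
}

# record_inputs STAMP FILE...
# Save the current digest after the step has completed successfully.
record_inputs() {
    stamp=$1; shift
    inputs_digest "$@" > "$stamp"
}

# Example use inside a step (file and command names are placeholders):
#   if needs_rebuild align.stamp reads.fq ref.fa; then
#       ./align reads.fq ref.fa > aligned.out &&
#           record_inputs align.stamp reads.fq ref.fa
#   fi
```

With this, touching an input without changing its contents does not trigger a re-run, which is the behavior I was missing with plain make.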
> Perhaps as a first step you could try and write a procedure and a CLI
> around it that simply runs a given pipeline:
>
> $ guix biopipeline foo frog.dna human.dna
> …
> /gnu/store/…-freak.dna
>
> The procedure itself would be along the lines of:
>
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>       #~(begin
>           (setenv "PATH" "/foo/bar")
>           (invoke-make-and-co #$input1 #$input2
>                               #$output))))
Sidebar:
- What is "biopipeline" above? A new guix command?
- Should "foo-pipeline" read "foo", or vice versa?
>> [...]
>> However, most pipelines do not take this approach. Pipelines are often
>> designed as glue (written in Perl, or as Makefiles) that ties together
>> other tools in some particular order. These tools are usually assumed
>> to be available on the PATH.
Yes, these pipelines are generally badly designed.
>> [...]
>> I can easily create a shared profile containing the tools that are
>> needed by a particular pipeline and provide a wrapper script that
>> does something like this (pseudo-code):
>>
>> bash
>> eval $(guix package --search-paths=prefix)
>> do things
>> exit
>>
>> But I wouldn’t want to do this for individual users, letting them
>> install all tools in a separate profile to run that pipeline, run
>> something like the above to set up the environment, then fetch the
>> tarball containing the glue code that constitutes the pipeline
>> (because we wouldn’t offer a Guix package for something that’s not
>> usable without so much effort to prepare an environment first),
>> unpack it and then run it inside that environment.
>>
>> To me this seems to be in the twilight zone between proper packaging and
>> a use-case for “guix environment”. I welcome any comments about how to
>> approach this and I’m looking forward to the many practical tricks that
>> I must have overlooked.
An attraction of Guix is the possibility of placing job step inputs and
outputs in the store, or something like the store. So how about
integrating GNU Make with Guix to enable job steps that are equivalent
to ...
step1:
guix environment foo bar && read the store, do things, save in store
... Or maybe something like ...
step2:
send to guix-daemon:
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>       #~(begin
>           (setenv "PATH" "/foo/bar")
>           (invoke-make-and-co #$input1 #$input2
>                               #$output))))
Then you can provide a pipeline by providing a makefile.
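For the step1 variant, a minimal shell sketch of what a make recipe might call is below. It only covers the "guix environment foo bar && do things" part; reading from and saving to the store is exactly the missing piece and is not shown. The run_step name and the GUIX override variable are hypothetical, and foo and bar are the placeholder package names from step1. As far as I know, "--ad-hoc" and the "--" command separator are existing "guix environment" options.

```shell
#!/bin/sh
# run_step PACKAGES COMMAND [ARG...]
# Hypothetical wrapper: run one job step's command inside an ad-hoc
# Guix environment containing the space-separated PACKAGES.
# GUIX may be overridden (e.g. GUIX=echo) for a dry run.
run_step() {
    pkgs=$1; shift
    # $pkgs is left unquoted on purpose so it splits into package names.
    ${GUIX:-guix} environment --ad-hoc $pkgs -- "$@"
}

# Example (placeholder names), as it might appear in a recipe:
#   run_step "foo bar" make -C step1
```

Setting GUIX=echo prints the command line without running it, which is handy for checking what each step would execute.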
Things needed to make this work:
- make integration with guix store
- make integration with guix-daemon
- HPC scheduler: qmake allows specific HPC resources to be requested for
  each job step (e.g. memory, slots (CPUs)). Grid Engine uses these to
  determine where the steps run. Maybe these features could be achieved
  by running the guix daemon over a scheduler, like Slurm. Or maybe by
  submitting job steps to Slurm which are in turn run by the daemon?
(disclaimer, I am still trying to figure Guix out ;) - George