
Re: parallel + blast + LSF


From: George Marselis
Subject: Re: parallel + blast + LSF
Date: Wed, 15 Apr 2015 20:44:26 +0300

By the way, LSF and GNU parallel do almost the same thing, so using one of the two largely defeats the purpose of using the other.

In the same way, you could have used LSF itself to submit your jobs:

bsub < script.sh

where script.sh contains

bsub -J amoeba -q smalljobs  qfasta file1
bsub -J amoeba -q smalljobs  qfasta file2
...
bsub -J amoeba -q smalljobs  qfasta file2000
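
A script like this does not have to be typed by hand. The following is a minimal sketch of how a pipeline might generate it, assuming the job name, queue, and qfasta command from the example above and input files named file1 to file2000. The function name gen_submit_script is hypothetical, and the sketch only writes the file; submitting it is left as a comment.

```shell
#!/bin/bash
# Sketch only: writes one "bsub ..." line per input file into a script file.
# gen_submit_script is a hypothetical helper; "amoeba", "smalljobs" and
# "qfasta" are simply the names used in the example above.
gen_submit_script() {
    local count=$1 out=$2
    : > "$out"                      # start from an empty file
    local i
    for i in $(seq 1 "$count"); do
        printf 'bsub -J amoeba -q smalljobs  qfasta file%d\n' "$i" >> "$out"
    done
}

gen_submit_script 2000 script.sh
echo "wrote $(wc -l < script.sh) lines to script.sh"

# On a host with LSF installed, the generated file would then be submitted with:
#   bsub < script.sh
```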

On Wed, Apr 15, 2015 at 8:39 PM, George Marselis <address@hidden> wrote:
Hi. LSF/Openlava sysadmin in bioinformatics and parallel user here.

I have seen this a couple of times before: you are trying to use GNU parallel to submit the jobs to all nodes.

That's not the way to do things: you should not submit jobs from *all* your nodes. Please don't do that, as bsub was not designed to read large chunks of jobs. bsub writes the jobs to your home directory, so if your storage is not designed for a lot of writes, you are going to blow the cluster out of the water.

What you want to do is look up either: 

1. bsub scripts https://rc.fas.harvard.edu/resources/documentation/legacy-lsf/lsf-submit-an-lsf-job/

or 

2. job arrays https://rc.fas.harvard.edu/resources/documentation/legacy-lsf/lsf-submitting-lots-of-short-jobs-job-arrays/

Both bsub scripts and job arrays are useful to you. bsub scripts can be submitted as part of a pipeline: your pipeline can generate the bsub script as its output and then submit it to bsub. So, instead of submitting your job 2000 times, as in

bsub job0
bsub job1

....

bsub job1999

you just submit "bsub < scriptname", where scriptname contains 2000 lines that describe your jobs, and you are done. The rest is done by bsub/LSF.


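Now, if your jobs are similar and differ only by an incrementing counter (as in most bioinformatics jobs), use job arrays:

bsub -J "JOBNAME[1-2000]"

where JOBNAME is a string you would like to name your job as, e.g. "fasta files alignment". Quote the argument so the shell does not interpret the square brackets, and note that LSF array indices are positive integers, so the range starts at 1 rather than 0.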
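
As a rough sketch of what an array job script could look like: LSF exports the element's index in the LSB_JOBINDEX environment variable, and the script uses it to pick its input. The job name, queue, and qfasta command are borrowed from the earlier example, the fileN naming is an assumption, and input_for_index is a hypothetical helper. The "%100" suffix is LSF's array slot limit, which caps how many elements run at once. Outside LSF the script falls back to index 1 and only prints what it would run.

```shell
#!/bin/bash
#BSUB -J "amoeba[1-2000]%100"   # 2000 elements, at most 100 running at once
#BSUB -q smalljobs              # queue name taken from the example above
#BSUB -o amoeba.%J.%I.out       # %J is the job ID, %I the array index

# Hypothetical helper: map an array index to an input file name.
# Assumes inputs are named file1 ... file2000.
input_for_index() {
    printf 'file%d' "$1"
}

# LSF sets LSB_JOBINDEX for each array element; default to 1 for a dry run.
index=${LSB_JOBINDEX:-1}
input=$(input_for_index "$index")

echo "array element ${index} would run: qfasta ${input}"
# On the cluster, replace the echo with the real command:
#   qfasta "${input}"
```

Saved as, say, array.sh, this would be submitted once with "bsub < array.sh", and LSF would expand it into the individual elements.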


These techniques are useful because you can submit all 2000 jobs in less than a second, you can do it from a single node and you will not have to deal with a grumpy sysadmin or grumpy colleagues who cannot use the cluster. Just make sure you use the appropriate queue.

Let me know if you have any questions.

Best Regards,

George Marselis

On Wed, Apr 15, 2015 at 6:48 PM, Martin d'Anjou <address@hidden> wrote:
Hi,

Thanks for clarifying. I want to use GNU Parallel to bsub jobs. This way I can use GNU Parallel to throttle the number of jobs that are submitted to LSF, and it is easier than writing a loop.

parallel -j 100 my_script [bsub options] ::: {1..2000}

my_script (pseudo-code):
#!/bin/bash
...
bsub [bsub options] command ...
post-process data

This way I can submit jobs, say 100 at a time. When I submit all 2000 jobs, it gets problematic and I start hitting limits with file descriptors, etc.

Thanks for sharing,
Martin


On 15-04-15 11:35 AM, Giuseppe Aprea wrote:
Hi Martin,

I am not sure I understand. As far as I can see, things work exactly the opposite way: you have an LSF script which launches GNU Parallel on some hosts provided by LSF. Something like:

-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
#!/bin/bash

#BSUB -J gnuParallel_blast_test   # Name of the job.
#BSUB -o %J.out                   # Appends std output to file %J.out. (%J is the Job ID)
#BSUB -e %J.err                   # Appends std error to file %J.err.
#BSUB -q large                    # Queue name.
#BSUB -n 30                       # Number of CPUs.

module load 4.8.3/ncbi/12.0.0
module load 4.8.3/parallel/20150122

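# One line per allocated slot, so the line count is the slot count.
SLOTS=$(wc -l < "${LSB_DJOB_HOSTFILE}")

# Start from an empty sshlogin file so reruns do not append duplicates.
: > servers

for i in $(sort "${LSB_DJOB_HOSTFILE}")
do
echo "/afs/enea.it/software/bin/blaunch.sh ${i}" >> servers
done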

cat absolute_path_to_sequences.fasta | parallel --no-notice -vv -j ${SLOTS} --slf servers --plain --recstart '>' -N 1 --pipe blastp -evalue 1e-05 -outfmt 6 -db absolute_path_to_db_file -query - -out absolute_path_to_result_file_{%}
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
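
The host-file handling in the script above can be tried outside LSF. The sketch below assumes, as that script does, that the file named by LSB_DJOB_HOSTFILE holds one line per allocated slot, so a host appears once for every slot it contributes. The host names node01 and node02 are made up, and a temporary file stands in for the real host file.

```shell
#!/bin/bash
# Stand-in for ${LSB_DJOB_HOSTFILE}; node01/node02 are invented host names.
hostfile=$(mktemp)
printf '%s\n' node02 node01 node02 node01 node01 > "$hostfile"

# Total slots: one line per slot (arithmetic expansion strips any padding
# that some wc implementations add).
slots=$(wc -l < "$hostfile")
slots=$((slots))

# Slots per host: count how often each host name appears.
per_host=$(sort "$hostfile" | uniq -c | awk '{print $2 "=" $1}')

echo "total slots: ${slots}"
echo "${per_host}"

rm -f "$hostfile"
```

Seeing the per-host counts makes it easier to judge whether the value passed to -j matches what each host was actually granted.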

LSF is the one that gives you the execution hosts, so if you are launching bsub from GNU parallel, how do you know how to set the --slf option?


g



On Wed, Apr 15, 2015 at 4:24 PM, Martin d'Anjou <address@hidden> wrote:
On 15-04-15 09:34 AM, Giuseppe Aprea wrote:
Hi all,

I would like to ask for some help using parallel with the blast alignment software.


I am trying to use GNU parallel v. 20150122 with blast for a very large sequence alignment. I am using Parallel on a cluster which uses LSF as its queue system.

Hello Giuseppe,

I am an avid LSF user, and I want to use GNU Parallel to dispatch jobs to LSF. Could you please explain a little bit to me how GNU Parallel works with LSF? I do not see it in the on-line tutorials. For example, I would like to understand how to pass "bsub" options like -oo, -q queue_name, etc. to LSF from GNU Parallel.

Thanks,
Martin






