Nextflow
Nextflow is a domain-specific language (DSL) for scalable and reproducible scientific workflows.[1]
Use Cases
Nextflow is best suited for:
- workflows which contain significant complexity in the form of different applications, data formats, repetition, conditional branching, and/or dependencies between all of these
- workflows which have low maximum scaling (i.e. low total number of individual tasks and/or nodes required)
- workflows which do not require fast turnaround times
- pre-existing workflows implemented in Nextflow, e.g. downloaded from nf-core
Installed Versions
Picotte
Multiple versions of Nextflow are installed. Load the appropriate modulefile:
nextflow/21.04.0
nextflow/21.04.6
nextflow/21.10.6
Installing Your Own
Nextflow is a single executable that you can install yourself to any convenient location. Follow the instructions in the Nextflow documentation.[2]
Running Nextflow
General Notes
Nextflow may download a large amount of data, both as Singularity container images and as input data to the workflow. Because home directories are limited to 64 GiB, Nextflow workloads should be run either from the group directory, e.g. /ifs/groups/myrsrchGrp/something/nextflow/, or from a manually created subdirectory of the BeeGFS scratch directory, e.g. /beegfs/scratch/myusername/nextflow/.
N.B. The BeeGFS filesystem is a "scratch" filesystem. "Scratch" means temporary (like "scratch paper" used for writing notes and calculations apart from your main notebook or homework). Files which have not been accessed for 45 days or more will be deleted with no possibility of recovery. So, any outputs which are to be retained should be copied back to the group directory at the end of the job script.
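The copy-back step described above can be sketched as a small shell function placed near the top of the job script and called on its last line. This is only an illustration: the function name copy_results is hypothetical, and it assumes the workflow writes its outputs into a single directory.

```shell
#!/bin/bash
# Hypothetical helper: copy a results directory from scratch into a
# destination directory, preserving timestamps and permissions.
# Usage: copy_results SRC_DIR DEST_DIR
copy_results() {
    local src="$1"
    local dest="$2"

    # Refuse to continue if there is nothing to copy.
    if [ ! -d "$src" ]; then
        echo "copy_results: source directory not found: $src" >&2
        return 1
    fi

    mkdir -p "$dest" || return 1
    # "src/." copies the contents of src (including dotfiles) into dest.
    cp -a "$src/." "$dest/"
}
```

At the end of a job script this might be called as copy_results /beegfs/scratch/myusername/nextflow/results /ifs/groups/myrsrchGrp/something/nextflow/results, where the "results" subdirectory name is an assumption about where the workflow publishes its outputs.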
Executors
In the Nextflow framework architecture, the executor is the component that determines the system where a pipeline process is run and supervises its execution.[3]
Executors which may be used on Picotte include "local" and "slurm".
Singularity Containers
Singularity is a container technology used in HPC. It does not require administrative privileges to run.
Nextflow can use Singularity containers.[4]
For example:
$ nextflow run my_workflow.nf -with-singularity my_container.sif
Published workflows may provide a singularity profile option:
$ nextflow run nf-core/mag -profile test,singularity > OUTPUT.txt 2>&1
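Because container images can be large, it may help to keep them out of the home directory. Nextflow reads the NXF_SINGULARITY_CACHEDIR environment variable to decide where downloaded Singularity images are stored. The sketch below sets it before the nextflow command runs; the function name use_singularity_cache is hypothetical, and the cache location is whatever directory you choose to pass in.

```shell
#!/bin/bash
# Point Nextflow's Singularity image cache at a directory with more room
# than the home directory. NXF_SINGULARITY_CACHEDIR is the environment
# variable Nextflow reads for this; the directory passed in is an assumption.
# Usage: use_singularity_cache CACHE_DIR
use_singularity_cache() {
    local cache_dir="$1"
    # Create the cache directory if it does not exist yet.
    mkdir -p "$cache_dir" || return 1
    export NXF_SINGULARITY_CACHEDIR="$cache_dir"
}
```

For example, use_singularity_cache /ifs/groups/myrsrchGrp/something/nextflow/singularity_cache could be called in a job script before the nextflow run line; the "singularity_cache" directory name is illustrative only. When the Slurm executor is used, the cache directory needs to be on a filesystem visible to all compute nodes.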
Local Executor in a Batch Job
The local executor is used by default. It runs the pipeline processes in the computer where Nextflow is launched.[5]
Write a normal batch job script and include the "nextflow ..." command line in it:
#!/bin/bash
#SBATCH --partition=def
#SBATCH --time=2:00:00
module load nextflow
### nextflow creates a working directory named "work" in the same directory where it was invoked
nextflow run hello
Note that the Nextflow documentation says that “[t]he processes are parallelised by spawning multiple threads and by taking advantage of multi-cores architecture provided by the CPU”. This may not be strictly accurate. Nextflow may launch multiple simultaneous independent processes (i.e. in Linux, each has its own process ID number), each of which may or may not be multithreaded.
To have Slurm allocate a specific number of CPU cores to the job, add an "--ntasks" directive to the job script. See the sbatch documentation[6] (or man page).
### The default value of "--nodes" if left unspecified is 1 (one)
### Slurm will allocate 48 CPU cores in this instance.
#SBATCH --ntasks=48
Important note: bioinformatics workflows may or may not run over multiple nodes (i.e. individual servers); most do NOT: check the requirements and limitations of the specific workflow you intend to run.
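The local executor does not necessarily know how many cores Slurm allocated to the job. One way to keep the two in agreement is to generate a small config fragment from the SLURM_NTASKS environment variable, which Slurm sets inside a job submitted with --ntasks. The sketch below relies on the "cpus" setting of Nextflow's executor scope, which applies to the local executor; the function name local_executor_config is hypothetical.

```shell
#!/bin/bash
# Print a Nextflow config fragment that caps the local executor at the
# number of CPU cores Slurm allocated. SLURM_NTASKS is set by Slurm inside
# a job started with --ntasks; outside a job we fall back to 1.
local_executor_config() {
    local ncpus="${SLURM_NTASKS:-1}"
    cat <<EOF
executor {
    name = 'local'
    cpus = ${ncpus}
}
EOF
}
```

Inside the job script, the fragment could be written to a file with local_executor_config > local_executor.config and passed to Nextflow with nextflow -c local_executor.config run hello. The file name local_executor.config is an arbitrary choice.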
Slurm Executor
WARNING: With its default configuration, a running Nextflow workflow manager process can generate a disruptive number of communication requests to Slurm.
“The slurm executor allows you to run your pipeline script by using the Slurm resource manager. Nextflow manages each process as a separate job that is submitted to the cluster by using the sbatch command.”[7]
This config should reduce the frequency of Slurm requests. Save it as "nextflow.config" in the directory from which you launch Nextflow:
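// Per-process settings: submit each process as a Slurm job
process {
    executor = 'slurm'
    clusterOptions = '-t 00:30:00'
}
// Executor settings: throttle how often Nextflow talks to Slurm
executor {
    queueSize = 5
    pollInterval = '5 min'
    dumpInterval = '6 min'
    queueStatInterval = '5 min'
    exitReadTimeout = '13 min'
    killBatchSize = 30
    submitRateLimit = '20 min'
}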
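Given the warning above, it may be worth confirming that the throttling settings are actually present before launching a workflow with the Slurm executor. The following is a minimal sketch of such a pre-flight check. The function name check_throttle_settings and the choice of which settings to require are assumptions, and it performs a plain text search rather than parsing the config.

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeed only if the given config file
# mentions every throttling setting listed in REQUIRED_SETTINGS.
# This is a plain text search, not a real parse of the config.
REQUIRED_SETTINGS="queueSize pollInterval queueStatInterval submitRateLimit"

check_throttle_settings() {
    local config_file="$1"
    local setting
    local missing=0

    if [ ! -f "$config_file" ]; then
        echo "check_throttle_settings: no such file: $config_file" >&2
        return 1
    fi

    for setting in $REQUIRED_SETTINGS; do
        # -w matches the setting name as a whole word.
        if ! grep -qw "$setting" "$config_file"; then
            echo "missing setting: $setting" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A launch line could then be guarded as check_throttle_settings nextflow.config && nextflow run my_workflow.nf, so that the workflow starts only when the check passes.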
References
[1] Nextflow website
[2] Nextflow Documentation (latest)
[3] Nextflow documentation - Executors
[4] Nextflow Documentation - Singularity containers
[5] Nextflow documentation - Executors - Local