Snakemake 11: Extra material
If you want to read more about Snakemake in general, you can find several resources here:
- The Snakemake documentation is available on ReadTheDocs.
- Here is another (quite in-depth) tutorial.
- If you have questions, check out Stack Overflow.
Using containers in Snakemake
Snakemake also supports defining a Singularity or Docker container for each rule (you will have time to work on the Containers tutorial later during the course). Analogous to using a rule-specific Conda environment, specify

container: "docker://some-account/rule-specific-image"

in the rule definition. Instead of a link to a container image, it is also possible to provide the path to a *.sif file (a Singularity image file). When executing Snakemake, add the --use-singularity flag to the command line. For the given rule, Snakemake will then create a Singularity container on the fly from the image or Singularity file provided in the rule definition, and the rule will be run in this container.
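For the *.sif case, a rule could look something like this (a minimal sketch; the rule name, file paths and command are made-up placeholders):

rule some_rule:
    output:
        "results/some_output.txt"
    container: "containers/some_tool.sif"
    shell:
        """
        some_tool > {output}
        """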
You can find pre-made Singularity or Docker images for many tools on https://biocontainers.pro/ (bioinformatics-specific) or on https://hub.docker.com/.
Here is an example of a rule and its execution:
rule align_to_genome:
    output:
        temp("results/bam/{sample_id,\w+}.bam")
    input:
        fastq = "data/{sample_id}.fastq.gz",
        index = expand("results/bowtie2/{genome_id}.{substr}.bt2",
            genome_id = config["genome_id"],
            substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    log:
        expand("results/logs/align_to_genome/{{sample_id}}_{genome_id}.log",
            genome_id = config["genome_id"])
    container: "docker://quay.io/biocontainers/bowtie2:2.5.0--py310h8d7afc0_0"
    shell:
        """
        bowtie2 -x results/bowtie2/{config[genome_id]} -U {input.fastq} > {output} 2>{log}
        """
Start your Snakemake workflow with the following command:
snakemake --use-singularity
Feel free to modify the MRSA workflow according to this example. Since Singularity is container software developed for HPC clusters (the Mac version, for example, is still in beta), running your updated Snakemake workflow with Singularity locally on your own computer might not work. In the next section we explain how you can run Snakemake workflows on UPPMAX, where Singularity is pre-installed.
Running Snakemake workflows on UPPMAX
There are several options for executing Snakemake workflows on UPPMAX (an HPC cluster with the SLURM workload manager). In any case, we highly recommend using a session manager like tmux or screen so that you can run your workflow in a session in the background while doing other things on the cluster, or even while logged out of the cluster.
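For example, a tmux session could be used like this (a minimal sketch; the session name smk is arbitrary):

tmux new -s smk        # start a new session named "smk"
# ... start your Snakemake workflow inside the session ...
# detach with Ctrl-b d; the session and your workflow keep running
tmux attach -t smk     # re-attach later to check on progress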
Run your workflow in an interactive job
For short workflows with only a few rules that need the same compute resources in terms of CPU (cores), you can start an interactive job (in your tmux or screen session) and run your Snakemake workflow as you would on your local machine. Make sure to give your interactive job enough time to finish running all rules of your Snakemake workflow.
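On UPPMAX this could look something like the following (a sketch; the project id is the placeholder used elsewhere in this section, and the core count and time should match what your workflow needs):

interactive -A snic2023-01-001 -n 4 -t 08:00:00   # request an interactive job
snakemake -j 4                                    # then run the workflow as usual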
Use built-in SLURM support
For workflows with long run times and/or where each rule requires different compute resources, Snakemake can be configured to automatically send each rule as a job to the SLURM queue and to track the status of each job.
Since version 7.19.0, Snakemake comes with built-in support for execution on compute clusters with the SLURM workload manager. To enable this you supply the --slurm flag to your Snakemake command. In addition you need to specify the id of your compute project (e.g. snic2023-01-001). This can be done directly on the command line with --default-resources slurm_account=snic2023-01-001. You also need to specify the number of jobs that Snakemake will queue at the same time with -j, e.g. -j 100 to allow up to 100 jobs to be put into the queue at any given time. So the command would be (in addition to any other flags you may want to use):

snakemake --slurm --default-resources slurm_account=snic2023-01-001 -j 100
Snakemake will submit each job to the SLURM queue and inform you about both the local jobid and the SLURM jobid by writing something similar to this to your terminal:
Job 0 has been submitted with SLURM jobid 37099380 (log: .snakemake/slurm_logs/rule_name/37099380.log).
In this example the log output from the job will be in .snakemake/slurm_logs/rule_name/37099380.log.
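If you want to check on a submitted job yourself you can use the standard SLURM commands, for example (the jobid below is the one from the example output above):

squeue -u $USER       # list your queued and running jobs
sacct -j 37099380     # show status/accounting info for a specific job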
So how do you specify SLURM resources such as runtime, CPUs etc.? The best way to do that is to use the resources: and threads: directives in the rules of your workflow. This allows you to fine-tune jobs to run with individual runtime and CPU usage. Take a look at the example rule below:
rule testrule:
    output:
        "results/output.txt"
    resources:
        runtime = 60
    threads: 4
    shell:
        """
        uname -a > {output}
        """
This rule uses the standard resource runtime to set the maximum allowed time for the rule to 60 minutes, and the threads: directive to request 4 CPUs. The corresponding SLURM job will thus have a time limit of 60 minutes and run on 4 CPUs.
Of course you could set the runtime and threads using a configuration file as we have shown in earlier sections of this tutorial, e.g. with a config file that contains:
testrule:
  threads: 4
  runtime: 60
followed by updating the rule to:
rule testrule:
    output:
        "results/output.txt"
    resources:
        runtime = config["testrule"]["runtime"]
    threads: config["testrule"]["threads"]
    shell:
        """
        uname -a > {output}
        """
Note that when using the --slurm flag, -j only specifies the number of jobs that can be sent to the queue at any given time, while the number of CPUs used for each job is set via the threads: directive.
The resources: directive can also be used to specify constraints, for instance if jobs need to run on nodes with more memory. On the UPPMAX compute cluster you can use the following:

resources:
    constraint = "mem256GB"
If you need to submit the job to another cluster, e.g. the 'snowy' cluster at UPPMAX, you can do so with the slurm_extra keyword in the resources: directive:

resources:
    slurm_extra = "-M snowy"
You can read more details about running Snakemake on compute clusters in the Snakemake docs.
SLURM Profile
As an alternative to the built-in SLURM support you can also use a configuration profile developed for SLURM, such as https://github.com/Snakemake-Profiles/slurm or the more lightweight https://github.com/jdblischak/smk-simple-slurm.
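Once you have such a profile set up, you point Snakemake to it with the --profile flag, e.g. (a sketch; the name slurm is an assumption and depends on where and under what name you installed the profile, Snakemake looks in ~/.config/snakemake/<name> by default):

snakemake --profile slurm -j 100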