Snakemake 9: Shadow rules
Take a look at the index_genome rule below:
rule index_genome:
    """
    Index a genome using Bowtie 2.
    """
    output:
        index = expand("results/bowtie2/NCTC8325.{substr}.bt2",
                       substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
    input:
        "data/NCTC8325.fa.gz"
    log:
        "results/logs/index_genome/NCTC8325.log"
    shell:
        """
        # Bowtie2 cannot use .gz, so unzip to a temporary file first
        gunzip -c {input} > tempfile
        bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}

        # Remove the temporary file
        rm tempfile
        """
There is a temporary file here called tempfile, which is the uncompressed version of the input, since Bowtie2 cannot use compressed files. There are a number of drawbacks with having files that aren’t explicitly part of the workflow as input/output files to rules:
- Snakemake cannot clean up these files if the job fails, as it would do for normal output files.
- If several jobs are run in parallel there is a risk that they write to tempfile at the same time. This can lead to very scary results.
- Sometimes we don’t know the names of all the files that a program can generate. It is, for example, not unusual that programs leave some kind of error log behind if something goes wrong.
All of these issues can be dealt with by using the shadow option for a rule. With the shadow option, each execution of the rule is run in an isolated temporary directory (located in .snakemake/shadow/ by default). There are a few options for shadow (for the full list of these options, see the Snakemake docs). The simplest is shadow: "minimal", which means that the rule is executed in an empty directory that the input files to the rule have been symlinked into. For the rule below, that means that the only file available would be input.txt. The shell commands would generate the files some_other_junk_file and output.txt. Lastly, Snakemake will move the output file (output.txt) to its “real” location and remove the whole shadow directory. We therefore never have to think about manually removing some_other_junk_file.
rule some_rule:
    output:
        "output.txt"
    input:
        "input.txt"
    shadow: "minimal"
    shell:
        """
        touch some_other_junk_file
        cp {input} {output}
        """
Try this out for the rules where we have to “manually” deal with files that aren’t tracked by Snakemake (multiqc, index_genome). Also remove the shell commands that remove temporary files from those rules, as they are no longer needed. Now rerun the workflow and validate that the temporary files don’t show up in your working directory.
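To build intuition for why the temporary files no longer show up, the steps described for shadow: "minimal" can be modeled in plain Python. This is only a simplified sketch and not Snakemake's actual implementation: the helper names run_in_shadow and some_rule are hypothetical, the job is passed its working directory instead of changing the process directory, and symlink support (as on Linux or macOS) is assumed.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_shadow(inputs, outputs, job, workdir):
    """Run `job` in a throwaway directory and keep only the declared outputs.

    Simplified model of shadow: "minimal" (hypothetical helper, not
    Snakemake's real code).
    """
    workdir = Path(workdir).resolve()
    shadow_root = workdir / ".snakemake" / "shadow"
    shadow_root.mkdir(parents=True, exist_ok=True)
    shadow_dir = Path(tempfile.mkdtemp(dir=shadow_root))
    try:
        # Symlink the declared inputs into the otherwise empty directory.
        for rel in inputs:
            link = shadow_dir / rel
            link.parent.mkdir(parents=True, exist_ok=True)
            link.symlink_to(workdir / rel)

        # The job only works inside the shadow directory.
        job(shadow_dir)

        # Move the declared outputs to their "real" locations.
        for rel in outputs:
            dest = workdir / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(shadow_dir / rel), str(dest))
    finally:
        # Everything else the job created is removed with the directory.
        shutil.rmtree(shadow_dir, ignore_errors=True)


def some_rule(cwd):
    """Stand-in for the shell commands of some_rule."""
    (cwd / "some_other_junk_file").touch()
    shutil.copy(cwd / "input.txt", cwd / "output.txt")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        (work / "input.txt").write_text("hello\n")
        run_in_shadow(["input.txt"], ["output.txt"], some_rule, work)
        # output.txt is present; some_other_junk_file is not.
        print(sorted(p.name for p in work.iterdir()))
```

Because the cleanup sits in a finally block, the junk file is also removed when the job raises an error, which mirrors the first drawback listed earlier.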
Tip
Some people use the shadow option for almost every rule and some never use it at all. One thing to keep in mind is that it leads to some extra file operations when the outputs are moved to their final location. This is no issue when the shadow directory is on the same disk as the output directory, but if you’re running on a distributed file system and generate very many or very large files it might be worth considering other options (see e.g. the --shadow-prefix flag).
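One rough way to check whether the shadow directory and the output directory are on the same disk is to compare their device IDs. The sketch below is a heuristic under stated assumptions: on_same_device is a hypothetical helper, the two directory names are placeholders for your own layout, and some network or overlay file systems may report device IDs in ways that make the comparison less meaningful.

```python
import os
from pathlib import Path


def on_same_device(path_a, path_b):
    """Return True if both existing paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Placeholder locations: adjust to your own shadow and output directories.
    shadow_dir = Path(".snakemake/shadow")
    output_dir = Path("results")
    if shadow_dir.exists() and output_dir.exists():
        print(on_same_device(shadow_dir, output_dir))
```

When the two directories share a device, moving an output is typically a cheap rename; otherwise the data has to be copied, which is where the extra file operations become noticeable.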
Quick recap
In this section we’ve learned:
- How to use the shadow option to handle files that are not tracked by Snakemake.