Snakemake 3: Visualising workflows
All that we’ve done so far could quite easily be done in a simple shell script that takes the input files as parameters. Let’s now take a look at some of the features where a WfMS like Snakemake really adds value compared to a more straightforward approach. One such feature is the possibility to visualize your workflow. Snakemake can generate two types of graphs, one that shows how the rules are connected and one that shows how the jobs (i.e. an execution of a rule with some given inputs/outputs/settings) are connected.
First we look at the rule graph. The following command will generate a rule graph in the dot language and pipe it to the program dot, which in turn will save a visualization of the graph as a PNG file (if you’re having trouble displaying PNG files you could use SVG or JPG instead).
snakemake --rulegraph a_b.txt | dot -Tpng > rulegraph.png
This looks simple enough: the output from the rule convert_to_upper_case will be used as input to the rule concatenate_files.
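To make the idea of "a rule graph in the dot language" a bit more concrete, here is a minimal Python sketch that writes dot text for our two rules by hand. The `rule_dependencies` mapping and the `to_dot` helper are hypothetical names made up for this illustration. The real output of `snakemake --rulegraph` contains additional node ids and styling, so treat this only as an approximation of the format.

```python
# Hypothetical, hand-written description of the two-rule workflow:
# each rule maps to the rules whose output it consumes.
rule_dependencies = {
    "concatenate_files": ["convert_to_upper_case"],
    "convert_to_upper_case": [],
}


def to_dot(dependencies):
    """Render a rule-dependency mapping as text in the dot language."""
    lines = ["digraph rulegraph {"]
    # Declare one node per rule.
    for rule in sorted(dependencies):
        lines.append(f'    "{rule}";')
    # Edges point from the producing rule to the consuming rule.
    for rule in sorted(dependencies):
        for upstream in dependencies[rule]:
            lines.append(f'    "{upstream}" -> "{rule}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(rule_dependencies)
print(dot_text)
```

Text like this is what the dot program reads on standard input before it draws the picture.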
For a more typical bioinformatics project it can look something like this when you include all the rules from processing of the raw data to generating figures for the paper.
While saying that it’s easy to read might be a bit of a stretch, it definitely gives you a better overview of the project than you would have without a WfMS.
The second type of graph is based on the jobs, and looks like this for our little workflow (use --dag instead of --rulegraph).
snakemake --dag a_b.txt | dot -Tpng > jobgraph.png
The main difference here is that now each node is a job instead of a rule. You can see that the wildcards used in each job are also displayed. Another difference is the dotted lines around the nodes. A dotted line is Snakemake’s way of indicating that this job doesn’t need to be rerun in order to generate a_b.txt. Validate this by running snakemake -n -r a_b.txt and it should say that there is nothing to be done.
We’ve discussed before that one of the main purposes of using a WfMS is that it automatically makes sure that everything is up to date. This is done by recursively checking that outputs are always newer than inputs for all the rules involved in the generation of your target files. Now try to change the contents of a.txt to some other text and save it. What do you think will happen if you run snakemake -n -r a_b.txt again?
$ snakemake -n -r a_b.txt
Building DAG of jobs...
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
concatenate_files            1              1              1
convert_to_upper_case        1              1              1
total                        2              1              1
[Mon Oct 25 17:00:02 2021]
rule convert_to_upper_case:
input: a.txt
output: a.upper.txt
jobid: 1
reason: Updated input files: a.txt
wildcards: some_name=a
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
[Mon Oct 25 17:00:02 2021]
rule concatenate_files:
input: a.upper.txt, b.upper.txt
output: a_b.txt
jobid: 0
reason: Input files updated by another job: a.upper.txt
wildcards: first=a, second=b
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
concatenate_files            1              1              1
convert_to_upper_case        1              1              1
total                        2              1              1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Were you correct? Also generate the job graph and compare to the one generated above. What’s the difference? Now rerun without -n and validate that a_b.txt contains the new text (don’t forget to specify -c 1). Note that Snakemake doesn’t look at the contents of files when trying to determine what has changed, only at the timestamp for when they were last modified.
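The timestamp comparison can be sketched in a few lines of Python. This is only an illustration of the idea for a single step, not Snakemake’s actual implementation. The helper name `is_outdated` is hypothetical, and the modification times are set by hand so the example doesn’t depend on how fast you type.

```python
import os
import tempfile


def is_outdated(output_path, input_paths):
    """Return True if the output is missing or older than any of its inputs."""
    if not os.path.exists(output_path):
        return True
    output_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(path) > output_mtime for path in input_paths)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "a.txt")
    result = os.path.join(workdir, "a.upper.txt")
    for path in (source, result):
        with open(path, "w") as handle:
            handle.write("text\n")

    # Set explicit modification times (seconds since the epoch).
    os.utime(source, (1000, 1000))
    os.utime(result, (2000, 2000))
    before_edit = is_outdated(result, [source])  # output is newer than input

    os.utime(source, (3000, 3000))  # simulate saving a new version of a.txt
    after_edit = is_outdated(result, [source])  # input is now newer than output

print(before_edit, after_edit)
```

Note that the file contents are never read. Only the modification times decide the result, which is why simply re-saving a.txt is enough to make its downstream files count as outdated.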
We’ve seen that Snakemake keeps track of whether files in the workflow have changed, and automatically makes sure that any results depending on such files are regenerated. What about if the rules themselves are changed? It turns out that there are multiple ways to deal with this, but the most straightforward is to manually specify that you want to rerun a rule (and thereby also all the steps between that rule and your target). Let’s say that we want to modify the rule concatenate_files to also include which files were concatenated.
rule concatenate_files:
    input:
        "{first}.upper.txt",
        "{second}.upper.txt"
    output:
        "{first}_{second}.txt"
    shell:
        """
        echo 'Concatenating {input}' | cat - {input[0]} {input[1]} > {output}
        """
Note
It’s not really important for the tutorial, but the shell command used here first outputs “Concatenating” followed by a space-delimited list of the files in input. This string is then sent to the program cat where it’s concatenated with input[0] and input[1] (the parameter - means that it should read from standard input). Lastly, the output from cat is sent to {output}.
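If the shell pipeline is hard to read, the following Python sketch does roughly the same thing: it writes a header line and then the contents of each input file. The function name `concatenate_with_header` and the file contents (“HELLO” and “WORLD”) are made up for the illustration. In the real workflow the header would list the input names exactly as Snakemake expands {input}.

```python
import os
import tempfile


def concatenate_with_header(input_paths, output_path):
    """Roughly: echo 'Concatenating <inputs>' | cat - <inputs> > <output>"""
    header = "Concatenating " + " ".join(input_paths) + "\n"
    with open(output_path, "w") as out:
        out.write(header)  # the part that cat reads from standard input ("-")
        for path in input_paths:
            with open(path) as handle:
                out.write(handle.read())  # then each input file in order


with tempfile.TemporaryDirectory() as workdir:
    inputs = []
    # Hypothetical contents for the two upper-cased files.
    for name, text in (("a.upper.txt", "HELLO\n"), ("b.upper.txt", "WORLD\n")):
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write(text)
        inputs.append(path)

    output = os.path.join(workdir, "a_b.txt")
    concatenate_with_header(inputs, output)
    with open(output) as handle:
        lines = handle.read().splitlines()

print(lines)
```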
If you now run the workflow as before you should see:
"Nothing to be done" (all requested files are present and up to date).
because no files involved in the workflow have been changed. But there’s also a warning that Snakemake has detected a change in the code used to generate output files in the workflow, and is suggesting different ways to address this:
The code used to generate one or several output files has changed:
To inspect which output files have changes, run 'snakemake --list-code-changes'.
To trigger a re-run, use 'snakemake -R $(snakemake --list-code-changes)'.
The first suggestion is to use the --list-code-changes flag. This will list the files for which the rule implementation has changed. Try it by running:
snakemake a_b.txt --list-code-changes
You should see a_b.txt printed to the terminal, meaning that Snakemake has identified changes to the concatenate_files rule which produces the a_b.txt file.
The second suggestion involves triggering a re-run with the -R flag which will force re-creation of the given file (or rule). Try it out by running:
snakemake a_b.txt -n -r -R $(snakemake a_b.txt --list-code-changes)
Here the file to re-create is, as we just saw, given by the $(snakemake a_b.txt --list-code-changes) part, which is then used as a target for the snakemake a_b.txt -n -r -R part. Enclosing the command inside $(...) is called “command substitution”: the shell runs the enclosed command first and inserts its output into the surrounding command line, here done on the fly together with the snakemake run. You would have gotten the same result by first using --list-code-changes to get the file name, then the -R flag to target that file for re-creation (e.g. -R a_b.txt or -R concatenate_files). However, using command substitution saves you that extra step.
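To see what the shell does during command substitution, here is a small Python sketch of the same steps. It doesn’t call Snakemake at all: the inner command is a stand-in that simply prints a_b.txt, which is what we assume `snakemake a_b.txt --list-code-changes` printed above. The outer command is only assembled and printed, not executed.

```python
import subprocess
import sys

# Stand-in for `snakemake a_b.txt --list-code-changes` (assumption: it prints a_b.txt).
inner_command = [sys.executable, "-c", "print('a_b.txt')"]

# Step 1: run the inner command and capture what it writes to standard output.
inner_output = subprocess.run(
    inner_command, capture_output=True, text=True, check=True
).stdout

# Step 2: split the output on whitespace, as the shell does for an unquoted $(...).
substituted_args = inner_output.split()

# Step 3: splice the resulting words into the outer command line.
outer_command = ["snakemake", "a_b.txt", "-n", "-r", "-R"] + substituted_args
print(" ".join(outer_command))
```

The printed line is the command the shell ends up running once the $(...) part has been replaced by its output.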
There are a bunch of these --list-xxx-changes flags that can help you keep track of your workflow. You can list all options with snakemake --help.
Whenever you’ve made changes to a rule that will affect the output it’s good practice to force re-execution like this. As of version 7.0.0 (2022-02-23), Snakemake warns you if there have been changes to the workflow, as we just saw above. This is a very helpful feature, especially in cases where several people collaborate on the same workflow but are using it on different files, for example.
You can also export information on how all files were generated (when, by which rule, which version of the rule, and by which commands) to a tab-delimited file like this:
snakemake a_b.txt -c 1 -D > summary.tsv
The content of summary.tsv is shown in the table below:
output_file | date | rule | version | log-file(s) | input-file(s) | shellcmd | status | plan |
a_b.txt | Mon Oct 25 17:01:46 2021 | concatenate_files | - | | a.upper.txt,b.upper.txt | cat a.upper.txt b.upper.txt > a_b.txt | rule implementation changed | no update |
a.upper.txt | Mon Oct 25 17:01:46 2021 | convert_to_upper_case | - | | a.txt | tr [a-z] [A-Z] < a.txt > a.upper.txt | ok | no update |
b.upper.txt | Mon Oct 25 17:01:46 2021 | convert_to_upper_case | - | | b.txt | tr [a-z] [A-Z] < b.txt > b.upper.txt | ok | no update |
You can see in the second-to-last column that the rule implementation for a_b.txt has changed. The last column shows if Snakemake plans to regenerate the files when it’s next executed. None of the files will be regenerated because Snakemake doesn’t regenerate files by default if the rule implementation changes. From a reproducibility perspective maybe it would be better if this was done automatically, but it would be very computationally expensive and cumbersome if you had to rerun your whole workflow every time you fix a spelling mistake in a comment somewhere. So, it’s up to us to look at the summary table and rerun things as needed.
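For larger workflows, scanning the summary by eye gets tedious. A short Python sketch like the one below could pick out the rows whose status is something other than ok. It assumes that the column names in the file match the table header shown above, and it reads the rows from a string copied from that table rather than from summary.tsv itself. The helper name `files_needing_attention` is hypothetical.

```python
import csv
import io

# Rows copied from the table above, written out as tab-delimited text.
summary_text = (
    "output_file\tdate\trule\tversion\tlog-file(s)\tinput-file(s)\tshellcmd\tstatus\tplan\n"
    "a_b.txt\tMon Oct 25 17:01:46 2021\tconcatenate_files\t-\t\ta.upper.txt,b.upper.txt\t"
    "cat a.upper.txt b.upper.txt > a_b.txt\trule implementation changed\tno update\n"
    "a.upper.txt\tMon Oct 25 17:01:46 2021\tconvert_to_upper_case\t-\t\ta.txt\t"
    "tr [a-z] [A-Z] < a.txt > a.upper.txt\tok\tno update\n"
    "b.upper.txt\tMon Oct 25 17:01:46 2021\tconvert_to_upper_case\t-\t\tb.txt\t"
    "tr [a-z] [A-Z] < b.txt > b.upper.txt\tok\tno update\n"
)


def files_needing_attention(handle):
    """Return (output_file, status) pairs for rows whose status is not 'ok'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["output_file"], row["status"])
        for row in reader
        if row["status"] != "ok"
    ]


flagged = files_needing_attention(io.StringIO(summary_text))
print(flagged)
```

To run it on a real summary, you would pass an opened summary.tsv to the function instead of the io.StringIO object.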
You might wonder where Snakemake keeps track of all these things. It stores all information in a hidden subdirectory called .snakemake. This is convenient since it’s easy to delete if you don’t need it anymore and everything is contained in the project directory. Just be sure to add it to .gitignore so that you don’t end up tracking it with git.
By now you should be familiar with the basic functionality of Snakemake, and you can build advanced workflows with only the features we have discussed here. There’s a lot we haven’t covered though, in particular when it comes to making your workflow more reusable. In the following section we will start with a workflow that is fully functional but not very flexible. We will then gradually improve it, and at the same time showcase some Snakemake features we haven’t discussed yet. Note that this can get a little complex at times, so if you felt that this section was a struggle then you could move on to one of the other tutorials instead.
Quick recap
In this section we’ve learned:
- How to use --dag and --rulegraph for visualizing the job and rule graphs, respectively.
- How to force Snakemake to rerun relevant parts of the workflow after there have been changes.
- How Snakemake tracks changes to files and code in a workflow.