Snakemake 3: Visualising workflows

All that we’ve done so far could quite easily be done in a simple shell script that takes the input files as parameters. Let’s now take a look at some of the features where a WfMS like Snakemake really adds value compared to a more straightforward approach. One such feature is the possibility to visualize your workflow. Snakemake can generate two types of graphs, one that shows how the rules are connected and one that shows how the jobs (i.e. an execution of a rule with some given inputs/outputs/settings) are connected.

First we look at the rule graph. The following command will generate a rule graph in the dot language and pipe it to the program dot, which in turn will save a visualization of the graph as a PNG file (if you’re having trouble displaying PNG files you could use SVG or JPG instead).

snakemake --rulegraph a_b.txt | dot -Tpng > rulegraph.png

This looks simple enough: the output from the rule convert_to_upper_case is used as input to the rule concatenate_files.

For a more typical bioinformatics project it can look something like this when you include all the rules from processing of the raw data to generating figures for the paper.

While saying that it’s easy to read might be a bit of a stretch, it definitely gives you a better overview of the project than you would have without a WfMS.

The second type of graph is based on the jobs, and looks like this for our little workflow (use --dag instead of --rulegraph).

snakemake --dag a_b.txt | dot -Tpng > jobgraph.png

The main difference here is that now each node is a job instead of a rule. You can see that the wildcards used in each job are also displayed. Another difference is the dotted lines around the nodes. A dotted line is Snakemake’s way of indicating that this job doesn’t need to be rerun in order to generate a_b.txt. Verify this by running snakemake -n -r a_b.txt; it should say that there is nothing to be done.
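Both graph commands follow the same pattern: Snakemake emits the graph in the dot language and dot renders it. As a minimal sketch, they could be wrapped in a small shell helper. The name render_graph is hypothetical (it is not part of Snakemake), and the function assumes that both snakemake and dot are available on your PATH. The output format is taken from the file extension, so the same helper covers the PNG, SVG and JPG variants.

```shell
# Hypothetical helper, not part of Snakemake: render a workflow graph to a file.
#   $1: graph flag, either --rulegraph or --dag
#   $2: target file of the workflow, e.g. a_b.txt
#   $3: output image; its extension (png, svg, jpg) selects the dot format
render_graph() {
    graph_flag=$1
    target=$2
    out_file=$3
    # Strip everything up to the last "." to get the extension.
    format=${out_file##*.}
    snakemake "$graph_flag" "$target" | dot "-T$format" > "$out_file"
}
```

With this in place, render_graph --rulegraph a_b.txt rulegraph.svg would write the rule graph as SVG, and render_graph --dag a_b.txt jobgraph.png would write the job graph as PNG.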

We’ve discussed before that one of the main purposes of using a WfMS is that it automatically makes sure that everything is up to date. This is done by recursively checking that outputs are always newer than inputs for all the rules involved in the generation of your target files. Now try to change the contents of a.txt to some other text and save it. What do you think will happen if you run snakemake -n -r a_b.txt again?

$ snakemake -n -r a_b.txt

Building DAG of jobs...
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
concatenate_files            1              1              1
convert_to_upper_case        1              1              1
total                        2              1              1


[Mon Oct 25 17:00:02 2021]
rule convert_to_upper_case:
    input: a.txt
    output: a.upper.txt
    jobid: 1
    reason: Updated input files: a.txt
    wildcards: some_name=a
    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T


[Mon Oct 25 17:00:02 2021]
rule concatenate_files:
    input: a.upper.txt, b.upper.txt
    output: a_b.txt
    jobid: 0
    reason: Input files updated by another job: a.upper.txt
    wildcards: first=a, second=b
    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T

Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
concatenate_files            1              1              1
convert_to_upper_case        1              1              1
total                        2              1              1

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Were you correct? Also generate the job graph and compare to the one generated above. What’s the difference? Now rerun without -n and validate that a_b.txt contains the new text (don’t forget to specify -c 1). Note that Snakemake doesn’t look at the contents of files when trying to determine what has changed, only at the timestamp for when they were last modified.
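The timestamp comparison described here can be imitated in plain shell. The sketch below is a simplified stand-in and not Snakemake’s actual implementation: the is_outdated function is hypothetical, the file contents and modification times are made up for the demonstration, and everything happens in a scratch directory so that your real workflow files are left alone.

```shell
# Work in a scratch directory so no real workflow files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

printf 'some text\n' > a.txt
printf 'SOME TEXT\n' > a.upper.txt

# Set explicit modification times (format: YYYYMMDDhhmm).
touch -t 202110251600 a.txt         # input, older
touch -t 202110251700 a.upper.txt   # output, newer

# Hypothetical check: true when the input ($1) is newer than the output ($2).
is_outdated() {
    [ "$1" -nt "$2" ]
}

if is_outdated a.txt a.upper.txt; then before=outdated; else before=up-to-date; fi

# Bump the input's timestamp WITHOUT changing its contents.
touch -t 202110251800 a.txt

if is_outdated a.txt a.upper.txt; then after=outdated; else after=up-to-date; fi

echo "before: $before, after: $after"
```

The contents of a.txt are identical in both checks; only the modification time differs, and that alone is enough to flip the result.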

We’ve seen that Snakemake keeps track of whether files in the workflow have changed, and automatically makes sure that any results depending on such files are regenerated. But what if the rules themselves are changed? It turns out that there are multiple ways to deal with this, but the most straightforward is to manually specify that you want to rerun a rule (and thereby also all the steps between that rule and your target). Let’s say that we want to modify the rule concatenate_files to also include which files were concatenated.

rule concatenate_files:
    input:
        "{first}.upper.txt",
        "{second}.upper.txt"
    output:
        "{first}_{second}.txt"
    shell:
        """
        echo 'Concatenating {input}' | cat - {input[0]} {input[1]} > {output}
        """

Note
It’s not really important for the tutorial, but the shell command used here first outputs “Concatenating” followed by a space delimited list of the files in input. This string is then sent to the program cat where it’s concatenated with input[0] and input[1] (the parameter - means that it should read from standard input). Lastly, the output from cat is sent to {output}.
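To see the effect of cat - outside of Snakemake, the same pipeline can be tried directly in a shell. This is a small sketch using made-up file contents in a scratch directory; the file names mirror the ones in the tutorial, with the {input} and {output} placeholders filled in by hand.

```shell
# Work in a scratch directory so no real workflow files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Stand-ins for the two upper-cased input files.
printf 'AAA\n' > a.upper.txt
printf 'BBB\n' > b.upper.txt

# echo writes the header line to standard output; "cat -" reads it from
# standard input first and then appends the two files in the given order.
echo 'Concatenating a.upper.txt b.upper.txt' | cat - a.upper.txt b.upper.txt > a_b.txt

cat a_b.txt
```

The resulting a_b.txt starts with the “Concatenating ...” line, followed by the contents of the two input files.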

If you now run the workflow as before you should see:

"Nothing to be done" (all requested files are present and up to date).

because no files involved in the workflow have been changed. But there’s also a warning that Snakemake has detected a change in the code used to generate output files in the workflow, and is suggesting different ways to address this:

The code used to generate one or several output files has changed:
    To inspect which output files have changes, run 'snakemake --list-code-changes'.
    To trigger a re-run, use 'snakemake -R $(snakemake --list-code-changes)'.

The first suggestion is to use the --list-code-changes flag. This will list the files for which the rule implementation has changed. Try it by running:

snakemake a_b.txt --list-code-changes

You should see a_b.txt printed to the terminal, meaning that Snakemake has identified changes to the concatenate_files rule which produces the a_b.txt file.

The second suggestion involves triggering a re-run with the -R flag which will force re-creation of the given file (or rule). Try it out by running:

snakemake a_b.txt -n -r -R $(snakemake a_b.txt --list-code-changes)

Here the file to re-create is, as we just saw, given by the $(snakemake a_b.txt --list-code-changes) part, whose output is passed as the argument to -R in the surrounding snakemake a_b.txt -n -r -R command. Enclosing a command inside $(...) is called “command substitution”: the shell runs the enclosed command first and inserts its output into the surrounding command line, here done on the fly as part of the snakemake run. You would have gotten the same result by first using --list-code-changes to get the file name and then passing that name to the -R flag (e.g. -R a_b.txt or -R concatenate_files). However, using command substitution saves you that extra step.
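If command substitution is new to you, the following sketch shows the mechanism in isolation. The function list_changed is a hypothetical stand-in for snakemake a_b.txt --list-code-changes, so the example runs without Snakemake; the commands are only built as strings and printed, never executed.

```shell
# Hypothetical stand-in for "snakemake a_b.txt --list-code-changes".
list_changed() {
    echo "a_b.txt"
}

# Two-step variant: capture the output in a variable, then use it.
changed=$(list_changed)
two_step="snakemake a_b.txt -n -r -R $changed"

# One-step variant: the substitution happens inline.
one_step="snakemake a_b.txt -n -r -R $(list_changed)"

echo "$two_step"
echo "$one_step"
```

Both variants produce the same command line, which is why the inline form can replace the extra step.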

There are a bunch of these --list-xxx-changes flags that can help you keep track of your workflow. You can list all options with snakemake --help.

Whenever you’ve made changes to a rule that will affect the output it’s good practice to force re-execution like this. As of version 7.0.0 (2022-02-23), Snakemake warns you if there have been changes to the workflow, as we just saw above. This is a very helpful feature, especially in cases where several people collaborate on the same workflow but are using it on different files, for example.

You can also export information on how all files were generated (when, by which rule, which version of the rule, and by which commands) to a tab-delimited file like this:

snakemake a_b.txt -c 1 -D > summary.tsv

The content of summary.tsv is shown in the table below:

| output_file | date | rule | version | log-file(s) | input-file(s) | shellcmd | status | plan |
|---|---|---|---|---|---|---|---|---|
| a_b.txt | Mon Oct 25 17:01:46 2021 | concatenate_files | - | | a.upper.txt,b.upper.txt | cat a.upper.txt b.upper.txt > a_b.txt | rule implementation changed | no update |
| a.upper.txt | Mon Oct 25 17:01:46 2021 | convert_to_upper_case | - | | a.txt | tr [a-z] [A-Z] < a.txt > a.upper.txt | ok | no update |
| b.upper.txt | Mon Oct 25 17:01:46 2021 | convert_to_upper_case | - | | b.txt | tr [a-z] [A-Z] < b.txt > b.upper.txt | ok | no update |

You can see in the second-to-last column that the rule implementation for a_b.txt has changed. The last column shows whether Snakemake plans to regenerate the files when it’s next executed. None of the files will be regenerated because Snakemake doesn’t regenerate files by default if the rule implementation changes. From a reproducibility perspective maybe it would be better if this was done automatically, but it would be very computationally expensive and cumbersome if you had to rerun your whole workflow every time you fix a spelling mistake in a comment somewhere. So, it’s up to us to look at the summary table and rerun things as needed.
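For larger workflows, scanning the summary by eye gets tedious. As a rough sketch, the status column can be filtered with awk. The example recreates the three rows shown above as a tab-delimited file in a scratch directory (with the log-file(s) field assumed to be empty) rather than calling Snakemake, and it assumes that status is always the second-to-last column.

```shell
# Work in a scratch directory so no real workflow files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Sample data copied from the summary above; log-file(s) is assumed empty.
{
    printf 'output_file\tdate\trule\tversion\tlog-file(s)\tinput-file(s)\tshellcmd\tstatus\tplan\n'
    printf 'a_b.txt\tMon Oct 25 17:01:46 2021\tconcatenate_files\t-\t\ta.upper.txt,b.upper.txt\tcat a.upper.txt b.upper.txt > a_b.txt\trule implementation changed\tno update\n'
    printf 'a.upper.txt\tMon Oct 25 17:01:46 2021\tconvert_to_upper_case\t-\t\ta.txt\ttr [a-z] [A-Z] < a.txt > a.upper.txt\tok\tno update\n'
    printf 'b.upper.txt\tMon Oct 25 17:01:46 2021\tconvert_to_upper_case\t-\t\tb.txt\ttr [a-z] [A-Z] < b.txt > b.upper.txt\tok\tno update\n'
} > summary.tsv

# Skip the header (NR > 1) and print the output file of every row whose
# status (second-to-last field) is anything other than "ok".
stale=$(awk -F '\t' 'NR > 1 && $(NF - 1) != "ok" { print $1 }' summary.tsv)

echo "$stale"
```

The file names found this way are candidates for the -R flag described earlier.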

You might wonder where Snakemake keeps track of all these things. It stores all information in a hidden subdirectory called .snakemake. This is convenient since it’s easy to delete if you don’t need it anymore and everything is contained in the project directory. Just be sure to add it to .gitignore so that you don’t end up tracking it with git.
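One way to add the entry without creating duplicates is sketched below. It runs in a scratch directory for demonstration purposes; in a real project you would run the touch and grep lines from the directory that holds your .gitignore.

```shell
# Work in a scratch directory for the demonstration.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Make sure the file exists, then append the entry only if no line
# matches it exactly (-x: whole line, -F: fixed string, -q: quiet).
touch .gitignore
grep -qxF '.snakemake/' .gitignore || echo '.snakemake/' >> .gitignore

# Running the same line again does not add a second entry.
grep -qxF '.snakemake/' .gitignore || echo '.snakemake/' >> .gitignore

cat .gitignore
```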

By now you should be familiar with the basic functionality of Snakemake, and you can build advanced workflows with only the features we have discussed here. There’s a lot we haven’t covered though, in particular when it comes to making your workflow more reusable. In the following section we will start with a workflow that is fully functional but not very flexible. We will then gradually improve it, and at the same time showcase some Snakemake features we haven’t discussed yet. Note that this can get a little complex at times, so if you felt that this section was a struggle then you could move on to one of the other tutorials instead.

Quick recap
In this section we’ve learned:

  • How to use --dag and --rulegraph for visualizing the job and rule graphs, respectively.
  • How to force Snakemake to rerun relevant parts of the workflow after there have been changes.
  • How Snakemake tracks changes to files and code in a workflow.