Snakemake 2: The basics
In this part of the tutorial we will create a very simple workflow from scratch, in order to show the fundamentals of how Snakemake works. The workflow will take two files as input, a.txt and b.txt, convert the text in the files to upper case, and then concatenate the results.
Run the following shell commands. The first one will make an empty file named Snakefile, which will later contain the workflow. The second and third commands generate two files containing some arbitrary text.
touch Snakefile
echo "This is a.txt" > a.txt
echo "This is b.txt" > b.txt
Then open Snakefile in your favourite text editor. A Snakemake workflow is based on rules, which take some file(s) as input, perform some operation on them, and generate some file(s) as output. Here is a very simple rule that produces a.upper.txt as output, using a.txt as input. Copy this rule to your Snakefile and save it.
rule convert_to_upper_case:
    output:
        "a.upper.txt"
    input:
        "a.txt"
    shell:
        """
        tr [a-z] [A-Z] < {input} > {output}
        """
Attention!
Indentation is important in Snakefiles, so make sure that you have the correct number of spaces before input/output/shell and their respective subsections. The number of spaces per level doesn't matter as long as you're consistent. Here we use four, but you could just as well use two for a more compact look. Don't use tabs (unless your editor automatically converts them to spaces).
Rules can be given names, here it's convert_to_upper_case. While rule names are not strictly necessary, we encourage you to use them and to make an effort to name your rules in a way that makes their purpose easy to understand, as rule names are one of the main ways to interact with the workflow. The shell section (or directive) contains the shell commands that will convert the text in the input file to upper case and send it to the output file. In the shell command string, we can refer to elements of the rule via curly brackets. Here, we refer to the output file by specifying {output} and to the input file by specifying {input}. If you're not very familiar with Bash, this particular command can be read as "send the contents of a.txt to the program tr, which will convert all characters in the set [a-z] to the corresponding character in the set [A-Z], and then send the output to a.upper.txt".
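If you're more at home in Python than Bash, the effect of the tr command can be sketched like this (an illustration only, not part of the workflow):

```python
# Python equivalent of: tr [a-z] [A-Z] < a.txt > a.upper.txt
# (illustration only; the workflow itself runs the shell command)
text = "This is a.txt"   # the contents we wrote to a.txt above
upper = text.upper()     # convert all lowercase letters to upper case
print(upper)             # prints: THIS IS A.TXT
```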
Now let's run our first Snakemake workflow. When a workflow is executed, Snakemake tries to generate a set of target files. Target files can be specified via the command line (or, as you will see later, in several other ways). Here we ask Snakemake to make the file a.upper.txt. We could specify the file containing our rules with -s, but since the default behaviour of Snakemake is to look for a file called Snakefile in either the working directory or in a subdirectory called workflow/, we don't need to do so here. It's good practice to first run with the flag -n (or --dry-run), which will show what Snakemake plans to do without actually running anything. You also need to specify how many cores the workflow may use, with --cores or -c; for now you only need one, so set -c 1. You can also use the flag -p to show the shell commands that will be executed, and the flag -r to show the reason for running each specific rule. snakemake --help will show you all available flags.
$ snakemake -n -c 1 -r -p a.upper.txt
Building DAG of jobs...
Job stats:
job count min threads max threads
--------------------- ------- ------------- -------------
convert_to_upper_case 1 1 1
total 1 1 1
[Mon Oct 25 16:48:43 2021]
rule convert_to_upper_case:
input: a.txt
output: a.upper.txt
jobid: 0
reason: Missing output files: a.upper.txt
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
tr [a-z] [A-Z] < a.txt > a.upper.txt
Job stats:
job count min threads max threads
--------------------- ------- ------------- -------------
convert_to_upper_case 1 1 1
total 1 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
You can see that Snakemake plans to run one job: the rule convert_to_upper_case with a.txt as input and a.upper.txt as output. The reason for doing this is that the file a.upper.txt is missing. Now execute the workflow without the -n flag and check that the contents of a.upper.txt are as expected. Then try running the same command again. What do you see? It turns out that Snakemake only reruns jobs if there have been changes to either the input files or the workflow itself. This is how Snakemake ensures that everything in the workflow is up to date. We will get back to this shortly.
What if we ask Snakemake to generate the file b.upper.txt?
$ snakemake -n -c 1 -r -p b.upper.txt
Building DAG of jobs...
MissingRuleException:
No rule to produce b.upper.txt (if you use input functions make sure that they don't raise unexpected exceptions).
That didn't work well. We could copy the rule to make a similar one for b.txt, but that would be a bit cumbersome. This is where named wildcards come in: one of the most powerful features of Snakemake. Simply change the input from input: "a.txt" to input: "{some_name}.txt" and the output to output: "{some_name}.upper.txt". Now try asking for b.upper.txt again.
Tada! What happens here is that Snakemake looks at all the rules it has available (actually only one in this case) and tries to assign values to all wildcards so that the targeted files can be generated. In this case it was quite simple: you can see that it says wildcards: some_name=b, but for large workflows with multiple wildcards it can get much more complex. Named wildcards are what enable a workflow (or single rules) to be efficiently generalized and reused between projects or shared between people.
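If you're curious how this wildcard assignment can work, here is a rough sketch in plain Python using regular expressions. This is a simplification for illustration, not Snakemake's actual implementation (the real matching is more involved):

```python
import re

def match_wildcards(pattern, filename):
    """Sketch of how a pattern like '{some_name}.upper.txt' can be matched
    against a target filename (a simplification of what Snakemake does)."""
    # Split the pattern on {wildcard} placeholders, turn each placeholder
    # into a named regex group, and escape everything else literally.
    parts = re.split(r"(\{\w+\})", pattern)
    regex = "".join(
        f"(?P<{part[1:-1]}>.+)" if part.startswith("{") else re.escape(part)
        for part in parts
    )
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None

print(match_wildcards("{some_name}.upper.txt", "b.upper.txt"))
# prints: {'some_name': 'b'}
```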
It seems we have the first part of our workflow working, so now it's time to make the second rule for concatenating the outputs from convert_to_upper_case. The rule structure will be similar; the only difference is that here we have two inputs instead of one. This can be expressed in two ways, either with named inputs like this:
input:
    firstFile="...",
    secondFile="..."
shell:
    """
    some_function {input.firstFile} {input.secondFile}
    """
Or with indexes like this:
input:
    "...",
    "..."
shell:
    """
    some_function {input[0]} {input[1]}
    """
Attention!
If you have multiple inputs or outputs they need to be delimited with a comma (as seen above). Forgetting the comma is a very common mistake when writing Snakemake workflows. The parser will complain, but sometimes the error message can be difficult to interpret.
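The curly-bracket placeholders in shell commands follow Python's str.format conventions, which is why both attribute access ({input.firstFile}) and indexing ({input[0]}) work. A sketch of the substitution, using a plain list as a stand-in for a rule's input (a simplification: Snakemake's own formatting does a bit more, e.g. expanding a bare {input} to a space-separated string):

```python
# The {input[0]}-style placeholders are substituted much like Python's
# str.format. Here a plain list stands in for the rule's input files.
inputs = ["first_file.txt", "second_file.txt"]
cmd = "some_function {input[0]} {input[1]}".format(input=inputs)
print(cmd)  # prints: some_function first_file.txt second_file.txt
```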
Now try to construct this rule yourself and name it concatenate_a_and_b. The syntax for concatenating two files in Bash is cat first_file.txt second_file.txt > output_file.txt. Call the output c.txt. Run the workflow in Snakemake and validate that the output looks as expected.
Wouldn't it be nice if our workflow could be used for any files, not just a.txt and b.txt? We can achieve this by using named wildcards (or in other ways, as we will discuss later). As we've mentioned, Snakemake looks at all the rules it has available and tries to assign values to all wildcards so that the targeted files can be generated. We therefore have to name the output file in a way that also contains information about which input files it should be based on. Try to figure out how to do this yourself. If you're stuck you can look at the spoiler below, but spend some time on it before you do. Also rename the rule to concatenate_files to reflect its new, more general use.
Click to show
rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        "{first}.upper.txt",
        "{second}.upper.txt"
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """
We can now control which input files to use by the name of the file we ask Snakemake to generate. Run the workflow without the flag -n (or --dry-run) to execute both rules, providing one core with -c 1 (or --cores 1):
$ snakemake a_b.txt -c 1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
--------------------- ------- ------------- -------------
concatenate_files 1 1 1
convert_to_upper_case 2 1 1
total 3 1 1
Select jobs to execute...
[Mon Oct 25 16:51:52 2021]
rule convert_to_upper_case:
input: b.txt
output: b.upper.txt
jobid: 2
wildcards: some_name=b
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
[Mon Oct 25 16:51:53 2021]
Finished job 2.
1 of 3 steps (33%) done
Select jobs to execute...
[Mon Oct 25 16:51:53 2021]
rule convert_to_upper_case:
input: a.txt
output: a.upper.txt
jobid: 1
wildcards: some_name=a
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
[Mon Oct 25 16:51:53 2021]
Finished job 1.
2 of 3 steps (67%) done
Select jobs to execute...
[Mon Oct 25 16:51:53 2021]
rule concatenate_files:
input: a.upper.txt, b.upper.txt
output: a_b.txt
jobid: 0
wildcards: first=a, second=b
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
[Mon Oct 25 16:51:53 2021]
Finished job 0.
3 of 3 steps (100%) done
Neat!
Tip
You can name a file whatever you want in a Snakemake workflow, but you will find that everything falls into place much more nicely if the filename reflects the file's path through the workflow, e.g. sample_a.trimmed.deduplicated.sorted.bam.
The input to a Snakemake rule has to be a string or a list of strings. However, you don't have to specify these strings directly in the input: section of a rule. Instead, you can specify Python functions that return strings or lists of strings. This allows you to supply input to rules that can vary depending on the wildcards being used. We'll get to why that's useful in a sec, but first let's put it to use for the concatenate_files rule. Because Snakemake is based on Python we can mix rule definitions with standard Python code in the same file. Add a function just above the concatenate_files rule that looks like this:
def concat_input(wildcards):
    files = [wildcards.first + ".upper.txt", wildcards.second + ".upper.txt"]
    return files
This is the syntax to define a function in Python. The def concat_input(wildcards): line gives the name of the function (concat_input) and the variable passed to it (the wildcards object). In the second line we create a list called files with two items, adding the '.upper.txt' suffix to each wildcard value. Finally, the function returns the list. Because the concatenate_files rule has two wildcards, {first} and {second}, we can access the actual strings in the wildcards object using wildcards.first and wildcards.second. When we ask for the file a_b.txt, then wildcards.first == 'a' and wildcards.second == 'b'. This means that the files list returned by the function will be ['a.upper.txt', 'b.upper.txt']. To see for yourself, you can add the following line to the function, just before the return statement: print(wildcards.first, wildcards.second, files). This way the wildcard values and the list will be printed to the terminal when you run Snakemake.
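You can also try the function outside of Snakemake. Snakemake passes a wildcards object with one attribute per wildcard; in plain Python, types.SimpleNamespace can stand in for it (a test sketch, not something to put in your Snakefile):

```python
from types import SimpleNamespace

def concat_input(wildcards):
    files = [wildcards.first + ".upper.txt", wildcards.second + ".upper.txt"]
    return files

# SimpleNamespace mimics the wildcards object Snakemake would pass in.
wc = SimpleNamespace(first="a", second="b")
print(concat_input(wc))  # prints: ['a.upper.txt', 'b.upper.txt']
```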
Now that we’ve defined the function to use as input, we can use it in
the concatenate_files
rule. Update the rule so that it
looks like this:
rule concatenate_files:
    output:
        "{first}_{second}.txt"
    input:
        concat_input
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """
You can see that the name of the function, concat_input, is used in place of the input strings. Note that we supply just the function name, without calling it (simply concat_input, no parentheses or arguments); Snakemake will call the function for us, and the function has to be defined to accept a single argument (here def concat_input(wildcards):).
Let's run the workflow with the updated rule. Remove the file a_b.txt or add -f to the Snakemake command to force a re-run:
snakemake a_b.txt -c 1 -f
If you added the print statement to the function you should see the following printed to your terminal:
Building DAG of jobs...
a b ['a.upper.txt', 'b.upper.txt']
Followed by the rest of the workflow output.
There are a number of possible use cases for input functions. For example, say that you have an experiment where you've sequenced three samples: sample1, sample2 and sample3, with the corresponding FASTQ files under data/, and you want to write a rule that outputs the statistics of all sequences within each sample. However, sample1 and sample2 have been sequenced with single-end technology, while sample3 has paired-end reads. The single-end samples will have only one FASTQ file, whereas the paired-end sample will have two (one for each sequenced end). Thus, depending on the name of the sample, the input to the rule will be either one file or two. With input functions we can write a generalized rule that can handle both types:
def fastq_input(wildcards):
    if wildcards.sample_id in ["sample1", "sample2"]:
        return "data/" + wildcards.sample_id + ".fastq.gz"
    else:
        return ["data/" + wildcards.sample_id + ".R1.fastq.gz",
                "data/" + wildcards.sample_id + ".R2.fastq.gz"]

rule fastq_stats:
    output:
        "{sample_id}.stats.txt"
    input:
        fastq_input
    shell:
        """
        seqtk comp {input} > {output}
        """
As you can see, the fastq_stats rule outputs one file, {sample_id}.stats.txt, and takes as input the value returned from the fastq_input function. In this function the sample id is evaluated, and if it is either sample1 or sample2 (our single-end samples) the function returns a single string: the path to the FASTQ file for that sample. Otherwise, the function returns a list containing both the R1 and R2 files for the sample. In the shell: directive of the rule, the seqtk comp command is run on the input and the result is sent to the output file.
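As with concat_input, you can check the branching logic outside of Snakemake by passing a stand-in for the wildcards object (illustration only):

```python
from types import SimpleNamespace

def fastq_input(wildcards):
    # Single-end samples get one FASTQ file, paired-end samples get two.
    if wildcards.sample_id in ["sample1", "sample2"]:
        return "data/" + wildcards.sample_id + ".fastq.gz"
    else:
        return ["data/" + wildcards.sample_id + ".R1.fastq.gz",
                "data/" + wildcards.sample_id + ".R2.fastq.gz"]

print(fastq_input(SimpleNamespace(sample_id="sample1")))
# prints: data/sample1.fastq.gz
print(fastq_input(SimpleNamespace(sample_id="sample3")))
# prints: ['data/sample3.R1.fastq.gz', 'data/sample3.R2.fastq.gz']
```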
Quick recap
In this section we’ve learned:
- How a simple Snakemake rule looks.
- How to define target files when executing a workflow.
- How to use named wildcards for writing generic and flexible rules.
- How to use input functions in rules.