Snakemake 8: Targets
We’ve mentioned that Snakemake rules take either strings or a list of strings as input, and that we can use any Python expression in Snakemake workflows. Here we’ll show how these features help us condense the code of rules.
Consider the rule align_to_genome
below.
rule align_to_genome:"""
Align a fastq file to a genome index using Bowtie 2.
"""
output:"results/bam/{sample_id}.bam"
input:
"data/{sample_id}.fastq.gz",
"results/bowtie2/NCTC8325.1.bt2",
"results/bowtie2/NCTC8325.2.bt2",
"results/bowtie2/NCTC8325.3.bt2",
"results/bowtie2/NCTC8325.4.bt2",
"results/bowtie2/NCTC8325.rev.1.bt2",
"results/bowtie2/NCTC8325.rev.2.bt2"
shell:"""
bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output}
"""
Here we have seven inputs; the FASTQ file with the reads and six files with similar file names from the Bowtie2 genome indexing. Instead of writing all the filenames we can tidy this up by using a Python expression to generate a list of these files instead. If you’re familiar with Python you could do this with list comprehensions Links to an external site. like this:
input:
"data/{sample_id}.fastq.gz",
f"results/bowtie2/NCTC8325.{substr}.bt2" for
[in ["1", "2", "3", "4", "rev.1", "rev.2"]] substr
This will take the elements of the list of substrings one by one, and
insert that element in the place of {substr}
. Since this
type of aggregating rules are quite common, Snakemake also has a more
compact way of achieving the same thing.
input:
"data/{sample_id}.fastq.gz",
"results/bowtie2/NCTC8325.{substr}.bt2",
expand(= ["1", "2", "3", "4", "rev.1", "rev.2"]) substr
Important!
When using expand() like this,substr
is not a wildcard because it is resolved to the values explicitly given inside the expand expression.
Now change in the rules index_genome
and
align_to_genome
to use the expand()
expression.
In the workflow we decide which samples to run by including the SRR
ids in the names of the inputs to the rules multiqc
and
generate_count_table
:
rule generate_count_table:
output:"results/tables/counts.tsv"
input:
= ["results/bam/SRR935090.sorted.bam",
bams "results/bam/SRR935091.sorted.bam",
"results/bam/SRR935092.sorted.bam"],
...
rule multiqc:
output:= "results/multiqc/multiqc.html",
html = "results/multiqc/multiqc_general_stats.txt"
stats input:
"results/fastqc/SRR935090_fastqc.zip",
"results/fastqc/SRR935091_fastqc.zip",
"results/fastqc/SRR935092_fastqc.zip"
The output files from these two rules,
results/multiqc.html
and
results/tables/counts.tsv
, are in turn specified as input
to the all
rule at the top of the file. Because the first
rule is targeted by default when we run Snakemake on the command line
(like we mentioned in snakemake-4-the-mrsa-workflow)
this is what triggers the rules to run on each of the three samples.
However, this is a potential source of errors since it’s easy to
change in one place and forget to change in the other. Because we can
use Python code “everywhere” let’s instead define a list of sample ids
and put at the very top of the Snakefile, just before the rule
all
:
= ["SRR935090", "SRR935091", "SRR935092"] SAMPLES
Now use expand()
in multiqc
and
generate_count_table
to use SAMPLES
for the
sample ids. For the multiqc
rule it could look like
this:
input:
"results/fastqc/{sample_id}_fastqc.zip", sample_id = SAMPLES) expand(
See if you can update the generate_count_table
rule in
the same manner!
Quick recap
In this section we’ve learned:
- How to use the
expand()
expression to create a list with file names, inserting all provided wildcard values.