Snakemake 4: The MRSA Workflow
As you might remember from the intro, we are attempting to understand how lytic bacteriophages can be used as a future therapy for the multi-resistant bacterium MRSA (methicillin-resistant Staphylococcus aureus). In order to do this we have performed RNA-seq of three strains, one test and two controls. We have already set up a draft Snakemake workflow for the RNA-seq analysis and it seems to be running nicely. The rest of the Snakemake tutorial will be spent improving this workflow and making it more flexible!
Tip
This section will leave a little more up to you compared to the previous one. If you get stuck at some point, the final workflow after all the modifications is available in tutorials/git/Snakefile.
You are probably already in your snakemake-env environment; otherwise, activate it (use mamba info --envs if you are unsure).
Tip
Here we have one Conda environment for executing the whole Snakemake workflow. Snakemake also supports using explicit Conda environments on a per-rule basis, by specifying something like conda: rule-specific-env.yml in the rule definition and running Snakemake with the --use-conda flag. The given rule will then be run in the Conda environment specified in rule-specific-env.yml, which Snakemake will create and activate on the fly. Note that by default Snakemake uses mamba to generate the rule-specific environments. This behaviour can be changed by running with --conda-frontend conda, which will force Snakemake to use conda instead.
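As a sketch of what this could look like, here is a hypothetical rule using a per-rule environment. The rule, the file names and the command are illustrative only, not taken from the MRSA workflow:

```
rule fastqc:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/fastqc/{sample}_fastqc.html"
    conda:
        "envs/fastqc-env.yml"  # hypothetical environment file listing e.g. fastqc
    shell:
        "fastqc {input} -o results/fastqc"
```

Running Snakemake with --use-conda would then create the environment from envs/fastqc-env.yml and activate it before executing the rule.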
Let’s start by generating the rule graph so that we get an overview of the workflow. Here we have to specify the file with the rules using the -s flag to Snakemake, since the path to the file differs from the default.
snakemake -s snakefile_mrsa.smk --rulegraph | dot -T png > rulegraph_mrsa.png
There’s another difference in this command compared to the one we’ve used before, namely that we don’t define a target. In the toy example we used a_b.txt as a target, and the wildcards were resolved based on that. Why don’t we need to do that here? It turns out that by default Snakemake targets the first rule in a workflow. By convention, we call this rule all and let it serve as a rule for aggregating the main outputs of the workflow.
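Such an aggregation rule might look something like the sketch below. The inputs shown here are the final outputs listed in the log further down; the actual rule in snakefile_mrsa.smk may be written slightly differently:

```
rule all:
    input:
        "results/tables/counts.tsv",
        "results/multiqc/multiqc.html",
        "results/rulegraph.png"
```

Because rule all has only inputs and no outputs or commands, asking for it simply forces Snakemake to produce all of the listed files.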
Now take some time and look through the workflow file and try to understand how the rules fit together. Use the rule graph as aid. The rules represent a quite standard, although somewhat simplified, workflow for RNA-seq analysis. If you are unfamiliar with the purpose of the different operations (index genome, FastQC and so on), then take a look at the intro.
Also generate the job graph in the same manner. Here you can see that three samples will be downloaded: SRR935090, SRR935091, and SRR935092. The original sample files contain tens of millions of reads, but for the purpose of this course we have sub-sampled them to 100,000 reads per sample so that they are easy to manage, and made them available at the SciLifeLab Data Repository. These FASTQ files will then be quality controlled with FastQC and aligned to a genome. The QC output will be aggregated with MultiQC and the alignments will be used to generate a count table, i.e. a table that shows how many reads map to each gene for each sample. This count table is then what the downstream analysis will be based on.
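To generate the job graph mentioned above, swap --rulegraph for --dag in the earlier command (the output file name is just a suggestion):

```
snakemake -s snakefile_mrsa.smk --dag | dot -T png > jobgraph_mrsa.png
```

The job graph shows one node per job rather than one node per rule, so rules that run once per sample appear three times.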
Now try to run the whole workflow. Hopefully you will see something like this:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
-------------------- ------- ------------- -------------
align_to_genome 3 1 1
all 1 1 1
fastqc 3 1 1
generate_count_table 1 1 1
generate_rulegraph 1 1 1
get_SRA_by_accession 3 1 1
get_genome_fasta 1 1 1
get_genome_gff3 1 1 1
index_genome 1 1 1
multiqc 1 1 1
sort_bam 3 1 1
total 19 1 1
Select jobs to execute...
[Mon Oct 25 17:13:47 2021]
rule get_genome_fasta:
output: data/ref/NCTC8325.fa.gz
jobid: 6
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
--2021-10-25 17:13:48-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
=> ‘data/ref/NCTC8325.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.75
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.75|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
.
.
[lots of stuff]
.
.
localrule all:
input: results/tables/counts.tsv, results/multiqc/multiqc.html, results/rulegraph.png
jobid: 0
resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
[Mon Oct 25 17:14:38 2021]
Finished job 0.
19 of 19 steps (100%) done
After everything is done, the workflow will have resulted in a bunch of files in the directories data/ and results/. Take some time to look through the structure, in particular the quality control reports in results/multiqc and the count table in results/tables.
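To get a feel for the count table format, here is a minimal Python sketch that sums the reads per sample. The table below is a toy stand-in with made-up gene names; the real table lives in results/tables/counts.tsv and its exact layout may differ:

```python
import csv
import io

# Toy stand-in for results/tables/counts.tsv (gene names are made up;
# the sample accessions match those downloaded by the workflow).
toy_counts = """gene\tSRR935090\tSRR935091\tSRR935092
gene_A\t10\t5\t7
gene_B\t0\t3\t1
"""

reader = csv.reader(io.StringIO(toy_counts), delimiter="\t")
header = next(reader)
rows = list(reader)

# Total reads counted per sample (column sums).
totals = {sample: 0 for sample in header[1:]}
for row in rows:
    for sample, value in zip(header[1:], row[1:]):
        totals[sample] += int(value)

print(totals)  # {'SRR935090': 10, 'SRR935091': 8, 'SRR935092': 8}
```

In a real analysis you would of course load the full table with something like pandas, but the structure is the same: one row per gene, one column per sample.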
Quick recap
In this section we’ve learned:
- How the MRSA workflow looks.
- How to run the MRSA workflow.
- Which output files the MRSA workflow produces.