Snakemake 5: Parameters
In a typical bioinformatics project, considerable efforts are spent
on tweaking parameters for the various programs involved. It would be
inconvenient if you had to change in the shell scripts themselves every
time you wanted to run with a new setting. Luckily, there is a better
option for this: the params
keyword.
rule some_rule:
output:"..."
input:
"..."
params:=2.5
cutoff
shell:"""
some_program --cutoff {params.cutoff} {input} {output}
"""
We run most of the programs with default settings in our workflow.
However, there is one parameter in the rule
get_SRA_by_accession
that we use for determining how many
reads we want to retrieve from SRA for each sample
(-X 25000
). Change in this rule to use the parameter
max_reads
instead and set the value to 20000. If you need
help, click to show the solution below.
Click to show
rule get_SRA_by_accession:"""
Retrieve a single-read FASTQ file from SRA (Sequence Read Archive) by run accession number.
"""
output:"data/raw_internal/{sample_id}.fastq.gz"
params:= 20000
max_reads
shell:"""
fastq-dump {wildcards.sample_id} -X {params.max_reads} --readids \
--dumpbase --skip-technical --gzip -Z > {output}
"""
Now run through the workflow. Because there’s been changes to the
get_SRA_by_accession
rule this will trigger a re-run of the
rule for all three accessions. In addition all downstream rules that
depend on output from get_SRA_by_accession
are re-run.
The parameter values we set in the params
section don’t
have to be static, they can be any Python expression. In particular,
Snakemake provides a global dictionary of configuration parameters
called config
. Let’s modify
get_SRA_by_accession
to look something like this in order
to make use of this dictionary:
rule get_SRA_by_accession:"""
Retrieve a single-read FASTQ file from SRA (Sequence Read Archive) by run accession number.
"""
output:"data/raw_internal/{sample_id}.fastq.gz"
params:= config["max_reads"]
max_reads
shell:"""
fastq-dump {wildcards.sample_id} -X {params.max_reads} --readids \
--dumpbase --skip-technical --gzip -Z > {output}
"""
Note that Snakemake now expects there to be a key named
max_reads
in the config dictionary. If we don’t populate
the dictionary somehow the dictionary will be empty so if you were to
run the workflow now it would trigger a KeyError
(try
running snakemake -s snakefile_mrsa.smk -n
to see for
yourself). In order to populate the config dictionary with data for the
workflow we could use the snakemake --config KEY=VALUE
syntax directly from the command line. However, from a reproducibility
perspective, it’s not optimal to set parameters from the command line,
since it’s difficult to keep track of which parameter values that were
used.
A much better alternative is to use the
--configfile FILE
option to supply a configuration file to
Snakemake. In this file we can collect all the project-specific
settings, sample ids and so on. This also enables us to write the
Snakefile in a more general manner so that it can be better reused
between projects. Like several other files used in these tutorials, this
file should be in yaml
format
Links to an external site.. Create the file below and save it as
config.yml
.
max_reads: 25000
If we now run Snakemake with --configfile config.yml
, it
will parse this file to form the config
dictionary. If you
want to overwrite a parameter value, e.g. for testing, you can
still use the --config KEY=VALUE
flag, as in
--config max_reads=1000
.
Tip
Rather than supplying the config file from the command line you could also add the lineconfigfile: "config.yml"
to the top of your Snakefile. Keep in mind that with such a setup Snakemake will complain if the fileconfig.yml
is not present.
Quick recap
In this section we’ve learned:
- How to set parameter values with the
params
directive.- How to run Snakemake with the
config
variable and with a configuration file.