Running commands with Snakemake
Last updated on 2024-02-01
Estimated time: 60 minutes
Overview
Questions
- “How do I run a simple command with Snakemake?”
Objectives
- “Create a Snakemake recipe (a Snakefile)”
What is the workflow I’m interested in?
In this lesson we will run an experiment that takes an application which runs in parallel and investigate its scalability. To do that we will need to gather data; in this case, that means running the application multiple times with different numbers of CPU cores and recording the execution time. Once we've done that, we need to create a visualisation of the data to see how it compares against the ideal case.
From the visualisation we can then decide at what scale it makes most sense to run the application in production to maximise the use of our CPU allocation on the system.
We could do all of this manually, but there are useful tools to help us manage data analysis pipelines like we have in our experiment. Today we’ll learn about one of those: Snakemake.
In order to get started with Snakemake, let's begin by taking a simple command and seeing how we can run that via Snakemake. Let's choose the command hostname, which prints out the name of the host where the command is executed:
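BASH
# run directly in the shell on the login node
[ocaisa@node1 ~]$ hostname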
OUTPUT
node1.int.jetstream2.hpc-carpentry.org
That prints out the result, but Snakemake relies on files to know the status of your workflow, so let's redirect the output to a file:
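BASH
[ocaisa@node1 ~]$ hostname > hostname_login.txt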
Making a Snakefile
Edit a new text file named Snakefile.
Contents of Snakefile:
PYTHON
rule hostname_login:
    output: "hostname_login.txt"
    input:
    shell:
        "hostname > hostname_login.txt"
Key points about this file
- The file is named Snakefile - with a capital S and no file extension.
- Some lines are indented. Indents must be with space characters, not tabs. See the setup section for how to make your text editor do this.
- The rule definition starts with the keyword rule followed by the rule name, then a colon.
- We named the rule hostname_login. You may use letters, numbers or underscores, but the rule name must begin with a letter and may not be a keyword.
- The keywords input, output and shell are all followed by a colon.
- The file names and the shell command are all in "quotes".
- The output filename is given before the input filename. In fact, Snakemake doesn't care what order they appear in, but we give the output first throughout this course. We'll see why soon.
- In this use case there is no input file for the command so we leave this blank.
Back in the shell we’ll run our new rule. At this point, if there were any missing quotes, bad indents, etc. we may see an error.
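BASH
# a sketch of the invocation; the -j1 and -p options are discussed just below
[ocaisa@node1 ~]$ snakemake -j1 -p hostname_login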
OUTPUT
bash: snakemake: command not found...
If your shell tells you that it cannot find the command snakemake then we need to make the software available somehow. In our case, this means searching for the module that we need to load:
OUTPUT
[ocaisa@node1 ~]$ module spider snakemake
--------------------------------------------------------------------------------------------------------
snakemake:
--------------------------------------------------------------------------------------------------------
Versions:
snakemake/8.2.1-foss-2023a
snakemake/8.2.1 (E)
Names marked by a trailing (E) are extensions provided by another module.
--------------------------------------------------------------------------------------------------------
For detailed information about a specific "snakemake" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider snakemake/8.2.1
--------------------------------------------------------------------------------------------------------
Now we want the module, so let's load that to make the package available, and then make sure we have the snakemake command available:
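BASH
# load the module found above, then check the command is on our PATH
[ocaisa@node1 ~]$ module load snakemake
[ocaisa@node1 ~]$ which snakemake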
OUTPUT
/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/snakemake/8.2.1-foss-2023a/bin/snakemake
Running Snakemake
Run snakemake --help | less to see the help for all available options. What does the -p option in the snakemake command above do?
1. Protects existing output files
2. Prints the shell commands that are being run to the terminal
3. Tells Snakemake to only run one process at a time
4. Prompts the user for the correct input file
Hint: you can search in the text by pressing /, and quit back to the shell with q.
The answer is 2: it prints the shell commands that are being run to the terminal. This is such a useful thing we don't know why it isn't the default! The -j1 option is what tells Snakemake to only run one process at a time, and we'll stick with this for now as it makes things simpler. Answer 4 is a total red herring, as Snakemake never prompts interactively for user input.
Running Snakemake on the cluster
Last updated on 2024-02-01
Estimated time: 50 minutes
Overview
Questions
- “How do I run my Snakemake rule on the cluster?”
Objectives
- “Define rules to run locally and on the cluster”
What happens when we want to make our rule run on the cluster rather than the login node? The cluster we are using uses Slurm, and it happens that Snakemake has built-in support for Slurm; we just need to tell it that we want to use it.
Snakemake uses the executor option to allow you to select the plugin that will execute the rules. The quickest way to apply this to your Snakefile is to define it on the command line. Let's try it out:
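BASH
# an assumed invocation: select the Slurm executor plugin on the command line,
# reusing the hostname_login target we built earlier
[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_login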
OUTPUT
Building DAG of jobs...
Retrieving input from storage.
Nothing to be done (all requested files are present and up to date).
Nothing happened! Why not? When it is asked to build a target, Snakemake checks the ‘last modification time’ of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Snakemake knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
The rule is almost identical to the previous rule save for the rule name and output file:
PYTHON
rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > hostname_remote.txt"
You can then execute the rule with
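BASH
[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_remote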
OUTPUT
Building DAG of jobs...
Retrieving input from storage.
Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
Provided remote nodes: 1
Job stats:
job count
--------------- -------
hostname_remote 1
total 1
Select jobs to execute...
Execute 1 jobs...
[Mon Jan 29 18:03:46 2024]
rule hostname_remote:
output: hostname_remote.txt
jobid: 0
reason: Missing output files: hostname_remote.txt
resources: tmpdir=<TBD>
hostname > hostname_remote.txt
No SLURM account given, trying to guess.
Guessed SLURM account: def-users
No wall time information given. This might or might not work on your cluster. If not, specify the resource runtime in your rule or as a reasonable default via --default-resources.
No job memory information ('mem_mb' or 'mem_mb_per_cpu') is given - submitting without. This might or might not work on your cluster.
Job 0 has been submitted with SLURM jobid 326 (log: /home/ocaisa/.snakemake/slurm_logs/rule_hostname_remote/326.log).
[Mon Jan 29 18:04:26 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-01-29T180346.788174.snakemake.log
Note all the warnings that Snakemake is giving us about the fact that the rule may not be able to execute on our cluster as we may not have given enough information. Luckily for us, this actually works on our cluster, and we can take a look in the output file the new rule creates, hostname_remote.txt:
OUTPUT
tmpnode1.int.jetstream2.hpc-carpentry.org
Snakemake profile
Adapting Snakemake to a particular environment can entail many flags and options. Therefore, it is possible to specify a configuration profile to be used to obtain default options. This looks like
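BASH
# the folder name here is up to you; Snakemake reads default options from a
# config.yaml inside the named folder
[ocaisa@node1 ~]$ snakemake --profile myprofileFolder ...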
The profile folder must contain a file called config.yaml, which is what will store our options. The folder may also contain other files necessary for the profile. Let's create the file cluster_profile/config.yaml and insert some of our existing options:
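YAML
# a sketch of cluster_profile/config.yaml; it mirrors the command-line options
# we have been using (the jobs count is an assumption)
printshellcmds: True
jobs: 3
executor: slurm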
We should now be able to rerun our workflow by pointing to the profile rather than listing out the options. To force our workflow to rerun, we first need to remove the output file hostname_remote.txt, and then we can try out our new profile:
BASH
[ocaisa@node1 ~]$ rm hostname_remote.txt
[ocaisa@node1 ~]$ snakemake --profile cluster_profile hostname_remote
The profile is extremely useful in the context of our cluster, as the Slurm executor has lots of options, and sometimes you need to use them to be able to submit jobs to the cluster you have access to. Unfortunately, the names of the options in Snakemake are not exactly the same as those of Slurm, so we need the help of a translation table:
SLURM | Snakemake | Description
---|---|---
--partition | slurm_partition | the partition a rule/job is to use
--time | runtime | the walltime per job in minutes
--constraint | constraint | may hold features on some clusters
--mem | mem, mem_mb | memory a cluster node must provide (mem: string with unit, mem_mb: int)
--mem-per-cpu | mem_mb_per_cpu | memory per reserved CPU
--ntasks | tasks | number of concurrent tasks / ranks
--cpus-per-task | cpus_per_task | number of cpus per task (in case of SMP, rather use threads)
--nodes | nodes | number of nodes
The warnings given by Snakemake hinted that we may need to provide these options. One way to do it is to provide them as part of the Snakemake rule using the keyword resources, e.g.,
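PYTHON
# a sketch reusing our existing rule; the resource names come from the
# translation table above, and the values are examples
rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    resources:
        runtime=2  # walltime in minutes
    shell:
        "hostname > hostname_remote.txt"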
and we can also use the profile to define default values for these options to use with our project, using the keyword default-resources. For example, the available memory on our cluster is about 4GB per core, so we can add that to our profile:
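YAML
# additional entries for cluster_profile/config.yaml; the exact values are
# assumptions (3600 MB leaves some headroom within the 4 GB per core)
default-resources:
  mem_mb_per_cpu: 3600
  runtime: 2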
There are various sbatch options not directly supported by the resource definitions in the table above. You may use the slurm_extra resource to specify any of these additional flags to sbatch:
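PYTHON
# a sketch: slurm_extra is handed to sbatch verbatim; the mail flags below are
# only an example
rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    resources:
        slurm_extra="'--mail-type=ALL --mail-user=someone@example.org'"
    shell:
        "hostname > hostname_remote.txt"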
Local rule execution
Our initial rule was to get the hostname of the login node. We always want to run that rule on the login node for that to make sense. If we tell Snakemake to run all rules via the Slurm executor (which is what we are doing via our new profile) this won’t happen any more. So how do we force the rule to run on the login node?
Well, in the case where a Snakemake rule performs a trivial task, job submission might be overkill (e.g., less than 1 minute worth of compute time). Similar to our case, it would be a better idea to have these rules execute locally (i.e. where the snakemake command is run) instead of as a job. Snakemake lets you indicate which rules should always run locally with the localrules keyword. Let's define hostname_login as a local rule near the top of our Snakefile.
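PYTHON
# near the top of the Snakefile
localrules: hostname_login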
Placeholders
Last updated on 2024-02-01
Estimated time: 70 minutes
Overview
Questions
- “How do I make a generic rule?”
Objectives
- “See how Snakemake deals with some errors”
Our Snakefile has some duplication. For example, the names of text files are repeated in places throughout the Snakefile rules. Snakefiles are a form of code and, in any code, repetition can lead to problems (e.g. we rename a data file in one part of the Snakefile but forget to rename it elsewhere).
D.R.Y. (Don’t Repeat Yourself)
In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python, R, or Java, such as user-defined variables and functions are useful in part because they mean we don’t have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the “Don’t Repeat Yourself” principle or D.R.Y.
Let us set about removing some of the repetition from our Snakefile.
Placeholders
To make a more general-purpose rule we need placeholders. Let’s take a look at what a placeholder looks like
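PYTHON
# a sketch of the generalised rule; what the placeholder does is explained
# just below
rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > {output}"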
As a reminder, here’s the previous version from the last episode:
PYTHON
rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > hostname_remote.txt"
The new rule has replaced explicit file names with things in {curly brackets}, specifically {output} (but it could also have been {input}…if that had a value and were useful).
{input} and {output} are placeholders
Placeholders are used in the shell section of a rule, and Snakemake will replace them with appropriate values - {input} with the full name of the input file, and {output} with the full name of the output file - before running the command.
{resources} is also a placeholder, and we can access a named element of {resources} with the notation {resources.runtime} (for example).
MPI applications and Snakemake
Last updated on 2024-02-01
Estimated time: 50 minutes
Overview
Questions
- “How do I run an MPI application via Snakemake on the cluster?”
Objectives
- “Define rules to run locally and on the cluster”
Now it's time to start getting back to our real workflow. We can execute a command on the cluster, but what about executing the MPI application we are interested in? Our application is called amdahl and is available as an environment module.
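BASH
[ocaisa@node1 ~]$ module load amdahl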
will locate and then load the amdahl module. We can then update/replace our rule to run the amdahl application:
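PYTHON
# run a single-rank job for now; the shell line matches the command seen in
# the error log below
rule amdahl_run:
    output: "amdahl_run.txt"
    input:
    shell:
        "mpiexec -n 1 amdahl > {output}"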
However, when we try to execute the rule we get an error (unless you already have a different version of amdahl available in your path). Snakemake reports the location of the logs, and if we look inside we can (eventually) find:
OUTPUT
...
mpiexec -n 1 amdahl > amdahl_run.txt
--------------------------------------------------------------------------
mpiexec was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpiexec command
line parameter option (remember that mpiexec interprets the first
unrecognized command line token as the executable).
Node: tmpnode1
Executable: amdahl
--------------------------------------------------------------------------
...
So, even though we loaded the module before running the workflow, our Snakemake rule didn’t find the executable. That’s because the Snakemake rule is running in a clean runtime environment, and we need to somehow tell it to load the necessary environment module before trying to execute the rule.
Snakemake and environment modules
Our application is called amdahl and is available on the system via an environment module, so we need to tell Snakemake to load the module before it tries to execute the rule. Snakemake is aware of environment modules, and these can be specified via (yet another) option:
PYTHON
rule amdahl_run:
    output: "amdahl_run.txt"
    input:
    envmodules:
        "mpi4py",
        "amdahl"
    shell:
        "mpiexec -n 1 amdahl > {output}"
Adding these lines is not enough to make the rule execute, however. Not only do you have to tell Snakemake what modules to load, but you also have to tell it to use environment modules in general (since the use of environment modules is considered to make your runtime environment less reproducible, as the available modules may differ from cluster to cluster). This requires you to give Snakemake an additional option:
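BASH
# an assumption for Snakemake 8, where environment modules are one of the
# "software deployment methods" (earlier versions used --use-envmodules);
# check snakemake --help for your version
[ocaisa@node1 ~]$ snakemake --profile cluster_profile --sdm env-modules amdahl_run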
Snakemake and MPI
We didn’t really run an MPI application in the last section as we only ran on one core. How do we request to run on multiple cores for a single rule?
Snakemake has general support for MPI, but the only executor that currently explicitly supports MPI is the Slurm executor (lucky for us!). If we look back at our Slurm to Snakemake translation table we notice the relevant options appear near the bottom:
SLURM | Snakemake | Description
---|---|---
… | … | …
--ntasks | tasks | number of concurrent tasks / ranks
--cpus-per-task | cpus_per_task | number of cpus per task (in case of SMP, rather use threads)
--nodes | nodes | number of nodes
The one we are interested in is tasks, as we are only going to increase the number of ranks. We can define these in a resources section of our rule and refer to them using placeholders:
PYTHON
rule amdahl_run:
    output: "amdahl_run.txt"
    input:
    envmodules:
        "amdahl"
    resources:
        mpi='mpiexec',
        tasks=2
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
That worked, but now we have a bit of an issue. We want to do this for a few different values of tasks, which would mean we would need a different output file for every run. It would be great if we could somehow indicate in the output the value that we want to use for tasks…and have Snakemake pick that up.
We could use a wildcard in the output to allow us to define the tasks we wish to use. The syntax for such a wildcard looks like
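PYTHON
# an output pattern containing a wildcard
output: "amdahl_run_{parallel_tasks}.txt"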
where parallel_tasks is our wildcard.
Wildcards
Wildcards are used in the input and output lines of the rule to represent parts of filenames. Much like the * pattern in the shell, the wildcard can stand in for any text in order to make up the desired filename. As with naming your rules, you may choose any name you like for your wildcards, so here we used parallel_tasks. Using the same wildcards in the input and output is what tells Snakemake how to match input files to output files.
If two rules use a wildcard with the same name then Snakemake will treat them as different entities - rules in Snakemake are self-contained in this way.
In the shell line you can reference the wildcard with {wildcards.parallel_tasks}.
Snakemake order of operations
We’re only just getting started with some simple rules, but it’s worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases:
1. Prepares to run:
  - Reads in all the rule definitions from the Snakefile
2. Plans what to do:
  - Sees what file(s) you are asking it to make
  - Looks for a matching rule by looking at the outputs of all the rules it knows
  - Fills in the wildcards to work out the input for this rule
  - Checks that this input file (if required) is actually available
3. Runs the steps:
  - Creates the directory for the output file, if needed
  - Removes the old output file if it is already there
  - Only then, runs the shell command with the placeholders replaced
  - Checks that the command ran without errors and made the new output file as expected
The amount of checking may seem pedantic right now, but as the workflow gains more steps this will become very useful to us indeed.
Using wildcards in our rule
We would like to use a wildcard in the output to allow us to define the number of tasks we wish to use. Based on what we've seen so far, you might imagine this could look like:
PYTHON
rule amdahl_run:
    output: "amdahl_run_{parallel_tasks}.txt"
    input:
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        tasks="{parallel_tasks}"
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
but there are two problems with this:
1. The only way for Snakemake to know the value of the wildcard is for the user to explicitly request a concrete output file (rather than call the rule):
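BASH
# requesting a concrete output file; the task count 2 here is just an example
[ocaisa@node1 ~]$ snakemake --profile cluster_profile amdahl_run_2.txt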
This is perfectly valid, as Snakemake can figure out that it has a rule that can match that filename.
2. The bigger problem is that even doing that does not work; it seems we cannot use a wildcard for tasks:
OUTPUT
WorkflowError: SLURM job submission failed. The error message was sbatch: error: Invalid numeric value "{parallel_tasks}" for --ntasks.
Unfortunately for us, there is no direct way for us to access the wildcards for tasks. The reason for this is that Snakemake tries to use the value of tasks during its initialisation stage, which is before we know the value of the wildcard. We need to defer the determination of tasks until later. This can be achieved by specifying an input function instead of a value for this scenario. The solution then is to write a one-time-use function to manipulate Snakemake into doing this for us. Since the function is specifically for the rule, we can use a one-line function without a name. These kinds of functions are called either anonymous functions or lambda functions (both mean the same thing), and are a feature of Python (and other programming languages).
To define a lambda function in Python, the general syntax is as follows:
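PYTHON
# general form: lambda <arguments>: <expression>
# for example, a nameless function that adds 54 to its argument:
lambda x: x + 54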
Since our function can take the wildcards as arguments, we can use that to set the value for tasks:
PYTHON
rule amdahl_run:
    output: "amdahl_run_{parallel_tasks}.txt"
    input:
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
Now we have a rule that can be used to generate output from runs of an arbitrary number of parallel tasks.
Since our rule is now capable of generating an arbitrary number of output files, things could get very crowded in our current directory. It's probably best then to put the runs into a separate folder to keep things tidy. We can add the folder directly to our output and Snakemake will take care of directory creation for us:
PYTHON
rule amdahl_run:
    output: "runs/amdahl_run_{parallel_tasks}.txt"
    input:
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
Another thing about our application amdahl is that we ultimately want to process the output to generate our scaling plot. The output right now is useful for reading but makes processing harder. amdahl has an option that actually makes this easier for us. To see the amdahl options we can use:
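BASH
# with the amdahl module loaded
[ocaisa@node1 ~]$ amdahl --help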
OUTPUT
usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e]
options:
-h, --help show this help message and exit
-p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
Parallel proportion should be a float between 0 and 1
-w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
Total seconds of workload, should be an integer greater than 0
-t, --terse Enable terse output
-e, --exact Disable random jitter
The option we are looking for is --terse, and that will make amdahl print output in a format that is much easier to process: JSON. JSON format in a file typically uses the file extension .json, so let's add that option to our shell command and change the file format of the output to match our new command:
PYTHON
rule amdahl_run:
    output: "runs/amdahl_run_{parallel_tasks}.json"
    input:
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse > {output}"
There was another parameter for amdahl that caught my eye. amdahl has an option --parallel-proportion (or -p) which we might be interested in changing as it changes the behaviour of the code, and therefore has an impact on the values we get in our results. Let's add another directory layer to our output format to reflect a particular choice for this value. We can use a wildcard so we don't have to choose the value right away:
PYTHON
rule amdahl_run:
    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
    input:
    envmodules:
        "amdahl"
    resources:
        mpi="mpiexec",
        # No direct way to access the wildcard in tasks, so we need to do this
        # indirectly by declaring a short function that takes the wildcards as an
        # argument
        tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
Chaining rules
Last updated on 2024-02-01
Estimated time: 70 minutes
Overview
Questions
- “How do I combine rules into a workflow?”
- “How do I make a rule with multiple inputs and outputs?”
A pipeline of multiple rules
We now have a rule that can generate output for any value of -p and any number of tasks; we just need to call Snakemake with the parameters that we want:
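BASH
# e.g. one specific proportion and task count
[ocaisa@node1 ~]$ snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json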
That's not exactly convenient though: to generate a full dataset we have to run Snakemake lots of times with different output file targets. Rather than that, let's create a rule that can generate those files for us.
Chaining rules in Snakemake is a matter of choosing filename patterns that connect the rules. There’s something of an art to it - most times there are several options that will work:
PYTHON
rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input: "p_{parallel_proportion}/runs/amdahl_run_6.json"
    shell:
        "echo {input} done > {output}"
Now let's run the new rule (remember we need to request the output file by name, as the output in our rule contains a wildcard pattern):
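BASH
[ocaisa@node1 ~]$ snakemake --profile cluster_profile p_0.999_runs.txt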
OUTPUT
Using profile cluster_profile/ for setting default command line arguments.
Building DAG of jobs...
Retrieving input from storage.
Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
Provided remote nodes: 3
Job stats:
job count
------------------ -------
amdahl_run 1
generate_run_files 1
total 2
Select jobs to execute...
Execute 1 jobs...
[Tue Jan 30 17:39:29 2024]
rule amdahl_run:
output: p_0.999/runs/amdahl_run_6.json
jobid: 1
reason: Missing output files: p_0.999/runs/amdahl_run_6.json
wildcards: parallel_proportion=0.999, parallel_tasks=6
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, mem_mb_per_cpu=3600, runtime=2, mpi=mpiexec, tasks=6
mpiexec -n 6 amdahl --terse -p 0.999 > p_0.999/runs/amdahl_run_6.json
No SLURM account given, trying to guess.
Guessed SLURM account: def-users
Job 1 has been submitted with SLURM jobid 342 (log: /home/ocaisa/.snakemake/slurm_logs/rule_amdahl_run/342.log).
[Tue Jan 30 17:47:31 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Jan 30 17:47:31 2024]
localrule generate_run_files:
input: p_0.999/runs/amdahl_run_6.json
output: p_0.999_runs.txt
jobid: 0
reason: Missing output files: p_0.999_runs.txt; Input files updated by another job: p_0.999/runs/amdahl_run_6.json
wildcards: parallel_proportion=0.999
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, mem_mb_per_cpu=3600, runtime=2
echo p_0.999/runs/amdahl_run_6.json done > p_0.999_runs.txt
[Tue Jan 30 17:47:31 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-01-30T173929.781106.snakemake.log
Look at the logging messages that Snakemake prints in the terminal. What has happened here?
- Snakemake looks for a rule to make p_0.999_runs.txt
- It determines that "generate_run_files" can make this if parallel_proportion=0.999
- It sees that the input needed is therefore p_0.999/runs/amdahl_run_6.json
- Snakemake looks for a rule to make p_0.999/runs/amdahl_run_6.json
- It determines that "amdahl_run" can make this if parallel_proportion=0.999 and parallel_tasks=6
- Now that Snakemake has reached an available input file (in this case, no input file is actually required), it runs both steps to get the final output
This, in a nutshell, is how we build workflows in Snakemake.
- Define rules for all the processing steps
- Choose input and output naming patterns that allow Snakemake to link the rules
- Tell Snakemake to generate the final output file(s)
If you are used to writing regular scripts this takes a little getting used to. Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.
Outputs first?
The Snakemake approach of working backwards from the desired output to determine the workflow is why we're putting the output lines first in all our rules - to remind us that these are what Snakemake looks at first!
Many users of Snakemake, and indeed the official documentation, prefer to have the input first, so in practice you should use whatever order makes sense to you.
log outputs in Snakemake
Snakemake has a dedicated rule field for outputs that are log files, and these are mostly treated as regular outputs, except that log files are not removed if the job produces an error. This means you can look at the log to help diagnose the error. In a real workflow this can be very useful, but in terms of learning the fundamentals of Snakemake we'll stick with regular input and output fields here.
Errors are normal
Don’t be disheartened if you see errors when first testing your new Snakemake pipelines. There is a lot that can go wrong when writing a new workflow, and you’ll normally need several iterations to get things just right. One advantage of the Snakemake approach compared to regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on and potentially running junk calculations on partial or corrupted data. Another advantage is that when a step fails we can safely resume from where we left off.
Processing lists of inputs
Last updated on 2024-02-01
Estimated time: 80 minutes
Overview
Questions
- “How do I process multiple files at once?”
- “How do I combine multiple files together?”
Objectives
- “Use Snakemake to process all our samples at once”
- “Make a scalability plot that brings our results together”
We created a rule that can generate a single output file, but we're not going to create multiple rules for every output file. We want to generate all of the run files with a single rule if we can; well, Snakemake can indeed take a list of input files:
PYTHON
rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input: "p_{parallel_proportion}/runs/amdahl_run_2.json", "p_{parallel_proportion}/runs/amdahl_run_6.json"
    shell:
        "echo {input} done > {output}"
That’s great, but we don’t want to have to list all of the files we’re interested in individually. How can we do this?
Defining a list of samples to process
To do this, we can define some lists as Snakemake global variables.
Global variables should be added before the rules in the Snakefile.
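PYTHON
# the exact values are an assumption; the text below expects five task counts
NTASK_SIZES = [1, 2, 3, 4, 5]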
- Unlike with variables in shell scripts, we can put spaces around the = sign, but they are not mandatory.
- The lists of quoted strings are enclosed in square brackets and comma-separated. If you know any Python you'll recognise this as Python list syntax.
- A good convention is to use capitalized names for these variables, but this is not mandatory.
- Although these are referred to as variables, you can't actually change the values once the workflow is running, so lists defined this way are more like constants.
Using a Snakemake rule to define a batch of outputs
Now let’s update our Snakefile to leverage the new global variable and create a list of files:
PYTHON
rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
    shell:
        "echo {input} done > {output}"
The expand(...) function in this rule generates a list of filenames, by taking the first thing in the parentheses as a template and replacing {count} with all the NTASK_SIZES. Since there are 5 elements in the list, this will yield 5 files we want to make. Note that we had to protect our wildcard in a second set of curly brackets so it wouldn't be interpreted as something that needed to be expanded.
In our current case we still rely on the file name to define the value of the wildcard parallel_proportion, so we can't call the rule directly; we still need to request a specific file:
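BASH
[ocaisa@node1 ~]$ snakemake --profile cluster_profile p_0.999_runs.txt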
If you don’t specify a target rule name or any file names on the command line when running Snakemake, the default is to use the first rule in the Snakefile as the target.
Rules as targets
Giving the name of a rule to Snakemake on the command line only works when that rule has no wildcards in the outputs, because Snakemake has no way to know what the desired wildcards might be. You will see the error "Target rules may not contain wildcards." This can also happen when you don't supply any explicit targets on the command line at all, and Snakemake tries to run the first rule defined in the Snakefile.
Rules that combine multiple inputs
Our generate_run_files rule is a rule which takes a list of input files. The length of that list is not fixed by the rule, but can change based on NTASK_SIZES.
In our workflow the final step is to take all the generated files and combine them into a plot. To do that, you may have heard that some people use a Python library called matplotlib. It's beyond the scope of this tutorial to write the Python script to create a final plot, so we provide you with the script as part of this lesson. You can download it with
The script plot_terse_amdahl_results.py needs a command line that looks like:
BASH
python plot_terse_amdahl_results.py <output jpeg filename> <1st input file> <2nd input file> ...
Let's introduce that into our generate_run_files rule:
PYTHON
rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
    shell:
        "python plot_terse_amdahl_results.py {output} {input}"
Now we finally get to generate a scaling plot! Run the final Snakemake command
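BASH
# same target name as before; the rule now writes the plot to it
[ocaisa@node1 ~]$ snakemake --profile cluster_profile p_0.999_runs.txt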
Comments in Snakefiles
In the above code, the line beginning # is a comment line. Hopefully you are already in the habit of adding comments to your own scripts. Good comments make any script more readable, and this is just as true with Snakefiles.