Running commands with Snakemake

Last updated on 2024-02-01

Estimated time: 60 minutes

Overview

Questions

How do I run a simple command with Snakemake?

Objectives

Create a Snakemake recipe (a Snakefile)

What is the workflow I’m interested in?

In this lesson we will set up an experiment that takes an application which runs in parallel and investigates its scalability. To do that we need to gather data, which in this case means running the application multiple times with different numbers of CPU cores and recording the execution time. Once we’ve done that, we need to create a visualisation of the data to see how it compares against the ideal case.

From the visualisation we can then decide at what scale it makes most sense to run the application in production, so as to maximise the use of our CPU allocation on the system.
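As a rough illustration of what "comparing against the ideal case" can mean, the sketch below computes speedup and parallel efficiency from a set of timings. The timing values are invented purely for illustration and are not measurements from this lesson. The function name scaling_table is likewise hypothetical.

```python
# Hypothetical wall-clock times in seconds, keyed by number of CPU cores.
# These values are made up for illustration; they are NOT real measurements.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0}


def scaling_table(timings):
    """Return (cores, speedup, efficiency) tuples relative to the smallest core count."""
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]   # how much faster than the baseline run
        ideal = cores / base_cores             # ideal (linear) speedup for this core count
        efficiency = speedup / ideal           # 1.0 means perfect scaling
        rows.append((cores, speedup, efficiency))
    return rows


for cores, speedup, efficiency in scaling_table(timings):
    print(f"{cores:>2} cores: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

An efficiency that drops well below 1.0 as cores are added suggests that the extra cores are no longer being used effectively.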

We could do all of this manually, but there are useful tools to help us manage data analysis pipelines like the one in our experiment. Today we’ll learn about one of those: Snakemake.

To get started with Snakemake, let’s take a simple command and see how we can run it via Snakemake. We’ll use the command hostname, which prints out the name of the host where the command is executed:

BASH

[ocaisa@node1 ~]$ hostname

OUTPUT

node1.int.jetstream2.hpc-carpentry.org

That prints out the result, but Snakemake relies on files to know the status of your workflow, so let’s redirect the output to a file:

BASH

[ocaisa@node1 ~]$ hostname > hostname_login.txt
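To see why a file makes a convenient status marker, here is a deliberately simplified mental model written in Python. It is not how Snakemake is implemented, and Snakemake's real decision logic considers more than file existence. The helper name make_if_missing is hypothetical.

```python
import subprocess
from pathlib import Path


def make_if_missing(output, command):
    """Run `command` through the shell only if the file `output` does not exist yet.

    Returns True if the command was run, False if the output was already present.
    This is a simplified illustration, not Snakemake's actual logic.
    """
    if Path(output).exists():
        return False  # the output file is already there, so there is nothing to do
    subprocess.run(command, shell=True, check=True)
    return True
```

Calling make_if_missing("hostname_login.txt", "hostname > hostname_login.txt") twice in a row would run the command the first time only, because the second call finds the file already in place.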

Making a Snakefile

Create a new text file named Snakefile and open it in your text editor.

Contents of Snakefile:

PYTHON

rule hostname_login:
    output: "hostname_login.txt"
    input:  
    shell:
        "hostname > hostname_login.txt"

Key points about this file

The file is named Snakefile - with a capital S and no file extension.
Some lines are indented. Indents must be made with space characters, not tabs. See the setup section for how to make your text editor do this.
The rule definition starts with the keyword rule followed by the rule name, then a colon.
We named the rule hostname_login. You may use letters, numbers or underscores, but the rule name must begin with a letter and may not be a keyword.
The keywords input, output, shell are all followed by a colon.
The file names and the shell command are all in "quotes".
The output filename is given before the input filename. In fact, Snakemake doesn’t care what order they appear in but we give the output first throughout this course. We’ll see why soon.
In this use case there is no input file for the command so we leave this blank.
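Two of the points above (spaces rather than tabs, and the naming rule for rules) are easy to check mechanically. The sketch below is a hypothetical helper, not part of Snakemake, and Snakemake itself will report such problems when it runs. It uses Python's keyword list as a stand-in for "keyword", so Snakemake's own reserved words are not covered.

```python
import keyword
import re

# Matches a line such as "rule hostname_login:" and captures the rule name.
RULE_LINE = re.compile(r"^rule\s+(\S+?)\s*:")
# Letters, numbers or underscores, beginning with a letter (as described above).
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_snakefile_text(text):
    """Return a list of human-readable problems found in Snakefile text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace of the line, i.e. its indentation.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {number}: tab character in indentation")
        match = RULE_LINE.match(line)
        if match:
            name = match.group(1)
            # Only Python keywords are checked here (an assumption of this sketch).
            if not VALID_NAME.match(name) or keyword.iskeyword(name):
                problems.append(f"line {number}: invalid rule name {name!r}")
    return problems
```

Passing the contents of the Snakefile shown above to check_snakefile_text should return an empty list.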

Back in the shell we’ll run our new rule. At this point, if there were any missing quotes, bad indents, etc. we may see an error.

BASH

$ snakemake -j1 -p hostname_login

`bash: snakemake: command not found...`

If your shell tells you that it cannot find the command snakemake, then we need to make the software available. In our case, this means searching for the module that we need to load:

BASH

[ocaisa@node1 ~]$ module spider snakemake

OUTPUT

--------------------------------------------------------------------------------------------------------
  snakemake:
--------------------------------------------------------------------------------------------------------
     Versions:
        snakemake/8.2.1-foss-2023a
        snakemake/8.2.1 (E)

Names marked by a trailing (E) are extensions provided by another module.


--------------------------------------------------------------------------------------------------------
  For detailed information about a specific "snakemake" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider snakemake/8.2.1
--------------------------------------------------------------------------------------------------------

We want this module, so let’s load it to make the package available:

BASH

[ocaisa@node1 ~]$ module load snakemake

Then make sure the snakemake command is available:

BASH

[ocaisa@node1 ~]$ which snakemake

OUTPUT

/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/snakemake/8.2.1-foss-2023a/bin/snakemake
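The same check can be made from Python, which may be handy inside a script that needs to confirm the command is present before calling it. This is a minimal sketch using the standard library's shutil.which. The wrapper name find_command is hypothetical.

```python
import shutil


def find_command(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_command("snakemake")
if path is None:
    print("snakemake not found; the module may not be loaded yet")
else:
    print(f"snakemake found at {path}")
```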

Running Snakemake

Run snakemake --help | less to see the help for all available options. What does the -p option in the snakemake command above do?

1. Protects existing output files
2. Prints the shell commands that are being run to the terminal
3. Tells Snakemake to only run one process at a time
4. Prompts the user for the correct input file

Hint: you can search in the text by pressing /, and quit back to the shell with q

Solution

Prints the shell commands that are being run to the terminal

This is so useful that we don’t know why it isn’t the default! The -j1 option is what tells Snakemake to only run one process at a time, and we’ll stick with this for now as it makes things simpler. Answer 4 is a total red herring, as Snakemake never prompts interactively for user input.

Key Points

Before running Snakemake you need to write a Snakefile
A Snakefile is a text file which defines a list of rules
Rules have inputs, outputs, and shell commands to be run
You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule