Uppmax basics - Submitting jobs

Slurm, sbatch, the job queue

  • Problem: 1000 users, 500 nodes, 10k cores
    • Need a queue:

queue1.png

x-axis: cores, one thread per core
y-axis: time

  • We use Slurm to handle the queue
  • Plan your job and put it in the Slurm job batch (sbatch)
    • sbatch <flags> <program> or
    • sbatch <job script>
  • Easiest to schedule single-threaded, short jobs

queue2.png queue3.png

Left: 4 one-core jobs can run immediately (or a 4-core wide job).

  • The jobs are too long to fit on cores 9-13.

Right: A 5-core job has to wait.

  • Too long to fit in cores 9-13 and too wide to fit in the last cores.

 

Jobs

  • Job = what will happen during booked time
  • Described in a Bash script file
    • Slurm parameters (flags)
    • Load software modules
    • (Move around file system)
    • Run programs
    • (Collect output)
  • ... and more

 

Slurm parameters

  • We present the basics here. On Wednesday afternoon we go deeper!

Most important slurm flags

  • 1 mandatory setting for jobs:
    • Which compute project? (-A)
  • 3 settings you really should set:
    • Type of queue? (-p)
    • How many cores? (-n)
    • How long at most? (-t)
  • If in doubt:
    • -p core
    • -n 1
    • -t 7-00:00:00
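Put together, the "if in doubt" values above form the top of a job script. A minimal sketch (the project name snic2022-22-50 is the course project used elsewhere on this page):

```shell
#!/bin/bash -l
#SBATCH -A snic2022-22-50   # compute project (mandatory)
#SBATCH -p core             # queue type: core
#SBATCH -n 1                # number of cores: 1
#SBATCH -t 7-00:00:00       # time limit: at most 7 days

# #SBATCH lines are ordinary comments to bash: Slurm reads them at
# submission, but the script still runs as a normal bash script.
msg="flags read by Slurm, script run by bash"
echo "$msg"
```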

Type of queue -p

  • Where should it run?
    • several nodes: -p node
    • within one node: -p core
  • Use a whole node or just part of it?
    • 1 node = 20 cores (16 on Bianca & Snowy)
    • 1 hour walltime = 20 core hours = rather expensive
    • Waste of resources unless you have a parallel program or need all the memory, e.g. 128 GB per node
  • Short test jobs (< 1 hour)
    • -p devcore (within one node)
    • -p devel (across nodes)
  • Default value: core

 

How long is the job? -t

  • Jobs are killed when the time limit is reached
  • You are only charged for the time actually used
  • -t = time (hh:mm:ss)
    • 78:00:00 or 3-6:00:00
  • Default value: 7-00:00:00
  • Always overestimate by ~50%
    • things can sometimes take longer
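The two -t forms above (78:00:00 and 3-6:00:00) are equivalent notations. A quick sanity check of the conversion, plus the ~50% margin, in plain bash arithmetic (the 8-hour estimate is just an example):

```shell
# d-hh:mm:ss vs hh:mm:ss: 3-6:00:00 is 3*24 + 6 = 78 hours
days=3; hours=6
total_hours=$(( days * 24 + hours ))
echo "3-6:00:00 = ${total_hours}:00:00"    # prints 78:00:00

# overestimate a measured or guessed runtime by ~50% before setting -t
estimate_hours=8
padded_hours=$(( estimate_hours * 3 / 2 ))
echo "-t ${padded_hours}:00:00"            # prints -t 12:00:00
```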

Efficient jobs

  • Use your booked cores or memory
    • (at least 50%)
  • Runtime longer than 1 hour
    • Combine shorter jobs if possible
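One way to combine short tasks is a loop inside a single job script. A sketch, where my_tool and the sample names are placeholders:

```shell
# Instead of submitting many <1 h jobs, loop over the inputs in one job:
count=0
for sample in sample1 sample2 sample3; do
    echo "processing ${sample}"    # placeholder for: my_tool ${sample}.txt
    count=$(( count + 1 ))
done
echo "ran ${count} tasks in one job"
```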
  • Ask UPPMAX support for help!

 

Interactive jobs

  • Most work runs best as submitted batch jobs, but some tasks, e.g. development, need responsiveness.
  • Interactive jobs are high-priority but limited in -n and -t
  • Quickly gives you a job and logs you in to the compute node
  • Requires same Slurm parameters as other jobs
  • Try it:
$ interactive -A snic2022-22-50 -p core -n 1 -t 10:00
  • Which node are you on?
    • Logout with Ctrl-D or 'exit'
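To answer "which node are you on?" once the interactive shell starts, hostname is the standard check; SLURM_JOB_ID is a variable Slurm sets inside a job (empty on a login node):

```shell
node=$(hostname)               # name of the node you landed on
echo "running on ${node}"
# set inside a Slurm job; falls back to "none" elsewhere:
echo "job id: ${SLURM_JOB_ID:-none}"
```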

 

A simple job script template


#!/bin/bash -l 
# The first line says the script is written in bash; -l starts a login session with a "clean environment", e.g. no modules loaded and paths reset

#SBATCH -A snic2022-22-50  # Project name

#SBATCH -p devcore  # Queue type: short test job, using cores within one node (as opposed to whole nodes)

#SBATCH -n 1  # Number of cores

#SBATCH -t 00:10:00  # Ten minutes

#SBATCH -J Template_script  # Name of the job

# go to some directory

cd /proj/introtouppmax/completed/
pwd
echo ""
ls -l
echo ""
# load software modules

module load bioinfo-tools
module list

# do something

echo "Hello world!"
echo ""

Other Slurm tools

 

More on Wednesday!

Exercise: Submitting a job

  • Copy the script just further up!
  • Put it into a file named “jobtemplate.sh”
  • Make the file executable (chmod)
  • Submit the job:
$ sbatch jobtemplate.sh
  • Note the job id!
  • Check the queue:
$ squeue -u <username>
$ jobinfo -u <username>
  • When it’s done, look for the output file (slurm-<jobid>.out):
$ ls -lrt
  • Check the output file to see if it ran correctly
$ cat <filename>
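If you want to sanity-check the script logic before submitting, you can dry-run it locally with bash: outside Slurm, the #SBATCH lines are ignored as comments, and sbatch itself only exists on the cluster. A sketch:

```shell
# Write a copy of the template and run it directly with bash.
cat > jobtemplate.sh <<'EOF'
#!/bin/bash -l
#SBATCH -A snic2022-22-50
#SBATCH -p devcore
#SBATCH -n 1
#SBATCH -t 00:10:00
echo Hello world!
EOF
chmod +x jobtemplate.sh    # same chmod step as in the exercise
out=$(bash jobtemplate.sh) # on UPPMAX you would run: sbatch jobtemplate.sh
echo "$out"
```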