Chapter 5: SLURM Job Scheduler

Resource sharing on a supercomputer dedicated to technical and/or scientific computing is often organized by a piece of software called a resource manager or job scheduler. Users submit jobs, which are scheduled and allocated resources (CPU time, memory, etc.) by the resource manager.

Slurm is a resource manager and job scheduler designed to do just that, and much more. It was originally created by people at the Livermore Computing Center, and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.

Gathering information

Slurm offers many commands you can use to interact with the system. For instance, the sinfo command gives an overview of the resources offered by the cluster, while the squeue command shows to which jobs those resources are currently allocated.

By default, sinfo lists the partitions that are available. A partition is a set of compute nodes (computers dedicated to computing) grouped logically. Typical examples include partitions dedicated to batch processing, debugging, post-processing, or visualization.

# sinfo
PARTITION   AVAIL JOB_SIZE  TIMELIMIT   CPUS  S:C:T   NODES STATE     
workq*      up    1-infini 1-00:00:00     64 2:16:2       1 drained*  
workq*      up    1-infini 1-00:00:00     64 2:16:2       1 down*     
workq*      up    1-infini 1-00:00:00     64 2:16:2      15 drained   
workq*      up    1-infini 1-00:00:00     64 2:16:2       7 reserved  
workq*      up    1-infini 1-00:00:00     64 2:16:2    6145 allocated
workq*      up    1-infini 1-00:00:00     64 2:16:2       5 idle      
72hours     up    1        3-00:00:00     64 2:16:2       1 drained*  
72hours     up    1        3-00:00:00     64 2:16:2       1 down*     
72hours     up    1        3-00:00:00     64 2:16:2      15 drained   
72hours     up    1        3-00:00:00     64 2:16:2       7 reserved  
72hours     up    1        3-00:00:00     64 2:16:2    6145 allocated
72hours     up    1        3-00:00:00     64 2:16:2       5 idle      

In the above example, we see two partitions, named workq and 72hours. The former is the default partition, as it is marked with an asterisk. Five nodes of the workq partition are idle, while 6145 are being used.

The squeue command shows the list of jobs which are currently running (in the RUNNING state, noted as 'R') or waiting for resources (in the PENDING state, noted as 'PD').

# squeue
JOBID PARTITION NAME USER ST  TIME  NODES NODELIST(REASON)
12345     workq job1 dave  R   0:21     4 nid[1779-1782]
12346     workq job2 dave PD   0:00     8 (Resources)
12348     workq job3 ed   PD   0:00     4 (Priority)

The above output shows that one job is running, whose name is job1 and whose jobid is 12345. The jobid is a unique identifier used by many Slurm commands when actions must be taken on one particular job. For instance, to cancel job job1, you would use scancel 12345. TIME is the time the job has been running so far. NODES is the number of nodes allocated to the job, while the NODELIST column lists the nodes which have been allocated to running jobs. For pending jobs, that column gives the reason why the job is pending. In the example, job 12346 is pending because resources (CPUs or otherwise) are not available in sufficient amounts, while job 12348 is waiting for job 12346, whose priority is higher, to run. Each job is indeed assigned a priority depending on several parameters whose details are beyond the scope of this document. Note that the priority of pending jobs can be obtained with the sprio command.

There are many switches you can use to filter the output by user (--user), by partition (--partition), by state (--state), etc. As with the sinfo command, you can choose what you want squeue to output with the --format parameter.
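For example, the following commands sketch typical uses of those switches (the user name and format fields are illustrative; adapt them to your needs):

```shell
# Show only the pending jobs of user dave
squeue --user=dave --state=PENDING

# Choose the columns: jobid, partition, name, state, elapsed time, node count
squeue --format="%.10i %.9P %.12j %.2t %.10M %.6D"
```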

Creating a job

Now the question is: How do you create a job?

A job consists of two parts: resource requests and job steps. Resource requests consist of a number of CPUs, an expected run duration, amounts of RAM or disk space, etc. Job steps describe the tasks that must be done, i.e. the software that must be run.

The typical way of creating a job is to write a submission script. A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage (man sbatch).

The script itself is a job step. Other job steps are created with the srun command.

For instance, the following script, hypothetically named submit.sh,

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --partition=workq
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun hostname
srun sleep 60

would request one CPU for 10 minutes, along with 100 MB of RAM, in the default queue. When started, the job would run a first job step, srun hostname, which launches the UNIX command hostname on the node on which the requested CPU was allocated. Then, a second job step starts the sleep command. Interestingly, you can get near-realtime information about your program (memory consumption, etc.) with the sstat command. Note that the --job-name parameter allows giving a meaningful name to the job and the --output parameter defines the file to which the output of the job must be sent.
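For instance, assuming the job above was assigned jobid 99999999 (a placeholder value), monitoring it while it runs might look like this sketch:

```shell
# Report per-step resource usage of the running job:
# step id, peak resident memory, and average CPU time
sstat --jobs=99999999 --format=JobID,MaxRSS,AveCPU
```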

Once the submission script is written properly, you need to submit it to Slurm through the sbatch command, which, upon success, responds with the jobid attributed to the job. (The dollar sign below is the shell prompt.)

$ sbatch submit.sh
sbatch: Submitted batch job 99999999

The job then enters the queue in the PENDING state. Once resources become available and the job has highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state, otherwise, it is set to the FAILED state.

Upon completion, the output file contains the result of the commands run in the script file. In the above example, you can see it with cat res.txt.

Going parallel

But, still, the real question is: How do you create a parallel job?

There are several ways a parallel job, one whose tasks are run simultaneously, can be created:

  • by running a multi-process program (SPMD paradigm, e.g. with MPI)
  • by running a multithreaded program (shared memory paradigm, e.g. with OpenMP or pthreads)
  • by running several instances of a single-threaded program (so-called embarrassingly parallel paradigm)
  • by running one master program controlling several slave programs (master/slave paradigm)

In the Slurm context, a task is to be understood as a process. So a multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs.

Tasks are requested/created with the --ntasks option, while CPUs, for the multithreaded programs, are requested with the --cpus-per-task option. Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node. By contrast, requesting the same amount of CPUs with the --ntasks option may lead to several CPUs being allocated on several, distinct compute nodes.
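As a sketch, the two request styles can be contrasted as follows; both ask for four CPUs in total, but only the second guarantees that all of them sit on the same compute node:

```shell
# Four tasks: the four CPUs may be spread over several nodes
#SBATCH --ntasks=4

# One task with four CPUs: all four CPUs on the same compute node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
```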

Interactive jobs

The following command is used to "reserve" compute nodes and obtain an interactive session on the gateway:

cdl$ salloc

If an interactive session is required directly on a compute node, the following command should be used:

gateway$ srun -u --pty bash -i
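Putting the two together, a typical interactive workflow might look like the following sketch (the node count and time limit are illustrative):

```shell
# Reserve one node for 30 minutes, then open a shell on it
cdl$ salloc --nodes=1 --time=30:00
gateway$ srun -u --pty bash -i
```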

More submission script examples

Here are some quick sample submission scripts. For more detailed information, make sure to have a look at the Slurm FAQ and to follow our training sessions.

Belonging to more than one project

If you are a member of more than one project, you will also need to specify the account number to be used:

#SBATCH --account=kXXXX

Message passing example (MPI)

#!/bin/bash
#
#SBATCH --job-name=test_mpi
#SBATCH --output=res_mpi.txt
#SBATCH --partition=workq
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun hello.mpi

This requests four cores on the cluster for 10 minutes, using 100 MB of RAM per core. Assuming hello.mpi was compiled with MPI support, srun will create four instances of it on the nodes allocated by Slurm.

You can try the above example by downloading the example hello world program from Wikipedia (name it for instance wiki_mpi_example.c), and compiling it with

cc wiki_mpi_example.c -o hello.mpi

The res_mpi.txt file should contain something like

0: We have 4 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty

Shared memory example (OpenMP)

#!/bin/bash
#
#SBATCH --job-name=test_omp
#SBATCH --output=res_omp.txt
#SBATCH --partition=workq
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hello.omp

The job will be run in an allocation where four cores have been reserved on the same compute node.

You can try it by using the hello world program from Wikipedia (name it for instance wiki_omp_example.c) and compiling it with the GNU compiler:

module load PrgEnv-gnu

cc -fopenmp wiki_omp_example.c -o hello.omp

The res_omp.txt file should contain something like

Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
There are 4 threads

Embarrassingly parallel workload example

#!/bin/bash
#
#SBATCH --job-name=test_emb
#SBATCH --output=res_emb.txt
#SBATCH --partition=workq
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun printenv SLURM_PROCID

In that configuration, the printenv command will be run four times, and each will have its environment variable SLURM_PROCID set to a distinct value.

This setup is useful if the program is based on random draws (e.g. Monte-Carlo simulations): the application permitting, you can have four programs drawing 1000 samples and combine their output (with another program) to get the equivalent of drawing 4000 samples.

Another typical use of this setting is a parameter sweep, where the same computation is carried out by each program except that some high-level parameter has a distinct value in each instance. Examples include optimisation of an integer-valued parameter through range scanning. In that case, each instance of the program simply has to look up the $SLURM_PROCID environment variable and decide, accordingly, which values of the parameter to test.

The same can be set up to process several data files for instance. Each instance of the program just has to decide which file to read based upon the value set in its $SLURM_PROCID environment variable.
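A minimal sketch of that pattern, assuming input files named input_0.dat, input_1.dat, etc. exist (the file names and the final program invocation are placeholders):

```shell
#!/bin/bash
# Each task selects the input file matching its rank;
# srun sets SLURM_PROCID to 0, 1, 2, ... across the parallel tasks.
FILE="input_${SLURM_PROCID}.dat"
# Replace the echo below with the actual program invocation
echo "Task ${SLURM_PROCID} would process ${FILE}"
```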

Upon completion, the above job will fill res_emb.txt with four lines:

0
1
2
3

Master/slave program example

#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
#SBATCH --partition=workq
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun --multi-prog multi.conf

With file multi.conf being, for example, as follows

0      echo     'I am the Master'
1-3    printenv SLURM_PROCID

The above instructs Slurm to create four tasks (or processes): one running the echo command, acting as the master, and the other three running printenv, acting as the slaves. This is typically used in a producer/consumer setup where one program (the master) creates computing tasks for the other programs (the slaves) to perform.
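As a sketch of a variant, the multi-prog configuration file also supports replacement patterns such as %t, which expands to the task number and can be passed to a program as an argument (the program names below are placeholders):

```shell
# multi.conf variant: each slave receives its task id as an argument
0      ./master            # placeholder master program
1-3    ./slave %t          # %t expands to the task number (1, 2, 3)
```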

Upon completion of the above job, file res_ms.txt will contain

I am the Master
1
2
3

Queues

workq: This is the default queue, the maximum wall clock time for jobs is 24 hours. There is also a limit of 800 jobs per user.

72hours: There are 512 nodes available in this queue with a maximum wall clock time of 72 hours. There is also a limit of 80 jobs per user in this queue. Use of the 72hours queue is restricted to projects that have applied and been approved by the RCAC. To use the 72hours queue, the following two lines need to be added to the job submission file:

#SBATCH --partition=72hours
#SBATCH --qos=72hours

debug: There are 16 nodes available in this queue with a maximum wall clock of 30 minutes and a maximum job size of 4 nodes.

#SBATCH --partition=debug

Large Memory Nodes

We have 4 nodes (nid000[32-35]) available with 256 GB of memory. Jobs can be queued to these nodes by specifying a larger memory requirement for the job:

#SBATCH --mem=262144

These nodes are not available in the 72hours queue.
