Create and run a job

Creating a job

A job consists of two parts: resource requests and job steps. Resource requests describe the resources needed, such as a number of CPUs, the expected duration of the computation, and amounts of RAM or disk space. Job steps describe the tasks to be performed, i.e. the software to be run.

The typical way of creating a job is to write a submission script. A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage (man sbatch).

The script itself is a job step. Other job steps are created with the srun command.

For instance, the following script, hypothetically named submit.sh,

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60

would request one CPU for 10 minutes, along with 100 MB of RAM, in the default queue. When started, the job would run a first job step, srun hostname, which launches the UNIX command hostname on the node on which the requested CPU was allocated. Then, a second job step will start the sleep command. While the job is running, you can get near-realtime information about it (memory consumption, etc.) with the sstat command (sstat -j <job-id>). Note that the --job-name parameter allows giving a meaningful name to the job, the --output parameter defines the file to which the standard output of the job must be sent, and the --error parameter does the same for the standard error.
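For example, assuming the job was assigned ID 99999999 (as in the submission example below) and that accounting is configured on the cluster, a sketch of such a query could be:

$ sstat -j 99999999 --format=JobID,MaxRSS,AveCPU

The --format option selects the fields to report; the complete list is in the sstat manpage.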

Submitting

Once the submission script is written properly, you need to submit it to Slurm through the sbatch command, which, upon success, responds with the job ID attributed to the job. (The dollar sign below is the shell prompt.)

$ sbatch submit.sh
sbatch: Submitted batch job 99999999

The job then enters the queue in the PENDING state. Once resources become available and the job has the highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state; otherwise, it is set to the FAILED state.
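You can follow these state transitions with the squeue command, for instance with the job ID from the example above:

$ squeue -j 99999999

The ST column of the output shows the abbreviated job state: PD for PENDING, R for RUNNING.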

Upon completion, the output file contains the result of the commands run in the script file. In the above example, you can see it with cat test.out; any error messages end up in test.err.
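Once the job has left the queue, sstat no longer works; assuming job accounting is enabled on your cluster, the sacct command retrieves the same kind of information for past jobs from the accounting database, e.g.:

$ sacct -j 99999999 --format=JobID,JobName,State,Elapsed,MaxRSS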

Choosing the right target

One might want to run a job on a specific type of processor (Intel or AMD CPU, or a GPU), or on a node with a certain amount of local storage or memory.

This can be achieved by enforcing constraints at job submission time. For example, to run exclusively on an Intel node:

$ sbatch --constraint=intel submit.sh
sbatch: Submitted batch job 99999999
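Equivalently, the constraint can be written into the submission script itself, so it is applied every time the script is submitted:

#SBATCH --constraint=intel

The feature names defined on the nodes (intel, amd, etc., as configured by the administrators of your cluster) can be listed with sinfo, e.g. sinfo -o "%20N %f".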

More information about the --constraint option can be found in the sbatch manpage.

Job scheduling limits

The following limits have been applied to the default QOS in the batch partition:

    Max Jobs Per User = 2000

      This restricts a user from having more than 2000 jobs (in any
      state). Jobs in excess of this limit will not be accepted. Once
      the number of jobs in the queue drops below this limit, new jobs
      will be accepted again.

    Max CPU Cores Per User = 1024

      This restricts a user from using, at any time, more than 1024
      CPU cores. When this limit is reached, no more jobs will be
      scheduled until the number of used cores drops below it. In that
      case, Slurm shows the reason the jobs are not being started as
      "association resource limit".