Gathering Information

Slurm offers many commands you can use to interact with the system. For instance, the sinfo command gives an overview of the resources offered by the cluster, while the squeue command shows to which jobs those resources are currently allocated.

By default, sinfo lists the partitions that are available. A partition is a set of compute nodes grouped logically. Typical examples include partitions dedicated to batch processing or debugging.

# sinfo

PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
batch*       up 120-00:00:     2 drain* dbn306-31-r,kccn708-28-17
batch*       up 120-00:00:   158  down* cbrcs712-[21,23],db501-04-[4-7], ...
batch*       up 120-00:00:     1  idle* dgpu501-22-r
batch*       up 120-00:00:     2   comp db809-22-5,kccn708-28-07
batch*       up 120-00:00:   126   drng db508-01-1,db508-11-8,db512-11-7, ...
batch*       up 120-00:00:    11  drain db510-21-3,db513-11-2,db803-22-2, ...
batch*       up 120-00:00:   235    mix besest711-18,db508-01-[2-6,8], ...
batch*       up 120-00:00:    35   resv cbrcs712-19,db510-01-6,db512-01-3, ...
batch*       up 120-00:00:   275  alloc db508-21-[3-4,6],db509-01-4,db510-01-1, ...
batch*       up 120-00:00:    14   idle db813-12-8,dgpu501-22-l,dgpu501-26-r, ...
debug        up    1:00:00     4   idle dgpu703-29,sdb712-01-[01-03]

In the above example, we see two partitions, named batch and debug. The former is the default partition, as it is marked with an asterisk. In the batch partition, 15 nodes are idle (one of them, shown as idle*, is not responding), 275 are in use (alloc state), and 235 more are partially in use (mix state).
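The node counts quoted above can be checked by summing the NODES column per partition and state. The following is a minimal Python sketch, not part of Slurm itself: it works on a few lines copied from the example output rather than calling sinfo, the function name count_nodes_by_state is hypothetical, and it assumes the default column order (partition, availability, time limit, node count, state, node list).

```python
from collections import Counter

# A few lines taken from the sinfo example (default column order assumed).
SINFO_SAMPLE = """\
batch*       up 120-00:00:     1  idle* dgpu501-22-r
batch*       up 120-00:00:   235    mix besest711-18,db508-01-[2-6,8], ...
batch*       up 120-00:00:   275  alloc db508-21-[3-4,6],db509-01-4,db510-01-1, ...
batch*       up 120-00:00:    14   idle db813-12-8,dgpu501-22-l,dgpu501-26-r, ...
debug        up    1:00:00     4   idle dgpu703-29,sdb712-01-[01-03]
"""


def count_nodes_by_state(sinfo_text):
    """Return a Counter keyed by (partition, state) holding node counts."""
    counts = Counter()
    for line in sinfo_text.splitlines():
        fields = line.split()
        # Skip blank lines, malformed lines, and a header line if present.
        if len(fields) < 5 or not fields[3].isdigit():
            continue
        partition = fields[0].rstrip("*")  # '*' marks the default partition
        state = fields[4].rstrip("*")      # '*' marks a node that is not responding
        counts[(partition, state)] += int(fields[3])
    return counts


counts = count_nodes_by_state(SINFO_SAMPLE)
print(counts[("batch", "idle")])   # 15 (14 idle + 1 idle*)
print(counts[("batch", "alloc")])  # 275
```

Stripping the trailing asterisk merges idle and idle* into one bucket, which is how the figure of 15 idle nodes is obtained. Keep the asterisk if you need to tell responding and non-responding nodes apart.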

The squeue command lists the jobs that are currently running (RUNNING state, noted as 'R') or waiting for resources (PENDING state, noted as 'PD').

# squeue
JOBID PARTITION NAME USER ST  TIME NODES NODELIST(REASON)
12345     batch job1 dave  R  0:21     1 cn605-14-r
12346     batch job2 dave PD  0:00     8 (Resources)
12348     batch job3 ed   PD  0:00     4 (Priority)

The above output shows that one job is running; its name is job1 and its jobid is 12345. The jobid is a unique identifier that many Slurm commands use to act on one particular job. For instance, to cancel this job, you would use scancel 12345.

TIME is how long the job has been running so far.

NODES is the number of nodes allocated to the job, and for running jobs the NODELIST column lists those nodes. For pending jobs, that column instead gives the reason why the job is pending. In the example, job 12346 is pending because resources (CPUs or others) are not available in sufficient amounts, while job 12348 is waiting for another job, whose priority is higher, to run. Each job is assigned a priority that depends on several parameters whose details are beyond the scope of this document. The priority of pending jobs can be obtained with the sprio command.
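To make the column layout concrete, here is a small Python sketch that splits the three job lines of the squeue example into named fields and extracts the pending reasons. It is only an illustration under stated assumptions: it parses a copy of the example text rather than calling squeue, the names COLUMNS and parse_squeue are hypothetical, and it assumes the default columns, purely numeric jobids, and job names without spaces.

```python
# The three job lines from the squeue example.
SQUEUE_SAMPLE = """\
12345     batch job1 dave  R   0:21     1 cn605-14-r
12346     batch job2 dave PD   0:00     8 (Resources)
12348     batch job3 ed   PD   0:00     4 (Priority)
"""

# Field names for the default squeue columns (the names themselves are illustrative).
COLUMNS = ("jobid", "partition", "name", "user", "state",
           "time", "nodes", "nodelist_or_reason")


def parse_squeue(text):
    """Turn squeue text lines into a list of dicts, one per job."""
    jobs = []
    for line in text.splitlines():
        # Limit the number of splits so the last column stays in one piece.
        fields = line.strip().split(None, len(COLUMNS) - 1)
        # Skip blank lines, malformed lines, and a header line if present.
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        jobs.append(dict(zip(COLUMNS, fields)))
    return jobs


jobs = parse_squeue(SQUEUE_SAMPLE)

# For pending jobs the last column holds the reason, wrapped in parentheses.
pending_reasons = {
    job["jobid"]: job["nodelist_or_reason"].strip("()")
    for job in jobs
    if job["state"] == "PD"
}
running = [job["jobid"] for job in jobs if job["state"] == "R"]

print(pending_reasons)  # {'12346': 'Resources', '12348': 'Priority'}
print(running)          # ['12345']
```

The jobids collected this way are the values you would pass to commands such as scancel.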

There are many switches you can use to filter the output by user (--user), by partition (--partition), by state (--states), etc. As with the sinfo command, you can choose what you want squeue to output with the --format parameter.
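As an illustration of combining these switches, the sketch below assembles a filtered squeue command line in Python. The helper build_squeue_command is hypothetical, and the user and partition values are taken from the example above. It uses the long option spelling --states and the format specifiers %i (jobid), %j (job name), %t (compact state) and %R (reason or node list) as documented in the squeue manual page. The command is only printed here; running it requires a machine where Slurm is installed.

```python
import shlex


def build_squeue_command(user=None, partition=None, states=None, fmt=None):
    """Build an argument list for squeue from optional filters."""
    cmd = ["squeue"]
    if user:
        cmd.append(f"--user={user}")
    if partition:
        cmd.append(f"--partition={partition}")
    if states:
        cmd.append(f"--states={states}")
    if fmt:
        cmd.append(f"--format={fmt}")
    return cmd


# Pending jobs of user dave in the batch partition, with a reduced set of columns.
cmd = build_squeue_command(
    user="dave",
    partition="batch",
    states="PENDING",
    fmt="%i %j %t %R",
)

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# On a cluster with Slurm available, it could be run with, for example:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems with the format string, which contains spaces.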