Capturing the termination of a SLURM job in advance

Did you know that you can ask SLURM to signal your job shortly before it terminates it? To do so, add the following directive to the header of your job script:

#SBATCH --signal=15@25

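The directive has the form --signal=<signal>@<seconds>: here, signal 15 (SIGTERM) is requested 25 seconds before the job reaches its time limit. SLURM sends the signal to the job steps (not to the batch shell itself), and it may send it up to 60 seconds earlier than requested. Your C program can capture the signal as shown in this code example: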

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ITERATION_TIME 10000

volatile sig_atomic_t done = 0;
 
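/* Keep the handler minimal: printf() is not async-signal-safe,
   so only set the flag here and report from main(). */
void term(int signum)
{
    (void)signum;
    done = 1;
}

int main(int argc, char *argv[])
{
    (void)argc;
    (void)argv;

    /* Install term() as the SIGTERM handler. */
    struct sigaction action;
    memset(&action, 0, sizeof(struct sigaction));
    action.sa_handler = term;
    sigaction(SIGTERM, &action, NULL);

    while (!done)
    {
        printf("\nStill running...");
        fflush(stdout);
        sleep(ITERATION_TIME);  /* returns early when SIGTERM arrives */
    }

    printf("\n\nTerm signal received.... \n ");
    printf("\n\nDoing clean termination tasks.... \n ");
    fflush(stdout);

    printf("\ndone.");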
    return 0;
}

In the job script, add the --signal line, and call your program as usual.

#!/bin/bash
#SBATCH --time=00:02:00 
#SBATCH --job-name=prog
#SBATCH --output=job.out
#SBATCH --ntasks=1 
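# Send signal 15 (SIGTERM) 25 seconds before the time limit
#SBATCH --signal=15@25

# The signal is delivered to this job step, not to the batch shell
srun -n 1 ./my_prog

# Keep the batch script alive until SLURM kills it at the time limit
for i in 1 2 3 4 5 6 7 8 9 10; do
   date
   echo waiting to be killed....
   sleep 10
done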

exit 0

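When this job is submitted, the C program traps the SIGTERM signal and terminates cleanly. The batch script then keeps running until SLURM kills it at the time limit, as the output below shows. Note that in this run the signal arrived 60 seconds before the time limit rather than 25, which is within the margin SLURM allows itself when sending the signal: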

slurmstepd: error: *** STEP 19499642.0 ON nid00476 CANCELLED AT 2021-04-01T15:07:20 ***
Still running...

Term signal received.... 

Doing clean termination tasks.... 
 
done.Thu Apr  1 15:07:21 +03 2021
waiting to be killed....
Thu Apr  1 15:07:31 +03 2021
waiting to be killed....
Thu Apr  1 15:07:41 +03 2021
waiting to be killed....
Thu Apr  1 15:07:51 +03 2021
waiting to be killed....
Thu Apr  1 15:08:01 +03 2021
waiting to be killed....
Thu Apr  1 15:08:11 +03 2021
waiting to be killed....
slurmstepd: error: *** JOB 19499642 ON nid00476 CANCELLED AT 2021-04-01T15:08:20 DUE TO TIME LIMIT ***