Advance capture of the termination of a SLURM Job
Did you know that you can ask SLURM to send your job a signal shortly before it terminates it? To do so, add the following directive to the header of your job script:
#SBATCH --signal=15@25
With this setting, SLURM sends signal 15 (SIGTERM) about 25 seconds before the job reaches its time limit. The signal is issued on the first node of the job and can be captured by your C program, as shown in this code example:
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ITERATION_TIME 10000

/* Set by the signal handler, polled by the main loop. */
volatile sig_atomic_t done = 0;

/* Only set the flag here: printf() is not async-signal-safe. */
void term(int signum)
{
    (void)signum;
    done = 1;
}

int main(void)
{
    struct sigaction action;

    memset(&action, 0, sizeof(struct sigaction));
    action.sa_handler = term;
    sigaction(SIGTERM, &action, NULL);

    while (!done) {
        printf("\nStill running...");
        fflush(stdout);
        /* sleep() returns early when the signal arrives. */
        sleep(ITERATION_TIME);
    }

    printf("\n\nTerm signal received.... \n ");
    printf("\n\nDoing clean termination tasks.... \n ");
    printf("\ndone.");
    fflush(stdout);
    return 0;
}
In the job script, add the --signal line, and call your program as usual.
#!/bin/bash
#SBATCH --time=00:02:00
#SBATCH --job-name=prog
#SBATCH --output=job.out
#SBATCH --ntasks=1
#SBATCH --signal=15@25

srun -n 1 ./my_prog

for i in 1 2 3 4 5 6 7 8 9 10; do
    date
    echo waiting to be killed....
    sleep 10
done

exit 0
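When this job is submitted, the SIGTERM signal is trapped by the C program, which terminates cleanly, as the output below shows. Note that the signal arrived 60 seconds before the time limit rather than 25: because of the resolution of SLURM's event handling, the signal may be sent up to 60 seconds earlier than specified. Once the program has exited, the batch script carries on with its loop until the job itself hits the time limit: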
slurmstepd: error: *** STEP 19499642.0 ON nid00476 CANCELLED AT 2021-04-01T15:07:20 ***

Still running...

Term signal received....

Doing clean termination tasks....

done.Thu Apr 1 15:07:21 +03 2021
waiting to be killed....
Thu Apr 1 15:07:31 +03 2021
waiting to be killed....
Thu Apr 1 15:07:41 +03 2021
waiting to be killed....
Thu Apr 1 15:07:51 +03 2021
waiting to be killed....
Thu Apr 1 15:08:01 +03 2021
waiting to be killed....
Thu Apr 1 15:08:11 +03 2021
waiting to be killed....
slurmstepd: error: *** JOB 19499642 ON nid00476 CANCELLED AT 2021-04-01T15:08:20 DUE TO TIME LIMIT ***
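The "clean termination tasks" step of the C program is a natural place to save state so that a follow-up job can resume. The following is a minimal sketch of one way to do it, assuming the state is a single iteration counter. The function names save_checkpoint and load_checkpoint and the ".tmp" suffix are hypothetical choices for illustration, not something SLURM provides.

```c
#include <stdio.h>

/* Write the counter to a temporary file, then rename() it over `path`,
 * so an interrupted write does not leave a truncated checkpoint behind.
 * Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, long iteration)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof(tmp))
        return -1;

    FILE *f = fopen(tmp, "w");
    if (f == NULL)
        return -1;
    if (fprintf(f, "%ld\n", iteration) < 0) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read the counter back into *iteration.
 * Returns 0 on success, -1 if the file is missing or malformed. */
int load_checkpoint(const char *path, long *iteration)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", iteration) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

In the program above, save_checkpoint() would be called from main() after the loop exits, not from the signal handler, since file I/O through stdio is not async-signal-safe. Whatever is saved has to fit comfortably within the lead time requested with --signal.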