Use the “timeout” command to automatically restart jobs

Sometimes a job cannot finish within the 24-hour time limit and has to be restarted repeatedly until the calculation completes. Instead of manually checking the job state and resubmitting the jobscript, we can use the Linux command “timeout” to restart the job automatically.

For example, if you have a job that needs to run for more than 24 hours (assuming the job is checkpointed regularly and can be restarted simply by resubmitting it), you can prepare a Slurm jobscript “my_slurm_jobscript” like this:

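#!/bin/bash
#SBATCH --partition=workq
#SBATCH --nodes=1
#SBATCH --time=24:00:00

# Stop my_app after 23 hours, leaving time to resubmit before the Slurm limit.
timeout 23h srun --ntasks=32 my_app
# timeout exits with 124 when it had to stop the command.
if [[ $? -eq 124 ]]; then
  sbatch my_slurm_jobscript
fi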

In this job, “my_app” runs for at most 23 hours. If “my_app” is stopped by the “timeout” command, “timeout” returns the exit code 124 and the jobscript “my_slurm_jobscript” is automatically submitted again. The jobscript “my_slurm_jobscript” is resubmitted in this way until the calculation finishes within the time limit.
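
The exit-code behaviour described above can be checked on any Linux machine without Slurm. The following is a minimal sketch that uses “sleep” and “true” as stand-ins for “my_app”; the variable names are only illustrative.

```shell
#!/bin/bash
# Command that outlives its limit: timeout stops it and returns 124.
timeout 1s sleep 5
timed_out_status=$?

# Command that finishes in time: timeout passes through the command's own exit status.
timeout 5s true
finished_status=$?

echo "timed out: $timed_out_status, finished: $finished_status"
```

Only the first case would trigger a resubmission in the jobscript, because the test there compares the exit code with 124.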

Please note:
#1 The Slurm time limit (24 hours in this case) should be a little longer than the “timeout” time limit (23 hours in this case), so that “sbatch my_slurm_jobscript” can still be executed before the job is killed.
#2 Run a test to make sure this works for your own case before using it widely in your production runs.
#3 You still need to monitor your jobs regularly to make sure everything is working fine, as not all exceptions can be covered in one simple script.
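
One case that the simple script does not cover is a job that times out on every run and is therefore resubmitted indefinitely. A possible safeguard, sketched below, is to cap the number of restarts. The helper “should_resubmit” and the variables “RESTART_COUNT” and “MAX_RESTARTS” are hypothetical names introduced for this illustration and are not part of Slurm; the sketch assumes the counter is passed to the next job with the “--export” option of “sbatch”.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the command timed out (status 124)
# and the restart cap has not been reached.
# $1 = exit status of timeout, $2 = restarts so far, $3 = maximum restarts
should_resubmit() {
  [ "$1" -eq 124 ] && [ "$2" -lt "$3" ]
}

# Possible use inside the jobscript (RESTART_COUNT and MAX_RESTARTS are assumed names):
#   MAX_RESTARTS=5
#   timeout 23h srun --ntasks=32 my_app
#   status=$?
#   if should_resubmit "$status" "${RESTART_COUNT:-0}" "$MAX_RESTARTS"; then
#     sbatch --export=ALL,RESTART_COUNT=$(( ${RESTART_COUNT:-0} + 1 )) my_slurm_jobscript
#   fi
```

The exit status is saved in a variable straight after the “timeout” line because any later command would overwrite “$?”.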

For more information about the “timeout” command, use “man timeout”.