Running Multiple Parallel Jobs Simultaneously
On Shaheen, compute nodes are exclusive: even when a job does not use all the resources of a node, no other job can access them. By default, concurrent srun executions in the regular partition cannot share compute nodes under SLURM either, so make sure that the total number of cores required fits on the number of nodes requested. In the following example, a total of 9 nodes is required (2 + 3 + 4, one group of nodes per srun). Note the "&" at the end of each srun command, which runs the command in the background so that the next one can start immediately. The "wait" command at the end of the script is essential: it ensures that the batch job does not exit before all the simultaneous srun commands have completed.
#!/bin/bash
#SBATCH -N 9
#SBATCH -t 0:15:00

# Each srun runs in the background ("&") on its own set of nodes: 2 + 3 + 4 = 9.
srun --hint=nomultithread -N 2 --ntasks=64 --ntasks-per-node=32 --ntasks-per-socket=16 ./my_exe_1 &
srun --hint=nomultithread -N 3 --ntasks=96 --ntasks-per-node=32 --ntasks-per-socket=16 ./my_exe_2 &
srun --hint=nomultithread -N 4 --ntasks=128 --ntasks-per-node=32 --ntasks-per-socket=16 ./my_exe_3 &

# Block until all background srun commands have finished.
wait
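The node count requested in the script above follows from the task counts: each srun needs its number of tasks divided by the tasks per node, rounded up, and the sum must not exceed the allocation. The following is a minimal sketch of that check in plain bash, not part of the Shaheen tooling. The function name nodes_needed and the variable names are illustrative assumptions; the numbers are taken from the example script.

```shell
#!/bin/bash
# Sketch: check that the per-srun node counts fit in the allocation.
# Values mirror the example script; names are illustrative only.

tasks_per_node=32          # from --ntasks-per-node=32
allocated_nodes=9          # from "#SBATCH -N 9"
step_tasks=(64 96 128)     # --ntasks of each srun command

# Ceiling division: nodes needed to host a given number of tasks.
nodes_needed() {
    local tasks=$1 per_node=$2
    echo $(( (tasks + per_node - 1) / per_node ))
}

total_nodes=0
for tasks in "${step_tasks[@]}"; do
    n=$(nodes_needed "$tasks" "$tasks_per_node")
    echo "srun with $tasks tasks needs $n node(s)"
    total_nodes=$(( total_nodes + n ))
done

echo "total nodes needed: $total_nodes (allocated: $allocated_nodes)"
if (( total_nodes > allocated_nodes )); then
    echo "error: the srun commands do not fit in the allocation" >&2
    exit 1
fi
```

With the values from the example, the three srun commands need 2, 3 and 4 nodes, which adds up to the 9 nodes requested.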
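To run the srun commands sequentially instead, remove the "&" at the end of each srun line. Each command then starts only after the previous one has finished, and the final "wait" is no longer needed.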
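For the simultaneous case, note that in bash a bare "wait" returns zero regardless of how the background commands exited, so a failed srun can go unnoticed. One possible way to catch failures is to record the PID of each background command and wait on each PID individually. The sketch below illustrates the idea with a hypothetical stand-in function, run_step, in place of the srun lines so that it runs anywhere; it is an illustration of the shell mechanism, not a Shaheen-specific recipe.

```shell
#!/bin/bash
# Sketch: wait on each background command by PID so that a failure
# is not masked by a bare "wait".
# run_step is a hypothetical stand-in for an srun line.

run_step() {
    local name=$1 status=$2
    sleep 1                          # pretend to do some work
    echo "$name finished with status $status"
    return "$status"
}

pids=()
run_step step_1 0 & pids+=($!)
run_step step_2 3 & pids+=($!)       # simulated failure
run_step step_3 0 & pids+=($!)

failed=0
for pid in "${pids[@]}"; do
    # "wait PID" returns the exit status of that specific background job.
    if ! wait "$pid"; then
        failed=$(( failed + 1 ))
    fi
done

echo "$failed step(s) failed"
```

In a real batch script, each run_step call would be replaced by the corresponding srun line, and the script could exit with a non-zero status when failed is greater than zero so that the job is reported as failed.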