KAUST Supercomputing Laboratory Newsletter, 1st April 2021
In this newsletter:
- Data Centre and Shaheen Maintenance, 22-29 April 2021
- Saudi HPC/AI Conference 2021
- RCAC meeting
- Tip of the week: Capturing the termination of a SLURM job in advance
- Upcoming Training
- Follow us on Twitter
- Previous Announcements
- Previous Tips
Data Centre and Shaheen Maintenance, 22-29 April 2021
Our next maintenance session on Shaheen will take place from 15:30 on the 22nd April until 17:00 on the 29th April. The data centre team will be performing their annual planned preventive maintenance (PPM) on the power supply equipment. At the same time, we will upgrade the software version and firmware on Shaheen and Neser’s existing Lustre (project and scratch) filesystem. This is an essential step before bringing our newly acquired filesystem online and providing more project storage space. As we are upgrading the Lustre filesystem, there will be no access to data during this period.
Please contact us at help@hpc.kaust.edu.sa should you have any concerns or questions.
Saudi HPC/AI Conference 2021
RCAC meeting
The project submission deadline for the next RCAC meeting is 30th April 2021. Please note that the RCAC meetings are held once per month. Projects received on or before the submission deadline will be included in the agenda for the subsequent RCAC meeting. The detailed procedures, updated templates and forms are available here: https://www.hpc.kaust.edu.sa/account-applications
Tip of the week: Capturing the termination of a SLURM job in advance
Did you know that you can ask SLURM to warn your job with a signal shortly before it terminates? To do so, add the following directive to the header of your job script:
#SBATCH --signal=15@25
With this setting, signal 15 (SIGTERM) will be issued on the first node of the job about 25 seconds before the end of its allocated time. The signal can be caught by your C program, as shown in this code example:
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ITERATION_TIME 10000

volatile sig_atomic_t done = 0;

/* Signal handler: record that SIGTERM was received so the main loop can exit. */
void term(int signum)
{
    done = 1;
    printf("\n\nTerm signal received.... \n");
    printf("\n\nDoing clean termination tasks.... \n");
    fflush(stdout);
}

int main(int argc, char *argv[])
{
    /* Install the handler for SIGTERM (signal 15). */
    struct sigaction action;
    memset(&action, 0, sizeof(struct sigaction));
    action.sa_handler = term;
    sigaction(SIGTERM, &action, NULL);

    /* Main work loop: sleep() returns early when the signal arrives. */
    while (!done) {
        printf("\nStill running...");
        fflush(stdout);
        sleep(ITERATION_TIME);
    }
    printf("\ndone.\n");
    return 0;
}
In the job script, add the --signal line, and call your program as usual.
#!/bin/bash
#SBATCH --time=00:02:00
#SBATCH --job-name=prog
#SBATCH --output=job.out
#SBATCH --ntasks=1
#SBATCH --signal=15@25

srun -n 1 ./my_prog

for i in 1 2 3 4 5 6 7 8 9 10; do
    date
    echo waiting to be killed....
    sleep 10
done
exit 0
When the job is submitted, the SIGTERM signal is correctly trapped by the C program, which terminates cleanly, as the example output below shows:
slurmstepd: error: *** STEP 19499642.0 ON nid00476 CANCELLED AT 2021-04-01T15:07:20 ***
Still running...
Term signal received....
Doing clean termination tasks....
done.
Thu Apr 1 15:07:21 +03 2021
waiting to be killed....
Thu Apr 1 15:07:31 +03 2021
waiting to be killed....
Thu Apr 1 15:07:41 +03 2021
waiting to be killed....
Thu Apr 1 15:07:51 +03 2021
waiting to be killed....
Thu Apr 1 15:08:01 +03 2021
waiting to be killed....
Thu Apr 1 15:08:11 +03 2021
waiting to be killed....
slurmstepd: error: *** JOB 19499642 ON nid00476 CANCELLED AT 2021-04-01T15:08:20 DUE TO TIME LIMIT ***
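The same approach works if your application is written in Python instead of C: the standard signal module can register a handler for SIGTERM. The sketch below is a minimal, hypothetical equivalent of the C program above (it is not part of the example job) and would be launched with srun in the same way:

import signal
import time

done = False

def handle_term(signum, frame):
    # Record that SIGTERM (signal 15) arrived so the main loop can exit.
    global done
    done = True
    print("Term signal received.... doing clean termination tasks....", flush=True)

# Register the handler for SIGTERM, the signal requested with --signal=15@25.
signal.signal(signal.SIGTERM, handle_term)

while not done:
    print("Still running...", flush=True)
    time.sleep(10)   # keep each sleep short so the flag is checked promptly

print("done.")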
Upcoming Training
Title: Best Practices for Distributed Deep Learning on Ibex GPUs
Date: April 11th, 1pm - 4pm
With the increasing size and complexity of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from a few tens of hours to several days. By exploiting the data parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on a single node or on multiple nodes of Ibex. We discuss and demonstrate the use of Horovod, a scalable distributed training framework, to train DL models on multiple GPUs on Ibex. Horovod integrates with TensorFlow 1 & 2, PyTorch and MXNet. We also discuss some caveats to look out for when using Horovod for large mini-batch training.
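As a preview of the kind of material covered, the sketch below shows the typical Horovod wiring for a PyTorch training loop. It is only an illustrative sketch, not the workshop's actual example: the model, data and hyperparameters are placeholders, and such a script is normally launched with one process per GPU (for example via horovodrun or srun).

import horovod.torch as hvd
import torch
import torch.nn as nn

hvd.init()                                # one Horovod process per GPU
torch.cuda.set_device(hvd.local_rank())   # bind this process to its local GPU

# Placeholder model and optimizer; a real training script defines its own.
model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale LR with worker count

# Average gradients across all workers at every optimization step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from the same initial state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    # Each worker would normally read its own shard of the dataset; random data here.
    x = torch.randn(32, 1024).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

if hvd.rank() == 0:
    print("finished training on", hvd.size(), "GPUs")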
Follow us on Twitter
Follow all the latest news on HPC within the Supercomputing Lab and at KAUST on Twitter: @KAUST_HPC.
Previous Announcements
http://www.hpc.kaust.edu.sa/announcements/