KAUST Supercomputing Laboratory Newsletter 1st April

In this newsletter:

  • Data Centre and Shaheen Maintenance, 22-29 April 2021
  • Saudi HPC/AI Conference 2021
  • RCAC meeting
  • Tip of the week: Catching the termination of a SLURM job in advance
  • Upcoming Training
  • Follow us on Twitter
  • Previous Announcements
  • Previous Tips


Data Centre and Shaheen Maintenance, 22-29 April 2021

Our next maintenance session on Shaheen will take place from 15:30 on 22nd April until 17:00 on 29th April. The data centre team will be performing their annual PPM (planned preventive maintenance) on the power supply equipment. At the same time, we will upgrade the software version and firmware on Shaheen and Neser’s existing Lustre (project and scratch) filesystem. This is an essential step before bringing our newly acquired filesystem online and providing more project storage space. As we are upgrading the Lustre filesystem, there will be no access to data during this period.

Please contact us at help@hpc.kaust.edu.sa should you have any concerns or questions.


Saudi HPC/AI Conference 2021

It is our pleasure to invite you to the Annual Saudi HPC/AI Conference 2021.
The conference is organized by a committee that includes representatives from many global and local HPC/AI players and will take place virtually on April 7-8, 2021.
This is the 10th edition of the conference. Across all its editions, it has been the premier regional HPC/AI event, where local and international HPC/AI users, experts, and technology providers meet to exchange ideas, experiences, and business opportunities.
Please use the following link for registration.
We look forward to welcoming you to the Saudi HPC/AI Conference 2021.

RCAC meeting

The project submission deadline for the next RCAC meeting is 30th April 2021. Please note that RCAC meetings are held once per month; projects received on or before the submission deadline will be included in the agenda of the subsequent meeting. The detailed procedures, updated templates, and forms are available here: https://www.hpc.kaust.edu.sa/account-applications


Tip of the week: Catching the termination of a SLURM job in advance

Did you know that you can ask SLURM to signal your job a few seconds before terminating it? To do so, add the following directive to the header of your job script:

#SBATCH --signal=15@25
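For reference, the general form of this option in sbatch is --signal=[B:]<sig_num>[@sig_time], where sig_time defaults to 60 seconds and the B: prefix delivers the signal to the batch shell itself rather than to the job steps. Note that, due to the resolution of SLURM's event handling, the signal may be sent up to 60 seconds earlier than specified. A couple of illustrative variants:

```shell
# Send SIGTERM (15) to the job steps ~25 s before the time limit, as above:
#SBATCH --signal=15@25
# Send SIGUSR1 to the batch shell itself ~60 s before the time limit:
#SBATCH --signal=B:USR1@60
```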

Twenty-five seconds before the end of the job, signal 15 (SIGTERM) will be issued on the first node of the job, where it can be caught by your C program as shown in this code example:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define ITERATION_TIME 10000

volatile sig_atomic_t done = 0;

/* SIGTERM handler: set a flag so the main loop can finish cleanly. */
void term(int signum)
{
    done = 1;
    printf("\n\nTerm signal received.... \n ");
    printf("\n\nDoing clean termination tasks.... \n ");
}

int main(int argc, char *argv[])
{
    struct sigaction action;

    /* Install the handler for SIGTERM (signal 15). */
    memset(&action, 0, sizeof(struct sigaction));
    action.sa_handler = term;
    sigaction(SIGTERM, &action, NULL);

    while (!done) {
        printf("\nStill running...");
        fflush(stdout);
        sleep(ITERATION_TIME);  /* sleep() returns early when a signal arrives */
    }
    return 0;
}

In the job script, add the --signal line and call your program as usual:

#!/bin/bash
#SBATCH --time=00:02:00
#SBATCH --job-name=prog
#SBATCH --output=job.out
#SBATCH --ntasks=1
#SBATCH --signal=15@25

srun -n 1 ./my_prog

for i in 1 2 3 4 5 6 7 8 9 10; do
   date
   echo waiting to be killed....
   sleep 10
done

exit 0

When submitting this job, the TERM signal is correctly trapped by the C program, which terminates cleanly, as the example output below shows:

slurmstepd: error: *** STEP 19499642.0 ON nid00476 CANCELLED AT 2021-04-01T15:07:20 ***
Still running...

Term signal received.... 

Doing clean termination tasks.... 
done.Thu Apr  1 15:07:21 +03 2021
waiting to be killed....
Thu Apr  1 15:07:31 +03 2021
waiting to be killed....
Thu Apr  1 15:07:41 +03 2021
waiting to be killed....
Thu Apr  1 15:07:51 +03 2021
waiting to be killed....
Thu Apr  1 15:08:01 +03 2021
waiting to be killed....
Thu Apr  1 15:08:11 +03 2021
waiting to be killed....
slurmstepd: error: *** JOB 19499642 ON nid00476 CANCELLED AT 2021-04-01T15:08:20 DUE TO TIME LIMIT ***
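The same technique works for shell scripts: with the B: prefix, the batch shell itself receives the signal, where it can be caught with bash's trap builtin. A minimal sketch, assuming "#SBATCH --signal=B:USR1@60" in the real job; here a background kill stands in for SLURM so the script can be tried anywhere:

```shell
#!/bin/bash
# Sketch of trapping the pre-termination signal in the batch script itself.
finished=0
cleanup() {
    echo "USR1 received, saving state before the time limit..."
    finished=1                 # let the main loop exit cleanly
}
trap cleanup USR1

( sleep 1; kill -USR1 $$ ) &   # stand-in for SLURM sending the signal
while [ "$finished" -eq 0 ]; do
    sleep 0.2                  # the trap fires as soon as sleep returns
done
echo "clean exit"
```

In a real job, the cleanup function would copy results out of scratch or write a checkpoint before the time limit strikes.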


Upcoming Training

Title: Best Practices for Distributed Deep Learning on Ibex GPUs

Date: April 11th, 1pm - 4pm

Registration Link

With the increasing size and complexity of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from a few tens of hours to several days. By exploiting the data parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on a single node or multiple nodes of Ibex. We discuss and demonstrate the use of Horovod, a scalable distributed training framework, to train DL models on multiple GPUs on Ibex. Horovod integrates with TensorFlow 1 & 2, PyTorch, and MXNet. We also discuss some caveats to watch for when using Horovod for large mini-batch training.


Follow us on Twitter

Follow all the latest news on HPC within the Supercomputing Lab and at KAUST, on Twitter @KAUST_HPC.

Previous Announcements


Previous Tips