Welcome to KAUST Supercomputing Laboratory


KAUST Supercomputing Lab (KSL)'s mission is to inspire and enable scientific, economic and social advances through the development and application of HPC solutions, through collaboration with KAUST researchers and partners, and through the provision of world-class computational systems and services.

  • Offering world-class HPC and data resources in a fashion that stimulates research and development.
  • Assisting KAUST researchers and partners to exploit the HPC resources at KAUST with a combination of training, consultation and collaboration.
  • Collaborating with KAUST researchers in the joint development of HPC solutions that advance scientific knowledge in disciplines strategic to the KAUST mission.
  • Growing HPC capability at KSL over time to meet the future needs of the KAUST community.

 


 

  • In this newsletter:

    • Data Centre and Shaheen Maintenance, 22-29 April 2021
    • Saudi HPC/AI Conference 2021
    • RCAC meeting
    • Tip of the week: Catching a SLURM job's termination signal in advance
    • Upcoming Training
    • Follow us on Twitter
    • Previous Announcements
    • Previous Tips

     

    Data Centre and Shaheen Maintenance, 22-29 April 2021

    Our next maintenance session on Shaheen will take place from 15:30 on 22nd April until 17:00 on 29th April. The data centre team will be performing their annual planned preventive maintenance (PPM) on the power supply equipment. At the same time, we will upgrade the software and firmware of the existing Lustre (project and scratch) filesystems used by Shaheen and Neser. This is an essential step before bringing our newly acquired filesystem online and providing more project storage space. As we are upgrading the Lustre filesystem, there will be no access to data during this period.

    Please contact us at help@hpc.kaust.edu.sa should you have any concerns or questions.

     

    Saudi HPC/AI Conference 2021

    It is our pleasure to invite you to the Annual Saudi HPC/AI Conference 2021.
    The conference is organized by a committee that includes representatives from many global and local HPC/AI players and will take place virtually on April 7-8, 2021.
    This is the 10th edition of the conference. Throughout its editions, the conference has been the premier regional HPC/AI event, where local and international HPC/AI users, experts, and technology providers gather to network and exchange ideas, experiences, and business opportunities.
    Please use the following link for registration.
    We look forward to welcoming you to the Saudi HPC/AI Conference 2021.
     

    RCAC meeting

    The project submission deadline for the next RCAC meeting is 30th April 2021. Please note that RCAC meetings are held once per month; projects received on or before the submission deadline will be included in the agenda of the subsequent meeting. The detailed procedures, updated templates and forms are available here: https://www.hpc.kaust.edu.sa/account-applications

     

    Tip of the week: Catching a SLURM job's termination signal in advance

    Did you know that you can ask SLURM to send your job a warning signal shortly before it terminates it? To do so, add the following directive to the header of your job script:

    #SBATCH --signal=15@25
    

    In 15@25, 15 is the signal number (SIGTERM) and 25 is the amount of warning in seconds: twenty-five seconds before the job reaches its time limit, the signal is issued on the first node of the job, where it can be captured by your C program, as shown in this code example:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    
    #define ITERATION_TIME 10000   /* seconds to sleep per work iteration */
    
    /* Set to 1 by the signal handler to request a clean shutdown. */
    volatile sig_atomic_t done = 0;
    
    /* Handler invoked when SLURM delivers SIGTERM (signal 15).
       Note: printf() is not async-signal-safe; it is used here only to
       keep the example short. */
    void term(int signum)
    {
        done = 1;
    
        printf("\n\nTerm signal received.... \n ");
        printf("\n\nDoing clean termination tasks.... \n ");
        fflush(stdout);
    }
    
    int main(int argc, char *argv[])
    {
        /* Register term() as the handler for SIGTERM. */
        struct sigaction action;
        memset(&action, 0, sizeof(struct sigaction));
        action.sa_handler = term;
        sigaction(SIGTERM, &action, NULL);
    
        /* Main work loop: sleep() returns early when a signal arrives,
           so the loop exits promptly once done is set. */
        while (!done)
        {
            printf("\nStill running...");
            fflush(stdout);
            sleep(ITERATION_TIME);
        }
    
        printf("\ndone.");
        return 0;
    }
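
    To try this example yourself, compile it before submitting the job. On Shaheen the Cray compiler wrapper cc can be used, for instance as below (my_prog simply matches the name used in the job script; any C compiler works equally well):

    cc -O2 -o my_prog my_prog.c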
    

    In the job script, add the --signal line, and call your program as usual.

    #!/bin/bash
    #SBATCH --time=00:02:00
    #SBATCH --job-name=prog
    #SBATCH --output=job.out
    #SBATCH --ntasks=1
    # Ask SLURM to send SIGTERM (15) 25 seconds before the time limit.
    #SBATCH --signal=15@25
    
    srun -n 1 ./my_prog
    
    # Keep the job busy until the time limit is reached.
    for i in 1 2 3 4 5 6 7 8 9 10; do
       date
       echo waiting to be killed....
       sleep 10
    done
    
    exit 0

    When this job is submitted, the TERM signal is correctly trapped by the C program, which terminates cleanly, as the example output below shows:

    slurmstepd: error: *** STEP 19499642.0 ON nid00476 CANCELLED AT 2021-04-01T15:07:20 ***
    Still running...
    
    Term signal received.... 
    
    Doing clean termination tasks.... 
     
    done.Thu Apr  1 15:07:21 +03 2021
    waiting to be killed....
    Thu Apr  1 15:07:31 +03 2021
    waiting to be killed....
    Thu Apr  1 15:07:41 +03 2021
    waiting to be killed....
    Thu Apr  1 15:07:51 +03 2021
    waiting to be killed....
    Thu Apr  1 15:08:01 +03 2021
    waiting to be killed....
    Thu Apr  1 15:08:11 +03 2021
    waiting to be killed....
    slurmstepd: error: *** JOB 19499642 ON nid00476 CANCELLED AT 2021-04-01T15:08:20 DUE TO TIME LIMIT ***
    

     

    Upcoming training

    Title: Best Practices for Distributed Deep Learning on Ibex GPUs

    Date: April 11th, 1pm - 4pm

    Registration Link

    With the increasing size and complexity of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from a few tens of hours to several days. By exploiting the data parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on a single node or on multiple nodes of Ibex. We discuss and demonstrate the use of Horovod, a scalable distributed training framework, to train DL models on multiple GPUs on Ibex. Horovod integrates with TensorFlow 1 & 2, PyTorch and MXNet. We also discuss some caveats to watch out for when using Horovod for large mini-batch training.
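
    For reference, below is a minimal sketch of what a Horovod submission script on Ibex might look like; the node and GPU counts, the environment setup and the train.py script are illustrative assumptions only, and the session will cover the actual setup in detail:

    #!/bin/bash
    # Assumptions for illustration: two nodes with four GPUs each,
    # one Horovod rank (task) per GPU.
    #SBATCH --job-name=hvd_train
    #SBATCH --time=01:00:00
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --gres=gpu:4
    #SBATCH --cpus-per-task=6
    
    # Load or activate whichever module/conda environment provides Horovod
    # together with TensorFlow or PyTorch on Ibex (names vary).
    
    # train.py is a hypothetical Horovod-enabled training script;
    # srun starts one process per GPU across the allocated nodes.
    srun -n ${SLURM_NTASKS} python train.py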

     

    Follow us on Twitter

    Follow all the latest news on HPC within the Supercomputing Laboratory and at KAUST on Twitter: @KAUST_HPC.

    Previous Announcements

    http://www.hpc.kaust.edu.sa/announcements/

    Previous Tips

    http://www.hpc.kaust.edu.sa/tip/