Welcome to KAUST Supercomputing Laboratory

Shaheen II Node Status

Allocated: 3378
Idle: 2792
Other: 4

Total: 6174
Updated 15 April 2021 07:22

KAUST Supercomputing Lab (KSL)'s mission is to inspire and enable scientific, economic and social advances through the development and application of HPC solutions, through collaboration with KAUST researchers and partners, and through the provision of world-class computational systems and services.

  • Offering world-class HPC and data resources in a fashion that stimulates research and development.
  • Assisting KAUST researchers and partners to exploit the HPC resources at KAUST with a combination of training, consultation and collaboration.
  • Collaborating with KAUST researchers in the joint development of HPC solutions that advance scientific knowledge in disciplines strategic to the KAUST mission.
  • Growing HPC capability at KSL over time to meet the future needs of the KAUST community.

  • In this newsletter:

    • Data Centre and Shaheen Maintenance, 22-29 April 2021
    • RCAC meeting
    • Upcoming training
    • Tip of the week: Distributed Copy
    • Follow us on Twitter
    • Previous Announcements
    • Previous Tips

     

    Data Centre and Shaheen Maintenance, 22-29 April 2021

    Our next maintenance session on Shaheen will take place from 15:30 on 22nd April until 17:00 on 29th April. The data centre team will be performing their annual PPM on the power supply equipment. At the same time, we will upgrade the software and firmware of the existing Lustre (project and scratch) filesystem on Shaheen and Neser. This is an essential step before bringing our newly acquired filesystem online and providing more project storage space. As we are upgrading the Lustre filesystem, there will be no access to data during this period.

    Please contact us at help@hpc.kaust.edu.sa should you have any concerns or questions.

     

    RCAC meeting

    The project submission deadline for the next RCAC meeting is 30th April 2021. Please note that RCAC meetings are held once per month; projects received on or before the submission deadline will be included in the agenda for the subsequent RCAC meeting. The detailed procedures, updated templates and forms are available here: https://www.hpc.kaust.edu.sa/account-applications

     

    Upcoming training

    Title: Best Practices for Distributed Deep Learning on Ibex GPUs

    Date: April 11th, 1pm - 4pm

    Registration Link

    With the increasing size and complexity of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from a few tens of hours to several days. By exploiting the data parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on a single node or multiple nodes of Ibex. We discuss and demonstrate the use of Horovod, a scalable distributed training framework, to train DL models on multiple GPUs on Ibex. Horovod integrates with TensorFlow 1 & 2, PyTorch and MXNet. We also discuss some caveats to look out for when using Horovod for large mini-batch training.
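
    As a taste of what the session covers, here is a minimal sketch of a jobscript that launches a Horovod-enabled training script across 4 GPUs on a single node. The GPU request syntax, the module name and the script train.py are placeholders for illustration, not the actual course material.

    #!/bin/bash
    #SBATCH --nodes=1                # single node
    #SBATCH --ntasks=4               # one MPI process per GPU
    #SBATCH --gres=gpu:4             # request 4 GPUs (exact syntax may differ on Ibex)
    #SBATCH --cpus-per-task=6        # cores for each process's data-loading pipeline
    #SBATCH --time=04:00:00

    # Hypothetical module name; check 'module avail' on Ibex for a Horovod build
    module load horovod

    # srun starts one process per task; Horovod handles rank discovery and
    # the allreduce of gradients across the 4 GPUs during training.
    srun -n ${SLURM_NTASKS} python train.py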

     

    Tip of the week: Distributed Copy

    dcp, or distributed copy, is an MPI-based copy tool developed by Lawrence Livermore National Lab (LLNL) as part of their mpifileutils suite. We have installed it on Shaheen. Here is an example jobscript to launch a data moving job with dcp:

    #!/bin/bash
    #SBATCH --ntasks=4               # number of MPI processes performing the copy
    #SBATCH --time=01:00:00
    #SBATCH --hint=nomultithread     # one process per physical core

    # Load the LLNL mpifileutils suite, which provides dcp
    module load mpifileutils

    # Copy the source directory to the destination in parallel
    time srun -n ${SLURM_NTASKS} dcp --verbose --progress 60 --preserve /path/to/source/directory /path/to/destination/directory

    The above script launches dcp in parallel with 4 MPI processes.

    --progress 60 means that the progress of the operation will be reported every 60 seconds.
    --preserve means that ACL permissions, group ownership, timestamps and extended attributes on the files in the destination directory will be preserved as they were in the source directory.
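
    To check that the destination matches the source after the copy, the same mpifileutils suite provides dcmp, which compares two directory trees in parallel. A minimal sketch, appended to the same jobscript (an illustration rather than a prescription):

    # Compare source and destination trees in parallel after the copy
    srun -n ${SLURM_NTASKS} dcmp /path/to/source/directory /path/to/destination/directory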

    This tip is reprinted from this website where you can also find additional details on the topic.

     

    Follow us on Twitter

    Follow all the latest news on HPC within the Supercomputing Lab and at KAUST, on Twitter @KAUST_HPC.

    Previous Announcements

    http://www.hpc.kaust.edu.sa/announcements/

    Previous Tips

    http://www.hpc.kaust.edu.sa/tip/