KAUST Supercomputing Laboratory Newsletter 8th April

In this newsletter:

  • Data Centre and Shaheen Maintenance, 22-29 April 2021
  • RCAC meeting
  • Upcoming training
  • Tip of the week: Distributed Copy
  • Follow us on Twitter
  • Previous Announcements
  • Previous Tips

 

Data Centre and Shaheen Maintenance, 22-29 April 2021

Our next maintenance session on Shaheen will take place from 15:30 on the 22nd April until 17:00 on the 29th April. The data centre team will be performing their annual PPM on the power supply equipment. At the same time, we will upgrade the software version and firmware on Shaheen and Neser's existing Lustre (project and scratch) filesystem. This is an essential step before bringing our newly acquired filesystem online and providing more project storage space. As we are upgrading the Lustre filesystem, there will be no access to data during this period.

Please contact us at help@hpc.kaust.edu.sa should you have any concerns or questions.

 

RCAC meeting

The project submission deadline for the next RCAC meeting is 30th April 2021. Please note that RCAC meetings are held once per month. Projects received on or before the submission deadline will be included in the agenda of the subsequent RCAC meeting. The detailed procedures, updated templates and forms are available here: https://www.hpc.kaust.edu.sa/account-applications

 

Upcoming training

Title: Best Practices for Distributed Deep Learning on Ibex GPUs

Date: April 11th, 1pm - 4pm

Registration Link

With the increasing size and complexity of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from a few tens of hours to several days. By exploiting the data parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on a single node or on multiple nodes of Ibex. We discuss and demonstrate the use of Horovod, a scalable distributed training framework, to train DL models on multiple GPUs on Ibex. Horovod integrates with TensorFlow 1 & 2, PyTorch and MXNet. We also discuss some caveats to look out for when using Horovod for large mini-batch training.
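
For context, the sketch below shows the general pattern Horovod follows for data-parallel training with PyTorch. The toy model, dataset and learning rate are illustrative placeholders, not the workshop material:

# Minimal Horovod/PyTorch data-parallel sketch (one MPI process per GPU).
# The model, dataset and hyperparameters below are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin this process to its local GPU

dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())  # shard data across workers
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = nn.Linear(32, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale learning rate with worker count

# Average gradients across workers with allreduce and start all workers
# from identical model and optimizer state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.MSELoss()
for epoch in range(2):
    sampler.set_epoch(epoch)                 # reshuffle the shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")

On a Slurm system such as Ibex, a script like this is typically launched with one task per GPU (for example via srun or horovodrun); the Ibex-specific job setup will be covered in the session.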

 

Tip of the week: Distributed Copy

dcp, or distributed copy, is an MPI-based copy tool developed at Lawrence Livermore National Laboratory (LLNL) as part of the mpifileutils suite. We have installed it on Shaheen. Here is an example jobscript to launch a data-moving job with dcp:

#!/bin/bash
#SBATCH --ntasks=4              # number of MPI processes used by dcp
#SBATCH --time=01:00:00         # wall-clock time limit
#SBATCH --hint=nomultithread    # one process per physical core

module load mpifileutils
time srun -n ${SLURM_NTASKS} dcp --verbose --progress 60 --preserve /path/to/source/directory /path/to/destination/directory

The above script launches dcp in parallel with 4 MPI processes.

--progress 60 means that the progress of the operation will be reported every 60 seconds.
--preserve means that the ACL permissions, group ownership, timestamps and extended attributes of the source files will be preserved on the copied files in the destination directory.

This tip is reprinted from this website where you can also find additional details on the topic.

 

Follow us on Twitter

Follow all the latest news on HPC within the Supercomputing Lab and at KAUST, on Twitter @KAUST_HPC.

Previous Announcements

http://www.hpc.kaust.edu.sa/announcements/

Previous Tips

http://www.hpc.kaust.edu.sa/tip/