Distributed Deep Learning on IBEX

The KAUST Supercomputing Core Lab invites you to join the Distributed Deep Learning Workshop on IBEX, a hands-on training designed to help users efficiently scale AI workloads across multiple GPUs and compute nodes using IBEX’s high-performance computing environment.

This workshop provides a practical introduction to the essential distributed training frameworks for accelerating model training on IBEX GPUs using data and model parallelism. We will focus on PyTorch Distributed Data Parallel (DDP), DeepSpeed, Fully Sharded Data Parallel (FSDP), and NVIDIA NeMo, and demonstrate how to scale from one to many GPUs on single and multiple nodes of IBEX.
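For orientation, below is a minimal single-node DDP sketch of the kind of code the hands-on sessions build on. It is illustrative only: the model, hyperparameters, and the torchrun launch line are placeholders, and the IBEX-specific launch recipe will be covered in the workshop.

    # Minimal PyTorch DistributedDataParallel (DDP) sketch: one process per GPU.
    # Illustrative launch (placeholder): torchrun --nproc_per_node=4 ddp_example.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

        for step in range(10):                                  # placeholder training loop
            x = torch.randn(32, 1024, device=local_rank)
            loss = model(x).sum()
            optimizer.zero_grad()
            loss.backward()        # gradients are all-reduced across ranks here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Each rank trains on its own shard of the data, and DDP synchronizes gradients during the backward pass so that all model replicas stay consistent.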

Register here: Distributed Deep Learning Workshop on IBEX  

Who should attend

  • Researchers working with ML and DL models
  • Data scientists and computational scientists
  • AI engineers working with GPU-intensive workloads
  • Anyone interested in scaling model training on HPC systems

Learning outcomes
After attending, participants will be able to:

  • Work with distributed training frameworks (DDP, DeepSpeed, FSDP, NVIDIA NeMo)
  • Launch and manage multi-GPU and multi-node jobs using SLURM on IBEX (see the sketch after this list)
  • Understand, through hands-on exercises, the scaling limitations of models and frameworks on multiple GPUs: using more compute resources does not always mean faster model training
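As a hedged illustration of the SLURM-to-PyTorch plumbing behind the second outcome above, the sketch below initializes torch.distributed from standard SLURM environment variables, assuming one task is launched per GPU (e.g. via srun). The helper name and port are invented for the example; the exact IBEX job-script recipe will be shown in the hands-on sessions.

    # Sketch: initialize torch.distributed from SLURM environment variables,
    # assuming the job was started with one task per GPU (e.g. via srun).
    import os
    import torch
    import torch.distributed as dist

    def init_from_slurm(master_addr, master_port="29500"):  # hypothetical helper
        rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
        world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
        local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node

        # torch.distributed discovers its peers through these two variables.
        os.environ.setdefault("MASTER_ADDR", master_addr)
        os.environ.setdefault("MASTER_PORT", master_port)

        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        return rank, world_size, local_rank

Here MASTER_ADDR would typically be the hostname of the first node in the allocation, for example the first line printed by scontrol show hostnames "$SLURM_NODELIST".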

Important Note on Workshop Scope

This workshop focuses on scaling and distributing existing deep learning workloads rather than teaching fundamental Python or neural network concepts. Attendees are expected to have prior familiarity with Python-based ML frameworks (e.g., PyTorch) and basic model training. The sessions will emphasize practical use of distributed training frameworks and performance optimization at scale on IBEX, not introductory model development.

Agenda

Day 1:

9:00 – 10:00 — Distributed Deep Learning Overview

10:00 – 10:15 — Coffee Break

10:15 – 12:00 — Hands-On Session: PyTorch Distributed Data Parallel

12:00 – 1:00 — Lunch Break

1:00 – 1:45 — Hands-On Session: DeepSpeed

1:45 – 2:00 — Coffee Break

2:00 – 3:00 — Hands-On Session: DeepSpeed

Day 2:

9:00 – 10:00 — Hands-On Session: Fully Sharded Data Parallel

10:00 – 10:15 — Coffee Break

10:15 – 12:00 — Hands-On Session: Fully Sharded Data Parallel

12:00 – 1:00 — Lunch Break

1:00 – 2:15 — Hands-On Session: NVIDIA NeMo

 

Register here: Distributed Deep Learning Workshop on IBEX  

For any questions, please contact: training@hpc.kaust.edu.sa
This opportunity is brought to you by the KAUST Core Labs – Supercomputing Core Lab.

2026-02-09 09:00 - 15:00
2026-02-10 09:00 - 15:00
Data Science
Mohsin Shaikh
