KAUST Supercomputing Laboratory Newsletter 4th March

In this newsletter:

  • Shaheen Maintenance: 23rd of March and end of April
  • RCAC meeting
  • Application License Server Maintenance by IT on Thursday, 4 March 2021, 8:00 PM to 4:00 AM 5 March
  • AI users survey
  • Tip of the week: Use “timeout” command to automatically restart jobs
  • Follow us on Twitter
  • Previous Announcements
  • Previous Tips

 

Shaheen Maintenance: 23rd of March and end of April

Our next maintenance session on Shaheen will take place on the 23rd of March 2021, from 8 AM to 5 PM. We plan to apply the latest patches and security updates, reboot the system and repair hardware. Access to the files and login nodes should remain possible during the outage.

We would also like to give you advance notice of a longer Shaheen outage towards the end of April. The datacentre team will be performing their annual PPM on the power supply equipment. At the same time, we will upgrade Shaheen's existing project and scratch filesystems. This is an essential step before bringing our newly acquired filesystem online and providing more project storage space. We estimate that the combined Shaheen outage will take around 4-6 days. We will communicate the details closer to the date. As always, please contact us at help@hpc.kaust.edu.sa should you have any concerns or questions.

 

RCAC meeting

The project submission deadline for the next RCAC meeting is 31 March 2021. Please note that RCAC meetings are held once per month. Projects received on or before the submission deadline will be included in the agenda for the subsequent RCAC meeting. The detailed procedures, updated templates and forms are available here: https://www.hpc.kaust.edu.sa/account-applications

 

Application License Server Maintenance by IT on Thursday, 4 March 2021, 8:00 PM to 4:00 AM 5 March

Due to scheduled maintenance of the Application License Server by IT on Thursday, 4 March 2021, from 8:00 PM to 4:00 AM the next day, access to the following applications will be affected on Shaheen and Neser:

Ansys, Atomistix ToolKit (ATK), Converge, Eclipse, Intel Compilers, Materials Studio, Mathematica, MATLAB, Tecplot and TotalView.

During this maintenance window, you may see failures when compiling with the Intel compilers and errors at runtime in the applications listed above.

 

AI users survey

As you are aware, the GPU resources at KSL are in high demand, especially as major conference deadlines approach. To manage the current resources better and to plan future expansions, we need to keep up with the changing demand of the workloads running on these GPUs. We have designed a GPU workload characterization user survey to capture the state of your workloads. Please spare some time to complete this survey. We also kindly request AI faculty members to circulate the survey link to their students, postdocs and research scientists. More responses will help us make better-informed decisions that serve you well in both the near and the far future.

 

Tip of the week: Use “timeout” command to automatically restart jobs

Sometimes a job cannot finish within the 24-hour time limit and has to be restarted again and again until the calculation completes. Instead of manually checking the job state and resubmitting the jobscript, we can use the Linux command “timeout” to restart jobs automatically.

For example, if you have a job that needs to run for more than 24 hours (assuming the job is regularly checkpointed and can be restarted simply by resubmitting it), you can prepare a Slurm jobscript “my_slurm_jobscript” like this:

#!/bin/bash
…
…
#SBATCH --partition=workq
#SBATCH --nodes=1
#SBATCH --time=24:00:00
…
…
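# Stop my_app after 23 hours, leaving time before the 24-hour Slurm limit.
timeout 23h srun --ntasks=32 my_app
# timeout exits with code 124 when it had to stop the command; resubmit in that case.
if [[ $? -eq 124 ]]; then
  sbatch my_slurm_jobscript
fi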

In this job, “my_app” will run for at most 23 hours. If “my_app” is stopped by the “timeout” command, the exit code “124” is returned and the jobscript “my_slurm_jobscript” is automatically submitted again. The jobscript “my_slurm_jobscript” is thus resubmitted repeatedly until the calculation is fully completed.
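The exit-code behaviour described above can be checked on any Linux machine without Slurm. The following is a minimal sketch that uses “sleep” as a stand-in for “my_app”; the variable names are chosen for illustration only.

```shell
#!/bin/bash
# Demonstrates the exit codes reported by the GNU coreutils "timeout" command.
# "sleep" stands in for a real application here.

# Case 1: the command outlives the limit, so timeout stops it and returns 124.
timeout 1s sleep 5
timed_out_status=$?

# Case 2: the command finishes in time, so its own exit code (0) is passed through.
timeout 5s sleep 1
finished_status=$?

echo "timed out: ${timed_out_status}, finished: ${finished_status}"
```

Only the first case would trigger the “sbatch” line in the jobscript, because a run that finishes on its own returns the exit code of the application rather than 124.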

Please note:
#1 The Slurm time limit (24 hours in this case) should be a little longer than the “timeout” time limit (23 hours in this case), so that “sbatch my_slurm_jobscript” has time to be executed.
#2 Run a test to make sure this works for your own case before using it widely in your production runs.
#3 You still need to monitor your jobs regularly to make sure everything is working fine, as not all exceptions can be covered in one simple script.
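One exception worth guarding against is a job that times out on every run and resubmits itself indefinitely. The sketch below shows one possible safeguard, a cap on the number of attempts. The function “should_resubmit” and the variables “ATTEMPT” and “MAX_ATTEMPTS” are hypothetical names introduced for this example; they are not Slurm built-ins.

```shell
#!/bin/bash
# Hypothetical helper: decide whether a timed-out job should be resubmitted.
# Arguments: <exit code of "timeout"> <current attempt number> <maximum attempts>
should_resubmit() {
  local exit_code=$1 attempt=$2 max_attempts=$3
  # Resubmit only when timeout stopped the run (124) and the cap is not reached.
  [[ ${exit_code} -eq 124 && ${attempt} -lt ${max_attempts} ]]
}

# ATTEMPT and MAX_ATTEMPTS are assumed names; default to a first attempt of ten.
ATTEMPT=${ATTEMPT:-1}
MAX_ATTEMPTS=${MAX_ATTEMPTS:-10}

# In a jobscript, this would follow the "timeout 23h srun ..." line, for example:
#   if should_resubmit $? "${ATTEMPT}" "${MAX_ATTEMPTS}"; then
#     sbatch --export=ALL,ATTEMPT=$((ATTEMPT + 1)) my_slurm_jobscript
#   fi
```

The commented lines pass the incremented counter to the next job through the “--export” option of “sbatch”, so each resubmission knows how many attempts came before it. As with the basic script, test this on a short job before relying on it.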

For more information about the “timeout” command, use “man timeout”.

 

Follow us on Twitter

Follow all the latest news on HPC within the Supercomputing Lab and at KAUST, on Twitter @KAUST_HPC.

Previous Announcements

http://www.hpc.kaust.edu.sa/announcements/

Previous Tips

http://www.hpc.kaust.edu.sa/tip/