Announcements, 10th February 2015
Disruption to Shaheen earlier today
Earlier today, a fault occurred on both Shaheen front end nodes, rendering them completely unresponsive. To remedy the situation, both front end nodes were rebooted. We apologise for any disruption this may have caused. The root cause of the problem is still under investigation.
Maintenance Session Tuesday 17th February
The next maintenance session will be on Tuesday 17th February from 12:00 until 17:00. There will be no access to the system during this period.
Invitation to attend the second KAUST-NVIDIA workshop on "Accelerating Scientific Applications using GPUs" on February 17th, 2015
The Supercomputing Laboratory is pleased to announce the second KAUST-NVIDIA one-day workshop on accelerating scientific applications using GPUs, to be held on February 17th. Registration is free but required; please register using this link.
The event will be followed by a two-day GPU hack-a-thon in which selected teams of developers will be guided by OpenACC and CUDA mentors from NVIDIA and KAUST to port their domain science applications to GPU accelerators and speed them up. Space will be limited to 3-4 teams. More details will be communicated to you soon.
Please check the workshop webpage for more details and contact us at training@hpc.kaust.edu.sa if you need further information.
Tip of the Week: Getting your job through the queue faster
Submit shorter jobs. If your application has the capability to checkpoint and restart, consider submitting your job for shorter time periods. On a system like Shaheen or Neser, there are many opportunities for backfilling jobs. Backfill is a technique the scheduler uses to keep the system busy. If there is a large job at the top of the queue, the system will need to drain resources in order to schedule that job. While those resources are draining, short jobs can run on the idle nodes. Jobs that request short wall clock times are therefore good candidates for backfill. You can check the queue with the llq and llvis commands.
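To illustrate why short jobs benefit from backfill, here is a minimal Python sketch. It is a simplified model for illustration only, not LoadLeveler's actual scheduling algorithm; the function names, node counts and times are all hypothetical. It estimates when the large job at the top of the queue could start, and treats a waiting job as a backfill candidate if it fits in the currently free nodes and would finish before that time.

```python
def reservation_start(free_nodes, running, needed):
    """Minutes from now until `needed` nodes are free (simplified model).

    `running` is a list of (minutes_remaining, nodes) for running jobs,
    assuming each job runs for its full requested wall clock time.
    """
    if needed <= free_nodes:
        return 0
    available = free_nodes
    # Release nodes in the order in which running jobs are due to finish.
    for minutes_remaining, nodes in sorted(running):
        available += nodes
        if available >= needed:
            return minutes_remaining
    raise ValueError("top job needs more nodes than the system has")


def can_backfill(job_nodes, job_minutes, free_nodes, running, top_job_nodes):
    """True if the job can start now without delaying the top queued job."""
    start = reservation_start(free_nodes, running, top_job_nodes)
    return job_nodes <= free_nodes and job_minutes <= start


# Hypothetical state: 4 idle nodes, two running jobs, top job needs 12 nodes.
running = [(60, 8), (120, 4)]
print(can_backfill(2, 30, 4, running, 12))  # 30-minute job fits in the gap
print(can_backfill(2, 90, 4, running, 12))  # 90-minute job would delay the top job
```

In this model the top job can start in 60 minutes, so the 30-minute request is eligible for backfill while the 90-minute request has to wait.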
Make sure the wall clock time you request is accurate. As noted above, shorter jobs are easier to schedule. Many users request the largest possible wall clock time by default, which needlessly increases their waiting time, especially for runs that could finish in a few minutes.
Run jobs before scheduled system maintenance. The LoadLeveler queues must be drained of all jobs before a maintenance session, so the period leading up to it offers good turnaround for shorter jobs. Make sure that the wall clock time you request is shorter than the time remaining before the maintenance session starts.
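The check described above is simple arithmetic, sketched here in Python. The helper name and the submission time are hypothetical; the maintenance start of 12:00 on 17th February is taken from the announcement above.

```python
from datetime import datetime, timedelta


def fits_before_maintenance(now, maintenance_start, wall_clock_limit):
    """True if a job started now would end strictly before maintenance begins."""
    return now + wall_clock_limit < maintenance_start


maintenance_start = datetime(2015, 2, 17, 12, 0)
now = datetime(2015, 2, 17, 9, 0)  # hypothetical submission time

print(fits_before_maintenance(now, maintenance_start, timedelta(hours=2)))  # True
print(fits_before_maintenance(now, maintenance_start, timedelta(hours=4)))  # False
```

This assumes the job starts immediately; in practice, leave some margin for time spent waiting in the queue.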
Follow us on Twitter
Follow the latest HPC news from the Supercomputing Lab and KAUST on Twitter at @KAUST_HPC.