Frequently Asked Questions

  1. I am unable to access the clusters (Account Problems)
    1. I have used Dragon (now called Ibex) (or Shaheen, Noor, SMC, ...) in the past, OR I am a permanent (non-visiting) faculty member, staff member, or student
      • To verify that there isn't a problem with your KAUST portal username/password, try to log into another KAUST system, for example KAUST Webmail or the KAUST Portal. If you can log into those systems, then contact the Ibex sysadmins for help. If you are unable to log into any KAUST system, then wait 15 minutes and try again (in case your account was temporarily locked out due to password failures) or contact the IT Helpdesk for assistance.
    2. I am an external collaborator or visitor to KAUST.
      • If you are a first-time user AND you are new to KAUST, then there may be a problem with your KAUST Directory Entry. At the time of this writing, adding the necessary Unix attributes to the directory entries of visiting staff, students and researchers, as well as external collaborator accounts, is not automatic. You can check this by having your sponsor, or someone with an existing account on Ibex, run the command getent passwd username, substituting your KAUST portal username (see the example after this item). If that command returns no results, then you will need to contact the IT Helpdesk with your KAUST ID Card so they can add the necessary attributes to your directory entry.

        External collaborators accessing Ibex via the VPN will need both the attributes described above and the Ibex login node IP addresses added to their VPN access. Contact the Ibex System Administrators for an updated list of those IP addresses.
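
        For example, a colleague with an Ibex account could run the check below (a minimal sketch; "username" is a placeholder for your KAUST portal username, and the sample output only illustrates the passwd-style format):

        # A populated directory entry returns a single passwd-style line;
        # no output at all means the Unix attributes are missing.
        $ getent passwd username
        username:x:123456:123456:Your Name:/home/username:/bin/bash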

  2. Constraints and Features
    • Ibex makes heavy use of features and constraint flags to direct jobs onto the appropriate resources. Combined with GRES, this is a powerful and flexible way to provide a set of defaults that does the right thing for people who just want to run basic tasks and don't care about architecture, extra memory, accelerators, etc.
    • Below are some examples of how to request different resource configurations. The nodes are weighted such that the least valuable/rare node that can satisfy the request will be used. Be specific if you want a particular resource shape.
    1. To see the full list of node features:
      [hanksj@dbn-503-5-r:~]$ sinfo --partition=batch --format="%n %f"
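
      To count how many nodes share each feature set, a variation like the following can be used (a sketch; %D prints the number of nodes in each group):

      [hanksj@dbn-503-5-r:~]$ sinfo --partition=batch --format="%f %D"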
      
    2. Specific CPU architecture:

      The Intel nodes perform much better for floating-point operations, while the AMD nodes are more efficient at integer operations. A common approach to optimizing your workload is to send integer or floating-point work to the correct architecture. Each node has a feature, either intel or amd, for its architecture. To select one:

      # Intel
      [hanksj@dm511-17:~]$ srun --pty --time=1:00 --constraint=intel bash -l
      [hanksj@dbn711-08-l:~]$ grep vendor /proc/cpuinfo | head -1
      vendor_id	: GenuineIntel
      [hanksj@dbn711-08-l:~]$ 
      
      # To run on the SMC HP Intel nodes
      [hanksj@dm511-17:~]$ srun --pty --time=1:00 --constraint=mpi_intel bash -l
      
      # AMD
      [hanksj@dm511-17:~]$ srun --pty --time=1:00 --constraint=amd bash -l
      [hanksj@db809-12-5:~]$ grep vendor /proc/cpuinfo | head -1
      vendor_id	: AuthenticAMD
      [hanksj@db809-12-5:~]$ 
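
      The same constraint can be used in a batch script. A minimal sketch (the time limit and application command are placeholders):

      #!/bin/bash
      #SBATCH --time=1:00:00
      #SBATCH --constraint=amd    # or intel / mpi_intel, as above
      srun ./my_app               # placeholder for your actual application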
      
    3. Specific GPU or specific architecture:
      • There are two basic ways to ask for GPUs, plus a fallback when no nodes are available; a batch-script example follows this list.
        1. You want a specific count of a specific model of GPU
          • # Request 2 P100 GPUs.
            [hanksj@dm511-17:~]$ srun --pty --time=1:00 --gres=gpu:p100:2 bash -l
            [hanksj@dgpu703-01:~]$ nvidia-smi
        2. You want a specific count of any type of GPU
          • # Request 1 GPU of any kind
            [hanksj@dm511-17:~]$ srun --pty --time=1:00 --gres=gpu:1 bash -l
            [hanksj@dgpu502-01-r:~]$ nvidia-smi 
        3. If there are no nodes available, raise a ticket with the systems team to request a reservation for a specific node, clarifying the reasons and scope of work.
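
      A batch-script version of the GPU requests above might look like the following minimal sketch (the time limit and application command are placeholders):

      #!/bin/bash
      #SBATCH --time=1:00:00
      #SBATCH --gres=gpu:p100:2   # or --gres=gpu:1 for a single GPU of any kind
      srun nvidia-smi             # placeholder for your actual GPU application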
  3. How many types of nodes are available on the GPU cluster?
    1. p100
    2. p6000
    3. k6000
    4. k20
    5. k40
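
    To see which GPU types and counts are attached to each node, the GRES column of sinfo can be inspected (a sketch; %G prints the generic resources for each node group, and the partition flag is omitted here because the exact partition name may differ):

      [hanksj@dlogin-01:~]$ sinfo --format="%n %G"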
  4. Why should I set --time= in all jobs?
    Setting --time to the best possible estimate for your job accomplishes several important functions (accepted time formats are shown after this list):
    • Using the shortest time possible makes the job better suited to running as backfill, making it run sooner for you and increasing overall utilization of the resources.
    • When a future reservation is blocking nodes for maintenance or other purposes, specifying the shortest time possible can allow more jobs to run to completion before the reservation becomes active.
    • Forcing the inclusion of --time in all jobs reduces confusion resulting from job behavior under non-optimal default time limit settings.
    • Learning to estimate how long your applications will run makes you a better and more well-rounded person.
    • max_avail_time:

      You can use max_avail_time to submit a job whose time limit is the time remaining until the outage starts, minus 2 minutes to allow the job to start. Of course, this will only work if the job has enough priority to start right away. max_avail_time also caps the time request at 4 days to ensure the job is short enough to fall under any long-job restrictions.

      [hanksj@dlogin-01:~]$ srun --pty --time=$(max_avail_time) bash
      [hanksj@db812-02-8:~]$ 
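
    When setting --time explicitly, SLURM accepts several formats, for example minutes, hours:minutes:seconds, or days-hours:minutes:seconds (the values below are only illustrative estimates):

      # 30 minutes
      [hanksj@dlogin-01:~]$ srun --pty --time=30 bash
      # 2 hours
      [hanksj@dlogin-01:~]$ srun --pty --time=2:00:00 bash
      # 1 day and 12 hours
      [hanksj@dlogin-01:~]$ srun --pty --time=1-12:00:00 bash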
      		
  5. Why do I get the following locale error?
    Setting locale failed.
    Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = (unset)
    are supported and installed on your system.

    This is just a warning indicating that your locale is not fully defined, so the system is falling back to the standard locale. To avoid these messages, add the following lines to your .bashrc file (it should be in your home directory):

    export LANGUAGE=en_US.UTF-8
    export LC_ALL=en_US.UTF-8
    export LC_CTYPE=en_US.UTF-8
    export LANG=en_US.UTF-8

    Now either source your .bashrc file (type source ~/.bashrc) or log out and log back in for the changes to take effect.
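
    To confirm the settings took effect, you can run the locale command in a new shell (a quick check; LANG and each LC_* variable should now report en_US.UTF-8 rather than being unset):

    [hanksj@dlogin-01:~]$ locale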