GPU Utilisation Monitoring

To help users get the best performance from their GPUs, a tool has been set up to provide summary information on completion of each job. The toolset used to monitor jobs is the NVIDIA Data Center GPU Manager (DCGM).

DCGM Reports

At the conclusion of each job on Ibex that uses a GPU, DCGM will generate a report similar to the following.

Successfully retrieved statistics for job: 123456789. 
+------------------------------------------------------------------------------+
| GPU ID: 0                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                         | Sun Feb  7 13:30:26 2021                |
| End Time                           | Sun Feb  7 13:42:50 2021                |
| Total Execution Time (sec)         | 743.77                                  |
| No. of Processes                   | 1                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 104146                                  |
| Power Usage (Watts)                | Avg: 140.29, Max: 205.437, Min: 26.073  |
| Max GPU Memory Used (bytes)        | 12407799808                             |
| SM Clock (MHz)                     | Avg: 1206, Max: 1328, Min: 405          |
| Memory Clock (MHz)                 | Avg: 715, Max: 715, Min: 715            |
| SM Utilization (%)                 | Avg: 71, Max: 100, Min: 0               |
| Memory Utilization (%)             | Avg: 34, Max: 52, Min: 0                |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Single Bit ECC Errors              | 0                                       |
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | Not Supported                           |
|        - Board Limit (%)           | Not Supported                           |
|        - Low Utilization (%)       | Not Supported                           |
|        - Sync Boost (%)            | 0                                       |
+--  Compute Process Utilization  ---+-----------------------------------------+
| PID                                | 1054                                    |
|     Avg SM Utilization (%)         | 4                                       |
|     Avg Memory Utilization (%)     | 1                                       |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

The generated report is saved in a dcgm/ directory that is automatically created by Slurm inside the job's launch directory. If a job used multiple GPUs, then the DCGM report will contain separate statistics for each of the GPUs. Statistics for multiple GPUs on a single node will be combined into a single file. For example, a DCGM report for a job that used two GPUs on one node would look as follows.

Successfully retrieved statistics for job: 123456789. 
+------------------------------------------------------------------------------+
| GPU ID: 0                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                         | Sun Feb  7 13:30:26 2021                |
| End Time                           | Sun Feb  7 13:42:50 2021                |
| Total Execution Time (sec)         | 743.77                                  |
| No. of Processes                   | 1                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 104146                                  |
| Power Usage (Watts)                | Avg: 140.29, Max: 205.437, Min: 26.073  |
| Max GPU Memory Used (bytes)        | 12407799808                             |
| SM Clock (MHz)                     | Avg: 1206, Max: 1328, Min: 405          |
| Memory Clock (MHz)                 | Avg: 715, Max: 715, Min: 715            |
| SM Utilization (%)                 | Avg: 93, Max: 100, Min: 0               |
| Memory Utilization (%)             | Avg: 45, Max: 52, Min: 0                |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Single Bit ECC Errors              | 0                                       |
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | Not Supported                           |
|        - Board Limit (%)           | Not Supported                           |
|        - Low Utilization (%)       | Not Supported                           |
|        - Sync Boost (%)            | 0                                       |
+--  Compute Process Utilization  ---+-----------------------------------------+
| PID                                | 1054                                    |
|     Avg SM Utilization (%)         | 4                                       |
|     Avg Memory Utilization (%)     | 1                                       |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

+------------------------------------------------------------------------------+
| GPU ID: 1                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                         | Sun Feb  7 13:30:26 2021                |
| End Time                           | Sun Feb  7 13:42:50 2021                |
| Total Execution Time (sec)         | 743.77                                  |
| No. of Processes                   | 1                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 97860                                   |
| Power Usage (Watts)                | Avg: 134.28, Max: 222.524, Min: 26.315  |
| Max GPU Memory Used (bytes)        | 12290359296                             |
| SM Clock (MHz)                     | Avg: 1174, Max: 1328, Min: 405          |
| Memory Clock (MHz)                 | Avg: 715, Max: 715, Min: 715            |
| SM Utilization (%)                 | Avg: 93, Max: 100, Min: 0               |
| Memory Utilization (%)             | Avg: 45, Max: 51, Min: 0                |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Single Bit ECC Errors              | 0                                       |
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | Not Supported                           |
|        - Board Limit (%)           | Not Supported                           |
|        - Low Utilization (%)       | Not Supported                           |
|        - Sync Boost (%)            | 0                                       |
+--  Compute Process Utilization  ---+-----------------------------------------+
| PID                                | 1055                                    |
|     Avg SM Utilization (%)         | 4                                       |
|     Avg Memory Utilization (%)     | 1                                       |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

If a job used multiple GPUs spread across multiple nodes, then DCGM will generate a separate report, similar to the above, for each node. For example, for a four-node job with two GPUs per node, DCGM would generate four separate reports, and each report would include separate statistics for each of the two GPUs on that node.
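Because a multi-GPU report repeats the same table once per GPU, it can be convenient to split a report into per-GPU sections before comparing them. The following is a minimal Python sketch, not an official Ibex or DCGM tool: the function name split_by_gpu is hypothetical, and it assumes only that the report text has the layout shown in the examples above.

```python
import re


def split_by_gpu(report_text):
    """Return {gpu_id: {row label: value string}} for one DCGM report.

    Assumes the table layout shown in the example reports: a
    "| GPU ID: N |" header line followed by "| label | value |" rows.
    """
    gpus = {}
    current = None
    for line in report_text.splitlines():
        header = re.match(r"\|\s*GPU ID:\s*(\d+)\s*\|", line)
        if header:
            # Start collecting rows for a new GPU section.
            current = gpus.setdefault(int(header.group(1)), {})
            continue
        # Data rows start with "| "; separator rows start with "+" or "|-".
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if current is not None and line.startswith("| ") and len(cells) == 2:
            current[cells[0]] = cells[1]
    return gpus
```

Reading one of the files in the job's dcgm/ directory and passing its contents to split_by_gpu would give one dictionary per GPU on that node, so that, for instance, the "SM Utilization (%)" rows of GPU 0 and GPU 1 can be compared side by side. The names of the files inside dcgm/ are not specified here, so the sketch deliberately takes the report text rather than a path.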

Interpreting DCGM reports

While the DCGM reports contain a lot of useful information, the most important information for most Ibex users will be the following three lines:

  • the maximum amount of GPU memory allocated, in bytes (Max GPU Memory Used (bytes)),
  • statistics on the GPU compute utilization (SM Utilization (%)),
  • statistics on the GPU memory utilization (Memory Utilization (%)).

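A job that is making good use of the GPU should have high average compute and memory utilization rates. For example, the report below shows an average SM utilization of 93% and an average memory utilization of 45%.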

+------------------------------------------------------------------------------+
| GPU ID: 0                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                         | Sun Feb  7 13:30:26 2021                |
| End Time                           | Sun Feb  7 13:42:50 2021                |
| Total Execution Time (sec)         | 743.77                                  |
| No. of Processes                   | 1                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 104146                                  |
| Power Usage (Watts)                | Avg: 140.29, Max: 205.437, Min: 26.073  |
| Max GPU Memory Used (bytes)        | 12407799808                             |
| SM Clock (MHz)                     | Avg: 1206, Max: 1328, Min: 405          |
| Memory Clock (MHz)                 | Avg: 715, Max: 715, Min: 715            |
| SM Utilization (%)                 | Avg: 93, Max: 100, Min: 0               |
| Memory Utilization (%)             | Avg: 45, Max: 52, Min: 0                |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Single Bit ECC Errors              | 0                                       |
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | Not Supported                           |
|        - Board Limit (%)           | Not Supported                           |
|        - Low Utilization (%)       | Not Supported                           |
|        - Sync Boost (%)            | 0                                       |
+--  Compute Process Utilization  ---+-----------------------------------------+
| PID                                | 1054                                    |
|     Avg SM Utilization (%)         | 4                                       |
|     Avg Memory Utilization (%)     | 1                                       |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+
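The three lines highlighted above can also be pulled out of a report programmatically. The following Python sketch is illustrative only: the function names, the conversion of bytes to GiB, and the 50% threshold used to flag low utilization are assumptions made for this example, not values defined by DCGM or by Ibex. It handles a single-GPU report; for a multi-GPU report, values from later GPU sections would overwrite earlier ones.

```python
import re

GIB = 1024 ** 3  # bytes per GiB


def parse_stats(value):
    """Turn 'Avg: 93, Max: 100, Min: 0' into {'Avg': 93.0, 'Max': 100.0, 'Min': 0.0}."""
    return {k: float(v) for k, v in re.findall(r"(Avg|Max|Min):\s*([\d.]+)", value)}


def key_metrics(report_text):
    """Extract peak memory (GiB) and average SM/memory utilization (%)."""
    metrics = {}
    for line in report_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # separator or header line, not a "| label | value |" row
        label, value = cells
        if label == "Max GPU Memory Used (bytes)":
            metrics["max_mem_gib"] = int(value) / GIB
        elif label == "SM Utilization (%)":
            metrics["sm_avg"] = parse_stats(value)["Avg"]
        elif label == "Memory Utilization (%)":
            metrics["mem_avg"] = parse_stats(value)["Avg"]
    return metrics


def is_low_utilization(metrics, threshold=50.0):
    """Flag a job whose average SM utilization is below a hypothetical threshold."""
    return metrics["sm_avg"] < threshold
```

Applied to the report above, key_metrics would report an average SM utilization of 93 and an average memory utilization of 45, and the peak memory of 12407799808 bytes works out to roughly 11.6 GiB. What counts as "high" utilization depends on the workload, so the threshold should be treated as a starting point rather than a rule.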