Metrics via API

What are Metrics

Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics involve utilization, allocation, time measurements and so on. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.

The purpose of this document is to detail the structure and purpose of metrics emitted by Run:ai to enable customers to create custom dashboards or integrate metric data into other monitoring systems.

Run:ai provides metrics via the Run:ai Control-plane API. In the past, Run:ai provided metrics information via direct access to an internal metrics store. This method is deprecated but is still documented here.

Metric Scopes

The Run:ai Control-plane API supports metrics at several levels and aggregates them at each level.

Level        Description
Cluster      A Cluster is a set of Node Pools and Nodes. Cluster metrics are aggregated at the Cluster level.
Node Pool    Metrics are aggregated at the Node Pool level.
Workload     Metrics are aggregated at the Workload level. For some Workloads, e.g. distributed workloads, these metrics aggregate data from all worker pods.
Pod          The basic execution unit.
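
As a rough illustration of how the four levels could map onto API requests, the sketch below builds a request URL for each scope. The base URL, the endpoint paths, and the query parameter names (`metricType`, `start`, `end`) are assumptions made for this example and are not taken from this page; check the Cluster, Node Pool, Workload and Pod API references for the actual routes and parameters.

```python
from urllib.parse import urlencode

# Hypothetical path templates, one per scope. They are placeholders for
# illustration; the real routes are defined in the API reference of each scope.
PATH_TEMPLATES = {
    "cluster": "/api/v1/clusters/{cluster_id}/metrics",
    "nodepool": "/api/v1/clusters/{cluster_id}/nodepools/{nodepool}/metrics",
    "workload": "/api/v1/workloads/{workload_id}/metrics",
    "pod": "/api/v1/workloads/{workload_id}/pods/{pod_id}/metrics",
}


def build_metrics_url(base_url, scope, metric_types, start, end, **ids):
    """Return a GET URL for the given scope (assumed routes and parameter names)."""
    if scope not in PATH_TEMPLATES:
        raise ValueError(f"unknown scope: {scope!r}")
    path = PATH_TEMPLATES[scope].format(**ids)
    # doseq=True repeats the metricType parameter once per requested metric.
    query = urlencode(
        {"metricType": list(metric_types), "start": start, "end": end},
        doseq=True,
    )
    return f"{base_url.rstrip('/')}{path}?{query}"


url = build_metrics_url(
    "https://runai.example.com",   # hypothetical control-plane address
    "workload",
    ["GPU_UTILIZATION"],           # listed for all four scopes in the table below
    start="2024-01-01T00:00:00Z",
    end="2024-01-01T01:00:00Z",
    workload_id="my-workload-id",  # placeholder identifier
)
print(url)
```

Keeping the scope-to-path mapping in a single dictionary means that a corrected route only has to be changed in one place once the real endpoints are confirmed.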

Supported Metrics

Metric Cluster Node Pool Workload Pod
API Cluster API Node Pool API Workload API Pod API
ALLOCATED_GPU TRUE TRUE
AVG_WORKLOAD_WAIT_TIME TRUE TRUE
CPU_LIMIT_CORES TRUE
CPU_MEMORY_LIMIT_BYTES TRUE
CPU_MEMORY_REQUEST_BYTES TRUE
CPU_MEMORY_USAGE_BYTES TRUE TRUE
CPU_MEMORY_UTILIZATION TRUE TRUE
CPU_REQUEST_CORES TRUE
CPU_USAGE_CORES TRUE TRUE
CPU_UTILIZATION TRUE TRUE
GPU_ALLOCATION TRUE
GPU_MEMORY_REQUEST_BYTES TRUE
GPU_MEMORY_USAGE_BYTES TRUE TRUE
GPU_MEMORY_USAGE_BYTES_PER_GPU TRUE
GPU_MEMORY_UTILIZATION TRUE TRUE
GPU_QUOTA TRUE TRUE
GPU_UTILIZATION TRUE TRUE TRUE TRUE
GPU_UTILIZATION_PER_GPU TRUE
POD_COUNT TRUE
RUNNING_POD_COUNT TRUE
TOTAL_GPU TRUE TRUE

Advanced Metrics

NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, contact Run:ai customer support.

Metric                                    Cluster  Node Pool  Workload  Pod
GPU_FP16_ENGINE_ACTIVITY_PER_GPU          -        -          -         TRUE
GPU_FP32_ENGINE_ACTIVITY_PER_GPU          -        -          -         TRUE
GPU_FP64_ENGINE_ACTIVITY_PER_GPU          -        -          -         TRUE
GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU      -        -          -         TRUE
GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU  -        -          -         TRUE
GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU     -        -          -         TRUE
GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU  -        -          -         TRUE
GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU       -        -          -         TRUE
GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU    -        -          -         TRUE
GPU_SM_ACTIVITY_PER_GPU                   -        -          -         TRUE
GPU_SM_OCCUPANCY_PER_GPU                  -        -          -         TRUE
GPU_TENSOR_ACTIVITY_PER_GPU               -        -          -         TRUE
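
Because the extended NVIDIA metrics are provided only at the Pod level, a client may want to reject them for other scopes before sending a request. The following is a minimal sketch of such a check; the function name and the scope strings are hypothetical, and only the metric names come from the table above.

```python
# Metric names copied from the Advanced Metrics table.
ADVANCED_METRICS = frozenset({
    "GPU_FP16_ENGINE_ACTIVITY_PER_GPU",
    "GPU_FP32_ENGINE_ACTIVITY_PER_GPU",
    "GPU_FP64_ENGINE_ACTIVITY_PER_GPU",
    "GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU",
    "GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU",
    "GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU",
    "GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU",
    "GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU",
    "GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU",
    "GPU_SM_ACTIVITY_PER_GPU",
    "GPU_SM_OCCUPANCY_PER_GPU",
    "GPU_TENSOR_ACTIVITY_PER_GPU",
})


def check_scope(metric, scope):
    """Raise ValueError if an advanced metric is requested outside the Pod scope."""
    if metric in ADVANCED_METRICS and scope != "pod":
        raise ValueError(
            f"{metric} is only available at the Pod level, not {scope!r}"
        )


check_scope("GPU_SM_ACTIVITY_PER_GPU", "pod")  # accepted, returns None

try:
    check_scope("GPU_SM_ACTIVITY_PER_GPU", "cluster")
except ValueError as err:
    print(err)
```

The check deliberately ignores metrics outside this set, since their supported scopes are given by the Supported Metrics table rather than by this rule.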