Metrics via API
What are Metrics
Metrics are numeric measurements recorded over time and emitted from the Run:ai cluster. Typical metrics include utilization, allocation, and time measurements. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.
This document details the structure and purpose of the metrics emitted by Run:ai, so that customers can build custom dashboards or integrate metric data into other monitoring systems.
Run:ai provides metrics via the Run:ai Control-plane API. In the past, Run:ai provided metrics via direct access to an internal metrics store; that method is deprecated but is still documented here.
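For orientation, a metrics query is an ordinary authenticated HTTPS GET against the control plane. The following is a minimal sketch only: the environment variable names are invented for the example, the bearer token is assumed to have been created as described in the Run:ai API authentication documentation, and the exact endpoint paths and query parameters should be verified against the API reference for your control-plane version.

```python
# Minimal sketch of calling a Run:ai control-plane metrics endpoint.
# The environment variable names and this helper are illustrative only;
# endpoint paths and parameters must be checked against the API reference.
import os

import requests

BASE_URL = os.environ["RUNAI_BASE_URL"]    # e.g. https://<company>.run.ai
API_TOKEN = os.environ["RUNAI_API_TOKEN"]  # bearer token created per the API authentication docs


def get_metrics(path: str, params: dict) -> dict:
    """GET a metrics endpoint on the control plane and return the parsed JSON body."""
    response = requests.get(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Raising on non-2xx responses keeps metric-collection scripts from silently recording gaps in the time series.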
Metric Scopes
The Run:ai Control-plane API supports and aggregates metrics at the following levels (the corresponding endpoints are sketched after the table).
Level | Description |
---|---|
Cluster | A cluster is a set of node pools and nodes. Metrics are aggregated at the cluster level. |
Node Pool | Metrics are aggregated at the node pool level. |
Workload | Metrics are aggregated at the workload level. For some workloads (for example, distributed workloads), these metrics aggregate data from all worker pods. |
Pod | The basic execution unit. Metrics are reported for the individual pod. |
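Each scope corresponds to its own metrics endpoint. The path templates below are assumptions that mirror the pattern of the Run:ai API reference; the placeholder names such as {clusterUuid} are illustrative and should be checked against the reference for your control-plane version.

```python
# Assumed path templates, one per metric scope; confirm the exact paths and
# placeholder names against the Run:ai API reference for your version.
METRIC_ENDPOINT_TEMPLATES = {
    "Cluster":   "/api/v1/clusters/{clusterUuid}/metrics",
    "Node Pool": "/api/v1/clusters/{clusterUuid}/nodepools/{nodepoolName}/metrics",
    "Workload":  "/api/v1/workloads/{workloadId}/metrics",
    "Pod":       "/api/v1/workloads/{workloadId}/pods/{podId}/metrics",
}

# Example: build the workload-scope path for a given workload ID.
workload_metrics_path = METRIC_ENDPOINT_TEMPLATES["Workload"].format(
    workloadId="<workload-uuid>"  # hypothetical placeholder
)
```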
Supported Metrics
Metric | Cluster | Node Pool | Workload | Pod |
---|---|---|---|---|
API | Cluster API | Node Pool API | Workload API | Pod API |
ALLOCATED_GPU | TRUE | TRUE | | |
AVG_WORKLOAD_WAIT_TIME | TRUE | TRUE | | |
CPU_LIMIT_CORES | | | TRUE | |
CPU_MEMORY_LIMIT_BYTES | | | TRUE | |
CPU_MEMORY_REQUEST_BYTES | | | TRUE | |
CPU_MEMORY_USAGE_BYTES | | | TRUE | TRUE |
CPU_MEMORY_UTILIZATION | | | TRUE | TRUE |
CPU_REQUEST_CORES | | | TRUE | |
CPU_USAGE_CORES | | | TRUE | TRUE |
CPU_UTILIZATION | | | TRUE | TRUE |
GPU_ALLOCATION | | | TRUE | |
GPU_MEMORY_REQUEST_BYTES | | | TRUE | |
GPU_MEMORY_USAGE_BYTES | | | TRUE | TRUE |
GPU_MEMORY_USAGE_BYTES_PER_GPU | | | | TRUE |
GPU_MEMORY_UTILIZATION | | | TRUE | TRUE |
GPU_QUOTA | TRUE | TRUE | | |
GPU_UTILIZATION | TRUE | TRUE | TRUE | TRUE |
GPU_UTILIZATION_PER_GPU | | | | TRUE |
POD_COUNT | | | TRUE | |
RUNNING_POD_COUNT | | | TRUE | |
TOTAL_GPU | TRUE | TRUE | | |
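As an illustration of requesting the metric types above, the sketch below queries the Workload scope for GPU_UTILIZATION and CPU_USAGE_CORES over the last hour. It is a sketch only: the query parameter names (metricType, start, end, numberOfSamples) and the response layout shown in the comments are assumptions to confirm against the Workload API reference.

```python
# Illustrative query for two workload-scope metric types over the last hour.
# Parameter names and response layout are assumptions; verify against the
# Workload metrics API reference for your control-plane version.
import os
from datetime import datetime, timedelta, timezone

import requests

BASE_URL = os.environ["RUNAI_BASE_URL"]
API_TOKEN = os.environ["RUNAI_API_TOKEN"]
WORKLOAD_ID = "<workload-uuid>"  # hypothetical placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

response = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/metrics",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={
        "metricType": ["GPU_UTILIZATION", "CPU_USAGE_CORES"],  # sent as a repeated query parameter
        "start": start.isoformat(),
        "end": end.isoformat(),
        "numberOfSamples": 20,
    },
    timeout=30,
)
response.raise_for_status()

# Assumed response shape: a list of measurements, one per metric type,
# each carrying timestamped samples.
for measurement in response.json().get("measurements", []):
    print(measurement.get("type"))
    for sample in measurement.get("values", []):
        print(" ", sample.get("timestamp"), sample.get("value"))
```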
Advanced Metrics
NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, please contact Run:ai customer support.
Metric | Cluster | Node Pool | Workload | Pod |
---|---|---|---|---|
GPU_FP16_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_FP32_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_FP64_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | | | | TRUE |
GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_SM_ACTIVITY_PER_GPU | | | | TRUE |
GPU_SM_OCCUPANCY_PER_GPU | | | | TRUE |
GPU_TENSOR_ACTIVITY_PER_GPU | | | | TRUE |
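Once enabled, the advanced metrics are retrieved like any other Pod-scope metric. The sketch below requests GPU_SM_ACTIVITY_PER_GPU purely as an example, under the same assumptions about endpoint paths and parameter names as the earlier sketches.

```python
# Illustrative request for an advanced per-GPU metric at the Pod scope.
# The path and parameter names are assumptions; verify against the Pod API reference.
import os

import requests

BASE_URL = os.environ["RUNAI_BASE_URL"]
API_TOKEN = os.environ["RUNAI_API_TOKEN"]
WORKLOAD_ID = "<workload-uuid>"  # hypothetical placeholders
POD_ID = "<pod-uuid>"

response = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/pods/{POD_ID}/metrics",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={
        "metricType": "GPU_SM_ACTIVITY_PER_GPU",
        "start": "2024-01-01T00:00:00Z",  # example time window
        "end": "2024-01-01T01:00:00Z",
        "numberOfSamples": 10,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```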