Metrics and telemetry

Telemetry

Telemetry is a numeric measurement that is emitted by the Run:ai cluster and recorded in real time.

Metrics

Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics involve utilization, allocation, time measurements and so on. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.
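The difference between the two definitions above can be sketched as data types: a telemetry reading is a single current value, while a metric is a series of timestamped samples. The class names and fields below are hypothetical and chosen only for illustration; they are not the response schema of the Run:ai API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TelemetryReading:
    """Hypothetical type: one real-time value, e.g. READY_GPUS."""
    name: str
    value: float


@dataclass(frozen=True)
class MetricSample:
    """Hypothetical type: one measurement at one point in time."""
    timestamp: datetime
    value: float


@dataclass(frozen=True)
class MetricSeries:
    """Hypothetical type: values recorded over time, e.g. GPU_UTILIZATION."""
    name: str
    samples: tuple

    def average(self) -> float:
        # Mean of all sample values in the series.
        return sum(s.value for s in self.samples) / len(self.samples)


# Illustrative values only, not real cluster data.
ready_gpus = TelemetryReading(name="READY_GPUS", value=8)
gpu_util = MetricSeries(
    name="GPU_UTILIZATION",
    samples=(
        MetricSample(datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc), 40.0),
        MetricSample(datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc), 60.0),
    ),
)
```

A dashboard would typically show a telemetry reading as a single number and a metric series as a chart over time.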


This document details the structure and purpose of the metrics emitted by Run:ai, so that customers can create custom dashboards or integrate metric data into other monitoring systems.

Run:ai provides metrics via the Run:ai Control-plane API. Previously, Run:ai provided metrics through direct access to an internal metrics store. That method is deprecated but is still documented here.
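As a rough illustration of pulling a metric through the Control-plane API, the sketch below only builds a request URL and makes no network call. The base URL, the endpoint path, and the query parameter names (`metricType`, `start`, `end`, `numberOfSamples`) are assumptions made for this example and must be checked against the Control-plane API reference for your version. The metric names come from the Supported Metrics table, which also determines whether a metric is available at the chosen scope.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed values for illustration only; confirm them in the API reference.
BASE_URL = "https://runai.example.com"  # hypothetical control-plane address
WORKLOAD_METRICS_PATH = "/api/v1/workloads/{workload_id}/metrics"  # assumed path


def build_metrics_url(workload_id, metric_types, start, end, samples=20):
    """Return a query URL for one or more metric types over [start, end]."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    query = urlencode(
        {
            "metricType": ",".join(metric_types),  # assumed parameter name
            "start": start.isoformat(),            # assumed parameter name
            "end": end.isoformat(),                # assumed parameter name
            "numberOfSamples": samples,            # assumed parameter name
        }
    )
    path = WORKLOAD_METRICS_PATH.format(workload_id=workload_id)
    return f"{BASE_URL}{path}?{query}"


url = build_metrics_url(
    "wl-123",  # hypothetical workload ID
    ["GPU_UTILIZATION", "GPU_MEMORY_USAGE_BYTES"],
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
)
print(url)
```

Keeping URL construction separate from the HTTP call makes it easy to reuse the same helper from a dashboard backend or an export job. Authentication is omitted here.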

Metric and telemetry scopes

The Run:ai Control-plane API supports and aggregates metrics at the following levels.

| Level | Description |
| --- | --- |
| Cluster | A cluster is a set of Node Pools and Nodes. Cluster metrics are aggregated at the Cluster level. |
| Node | Data is aggregated at the Node level. |
| Node Pool | Data is aggregated at the Node Pool level. |
| Workload | Data is aggregated at the Workload level. For some Workloads, such as distributed workloads, these metrics aggregate data from all worker pods. |
| Pod | The basic execution unit. |
| Project | The basic organizational unit. Projects are the tool for implementing resource allocation policies and for segregating different initiatives. |
| Department | A grouping of Projects. |

## Supported Metrics
Metric Cluster Node Pool Node Workload Pod Project Department
API Cluster API Node Pool API Workload API Pod API
ALLOCATED_GPU TRUE TRUE TRUE
AVG_WORKLOAD_WAIT_TIME TRUE TRUE
CPU_LIMIT_CORES TRUE
CPU_MEMORY_LIMIT_BYTES TRUE
CPU_MEMORY_REQUEST_BYTES TRUE
CPU_MEMORY_USAGE_BYTES TRUE TRUE TRUE
CPU_MEMORY_UTILIZATION TRUE TRUE TRUE
CPU_REQUEST_CORES TRUE
CPU_USAGE_CORES TRUE TRUE TRUE
CPU_UTILIZATION TRUE TRUE TRUE
GPU_ALLOCATION TRUE TRUE TRUE
GPU_MEMORY_REQUEST_BYTES TRUE
GPU_MEMORY_USAGE_BYTES TRUE TRUE
GPU_MEMORY_USAGE_BYTES_PER_GPU TRUE TRUE
GPU_MEMORY_UTILIZATION TRUE TRUE
GPU_MEMORY_UTILIZATION_PER_GPU TRUE
GPU_QUOTA TRUE TRUE TRUE TRUE
GPU_UTILIZATION TRUE TRUE TRUE TRUE
GPU_UTILIZATION_PER_GPU TRUE TRUE
POD_COUNT TRUE
RUNNING_POD_COUNT TRUE
TOTAL_GPU TRUE TRUE
TOTAL_GPU_NODES TRUE TRUE
GPU_UTILIZATION_DISTRIBUTION TRUE TRUE
UNALLOCATED_GPU TRUE TRUE
CPU_QUOTA_MILLICORES TRUE TRUE
CPU_MEMORY_QUOTA_MB TRUE TRUE
CPU_ALLOCATION_MILLICORES TRUE TRUE
CPU_MEMORY_ALLOCATION_MB TRUE TRUE

Advanced Metrics

NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, contact Run:ai customer support.

| Metric | Cluster | Node Pool | Workload | Pod |
| --- | --- | --- | --- | --- |
| GPU_FP16_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP32_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP64_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | | | | TRUE |
| GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_SM_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_SM_OCCUPANCY_PER_GPU | | | | TRUE |
| GPU_TENSOR_ACTIVITY_PER_GPU | | | | TRUE |

Supported telemetry

Telemetry Node Workload Project Department
API Node API Workload API
WORKLOADS_COUNT TRUE
ALLOCATED_GPUS TRUE TRUE TRUE TRUE
READY_GPU_NODES TRUE
READY_GPUS TRUE
TOTAL_GPU_NODES TRUE
TOTAL_GPUS TRUE
IDLE_ALLOCATED_GPUS TRUE
FREE_GPUS TRUE
TOTAL_CPU_CORES TRUE
USED_CPU_CORES TRUE
ALLOCATED_CPU_CORES TRUE TRUE TRUE
TOTAL_GPU_MEMORY_BYTES TRUE
USED_GPU_MEMORY_BYTES TRUE
TOTAL_CPU_MEMORY_BYTES TRUE
USED_CPU_MEMORY_BYTES TRUE
ALLOCATED_CPU_MEMORY_BYTES TRUE TRUE TRUE
GPU_QUOTA TRUE TRUE
CPU_QUOTA TRUE TRUE
MEMORY_QUOTA TRUE TRUE
GPU_ALLOCATION_NON_PREEMPTIBLE TRUE TRUE
CPU_ALLOCATION_NON_PREEMPTIBLE TRUE TRUE
MEMORY_ALLOCATION_NON_PREEMPTIBLE TRUE TRUE