# Metrics and telemetry

## Telemetry

Telemetry is a numeric measurement of a current state, recorded in real time as it is emitted from the Run:ai cluster.
## Metrics

Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics cover utilization, allocation, time measurements, and so on. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.

This document details the structure and purpose of the metrics emitted by Run:ai, enabling customers to create custom dashboards or to integrate metric data into other monitoring systems.

Run:ai provides metrics via the Run:ai Control-plane API. Previously, Run:ai provided metrics via direct access to an internal metrics store. That method is deprecated but is still documented here.
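To illustrate what a Control-plane API metrics query looks like, here is a minimal sketch of building such a request. The endpoint path, parameter names, tenant URL, and cluster UUID below are illustrative assumptions, not an authoritative API reference; consult the Run:ai API documentation for the exact contract.

```python
from urllib.parse import urlencode

def build_metrics_url(base_url: str, cluster_uuid: str,
                      metric_types: list[str], start: str, end: str) -> str:
    """Build a cluster-scope metrics query URL (illustrative path and params)."""
    params = urlencode({
        "metricType": ",".join(metric_types),  # e.g. "GPU_UTILIZATION"
        "start": start,  # ISO-8601 start of the time range
        "end": end,      # ISO-8601 end of the time range
    })
    return f"{base_url}/api/v1/clusters/{cluster_uuid}/metrics?{params}"

# Hypothetical tenant URL and cluster UUID:
url = build_metrics_url(
    "https://myorg.run.ai",
    "00000000-0000-0000-0000-000000000000",
    ["GPU_UTILIZATION", "ALLOCATED_GPU"],
    "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z",
)
print(url)
```

The actual request would send this URL with an authentication token obtained from the control plane.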
## Metric and telemetry scopes

The Run:ai Control-plane API supports metrics aggregated at the following levels (scopes):

| Level | Description |
|---|---|
| Cluster | A cluster is a set of node pools and nodes. Cluster metrics are aggregated over the entire cluster. |
| Node | Data is aggregated at the node level. |
| Node Pool | Data is aggregated at the node pool level. |
| Workload | Data is aggregated at the workload level. For some workloads, e.g. distributed workloads, these metrics aggregate data from all worker pods. |
| Pod | Data is collected per pod, the basic execution unit. |
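The workload scope can be pictured as a roll-up of its pods' measurements. The sketch below is a hypothetical illustration of that idea only; the document does not specify how the control plane aggregates server-side, and the pod names and aggregation choices here are assumptions.

```python
def aggregate_workload_metric(pod_samples: dict[str, float], how: str = "avg"):
    """Roll pod-level samples (pod name -> value) up to one workload value."""
    values = list(pod_samples.values())
    if not values:
        return None
    return sum(values) if how == "sum" else sum(values) / len(values)

# Two worker pods of a hypothetical distributed training job:
workers = {"train-worker-0": 80.0, "train-worker-1": 60.0}
avg_util = aggregate_workload_metric(workers)  # utilization-style metric
total_alloc = aggregate_workload_metric(
    {"train-worker-0": 1.0, "train-worker-1": 1.0}, "sum")  # allocation-style
```

Averaging suits utilization-style metrics, while summing suits counts such as allocated GPUs.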
## Supported metrics

| Metric | Cluster | Node Pool | Node | Workload | Pod |
|---|---|---|---|---|---|
| API | Cluster API | Node Pool API | | Workload API | Pod API |
| ALLOCATED_GPU | TRUE | TRUE | | TRUE | |
| AVG_WORKLOAD_WAIT_TIME | TRUE | TRUE | | | |
| CPU_LIMIT_CORES | | | | TRUE | |
| CPU_MEMORY_LIMIT_BYTES | | | | TRUE | |
| CPU_MEMORY_REQUEST_BYTES | | | | TRUE | |
| CPU_MEMORY_USAGE_BYTES | TRUE | TRUE | | TRUE | |
| CPU_MEMORY_UTILIZATION | TRUE | TRUE | | TRUE | |
| CPU_REQUEST_CORES | | | | TRUE | |
| CPU_USAGE_CORES | TRUE | TRUE | | TRUE | |
| CPU_UTILIZATION | TRUE | TRUE | | TRUE | |
| GPU_ALLOCATION | | | | TRUE | |
| GPU_MEMORY_REQUEST_BYTES | | | | TRUE | |
| GPU_MEMORY_USAGE_BYTES | | | | TRUE | TRUE |
| GPU_MEMORY_USAGE_BYTES_PER_GPU | | | | TRUE | TRUE |
| GPU_MEMORY_UTILIZATION | | | | TRUE | TRUE |
| GPU_MEMORY_UTILIZATION_PER_GPU | | | | | TRUE |
| GPU_QUOTA | TRUE | TRUE | | | |
| GPU_UTILIZATION | TRUE | TRUE | | TRUE | TRUE |
| GPU_UTILIZATION_PER_GPU | | | | TRUE | TRUE |
| POD_COUNT | | | | TRUE | |
| RUNNING_POD_COUNT | | | | TRUE | |
| TOTAL_GPU | TRUE | TRUE | | | |
| TOTAL_GPU_NODES | TRUE | TRUE | | | |
| GPU_UTILIZATION_DISTRIBUTION | TRUE | TRUE | | | |
| UNALLOCATED_GPU | TRUE | TRUE | | | |
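A metrics query returns a time series per requested metric type, which can be post-processed for custom dashboards. The payload shape below is an assumed example for illustration only; the authoritative response schema is defined by the Run:ai API reference.

```python
import json

# Assumed example payload: one measurement per requested metric type,
# each carrying (timestamp, value) pairs.
sample = json.loads("""
{
  "measurements": [
    {"type": "GPU_UTILIZATION",
     "values": [{"timestamp": "2024-01-01T00:00:00Z", "value": "40"},
                {"timestamp": "2024-01-01T00:05:00Z", "value": "60"}]}
  ]
}
""")

def series(payload: dict, metric_type: str) -> list[tuple[str, float]]:
    """Extract one metric's numeric time series from the payload."""
    for m in payload["measurements"]:
        if m["type"] == metric_type:
            return [(v["timestamp"], float(v["value"])) for v in m["values"]]
    return []

points = series(sample, "GPU_UTILIZATION")
avg_utilization = sum(v for _, v in points) / len(points)
```

Values are parsed to floats explicitly since JSON payloads often carry them as strings.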
## Advanced metrics

NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, please contact Run:ai customer support.

| Metric | Cluster | Node Pool | Workload | Pod |
|---|---|---|---|---|
| GPU_FP16_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP32_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP64_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | | | | TRUE |
| GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_SM_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_SM_OCCUPANCY_PER_GPU | | | | TRUE |
| GPU_TENSOR_ACTIVITY_PER_GPU | | | | TRUE |
## Supported telemetry

| Telemetry | Node | Workload |
|---|---|---|
| API | Node API | Workload API |
| WORKLOADS_COUNT | | TRUE |
| ALLOCATED_GPUS | TRUE | TRUE |
| READY_GPU_NODES | TRUE | |
| READY_GPUS | TRUE | |
| TOTAL_GPU_NODES | TRUE | |
| TOTAL_GPUS | TRUE | |
| IDLE_ALLOCATED_GPUS | TRUE | |
| FREE_GPUS | TRUE | |
| TOTAL_CPU_CORES | TRUE | |
| USED_CPU_CORES | TRUE | |
| ALLOCATED_CPU_CORES | TRUE | |
| TOTAL_GPU_MEMORY_BYTES | TRUE | |
| USED_GPU_MEMORY_BYTES | TRUE | |
| TOTAL_CPU_MEMORY_BYTES | TRUE | |
| USED_CPU_MEMORY_BYTES | TRUE | |
| ALLOCATED_CPU_MEMORY_BYTES | TRUE | |
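Unlike metrics, telemetry is a point-in-time value rather than a time series, so a telemetry response carries current values, optionally grouped (for example, per node). The payload shape, grouping key, and node names below are assumptions for illustration; the real schema is defined by the Run:ai API reference.

```python
# Assumed example payload: an ALLOCATED_GPUS telemetry snapshot grouped
# per node (hypothetical node names).
sample = {
    "values": [
        {"groupBy": {"nodeName": "node-a"}, "value": "4"},
        {"groupBy": {"nodeName": "node-b"}, "value": "2"},
    ]
}

def total_allocated_gpus(payload: dict) -> float:
    """Sum a grouped telemetry snapshot across all groups."""
    return sum(float(v["value"]) for v in payload["values"])

print(total_allocated_gpus(sample))
```

Summing across groups recovers the cluster-wide current value from a per-node breakdown.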