# Metrics and telemetry

## Telemetry

Telemetry is a numeric measurement of a current state, recorded in real time as it is emitted from the Run:ai cluster.
## Metrics

Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics cover utilization, allocation, time measurements, and so on. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.

This document details the structure and purpose of the metrics emitted by Run:ai, enabling customers to create custom dashboards or to integrate metric data into other monitoring systems.

Run:ai provides metrics via the Run:ai Control-plane API. Previously, Run:ai provided metrics via direct access to an internal metrics store. That method is deprecated but is still documented here.
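To illustrate what a Control-plane API metrics query looks like, here is a minimal sketch of building such a request. The endpoint path, parameter names, tenant URL, and cluster UUID below are illustrative assumptions, not an authoritative API reference; consult the Run:ai API documentation for the exact contract.

```python
from urllib.parse import urlencode

def build_metrics_url(base_url: str, cluster_uuid: str,
                      metric_types: list[str], start: str, end: str) -> str:
    """Build a cluster-scope metrics query URL (illustrative path and params)."""
    params = urlencode({
        "metricType": ",".join(metric_types),  # e.g. "GPU_UTILIZATION"
        "start": start,  # ISO-8601 start of the time range
        "end": end,      # ISO-8601 end of the time range
    })
    return f"{base_url}/api/v1/clusters/{cluster_uuid}/metrics?{params}"

# Hypothetical tenant URL and cluster UUID:
url = build_metrics_url(
    "https://myorg.run.ai",
    "00000000-0000-0000-0000-000000000000",
    ["GPU_UTILIZATION", "ALLOCATED_GPU"],
    "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z",
)
print(url)
```

The actual request would send this URL with an authentication token obtained from the control plane.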
## Metric and telemetry scopes

The Run:ai Control-plane API supports metrics aggregated at the following levels (scopes):

| Level | Description |
|---|---|
| Cluster | A cluster is a set of node pools and nodes. Cluster metrics are aggregated over the entire cluster. |
| Node | Data is aggregated at the node level. |
| Node Pool | Data is aggregated at the node pool level. |
| Workload | Data is aggregated at the workload level. For some workloads, e.g. distributed workloads, these metrics aggregate data from all worker pods. |
| Pod | Data is collected per pod, the basic execution unit. |
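The workload scope can be pictured as a roll-up of its pods' measurements. The sketch below is a hypothetical illustration of that idea only; the document does not specify how the control plane aggregates server-side, and the pod names and aggregation choices here are assumptions.

```python
def aggregate_workload_metric(pod_samples: dict[str, float], how: str = "avg"):
    """Roll pod-level samples (pod name -> value) up to one workload value."""
    values = list(pod_samples.values())
    if not values:
        return None
    return sum(values) if how == "sum" else sum(values) / len(values)

# Two worker pods of a hypothetical distributed training job:
workers = {"train-worker-0": 80.0, "train-worker-1": 60.0}
avg_util = aggregate_workload_metric(workers)  # utilization-style metric
total_alloc = aggregate_workload_metric(
    {"train-worker-0": 1.0, "train-worker-1": 1.0}, "sum")  # allocation-style
```

Averaging suits utilization-style metrics, while summing suits counts such as allocated GPUs.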
## Supported metrics

| Metric | Cluster | Node Pool | Node | Workload | Pod |
|---|---|---|---|---|---|
| API | Cluster API | Node Pool API | | Workload API | Pod API |
| ALLOCATED_GPU | TRUE | TRUE | | TRUE | |
| AVG_WORKLOAD_WAIT_TIME | TRUE | TRUE | | | |
| CPU_LIMIT_CORES | | | | TRUE | |
| CPU_MEMORY_LIMIT_BYTES | | | | TRUE | |
| CPU_MEMORY_REQUEST_BYTES | | | | TRUE | |
| CPU_MEMORY_USAGE_BYTES | TRUE | TRUE | | TRUE | |
| CPU_MEMORY_UTILIZATION | TRUE | TRUE | | TRUE | |
| CPU_REQUEST_CORES | | | | TRUE | |
| CPU_USAGE_CORES | TRUE | TRUE | | TRUE | |
| CPU_UTILIZATION | TRUE | TRUE | | TRUE | |
| GPU_ALLOCATION | | | | TRUE | |
| GPU_MEMORY_REQUEST_BYTES | | | | TRUE | |
| GPU_MEMORY_USAGE_BYTES | | | | TRUE | TRUE |
| GPU_MEMORY_USAGE_BYTES_PER_GPU | | | | TRUE | TRUE |
| GPU_MEMORY_UTILIZATION | | | | TRUE | TRUE |
| GPU_MEMORY_UTILIZATION_PER_GPU | | | | | TRUE |
| GPU_QUOTA | TRUE | TRUE | | | |
| GPU_UTILIZATION | TRUE | TRUE | | TRUE | TRUE |
| GPU_UTILIZATION_PER_GPU | | | | TRUE | TRUE |
| POD_COUNT | | | | TRUE | |
| RUNNING_POD_COUNT | | | | TRUE | |
| TOTAL_GPU | TRUE | TRUE | | | |
| TOTAL_GPU_NODES | TRUE | TRUE | | | |
| GPU_UTILIZATION_DISTRIBUTION | TRUE | TRUE | | | |
| UNALLOCATED_GPU | TRUE | TRUE | | | |
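A metrics query returns a time series per requested metric type, which can be post-processed for custom dashboards. The payload shape below is an assumed example for illustration only; the authoritative response schema is defined by the Run:ai API reference.

```python
import json

# Assumed example payload: one measurement per requested metric type,
# each carrying (timestamp, value) pairs.
sample = json.loads("""
{
  "measurements": [
    {"type": "GPU_UTILIZATION",
     "values": [{"timestamp": "2024-01-01T00:00:00Z", "value": "40"},
                {"timestamp": "2024-01-01T00:05:00Z", "value": "60"}]}
  ]
}
""")

def series(payload: dict, metric_type: str) -> list[tuple[str, float]]:
    """Extract one metric's numeric time series from the payload."""
    for m in payload["measurements"]:
        if m["type"] == metric_type:
            return [(v["timestamp"], float(v["value"])) for v in m["values"]]
    return []

points = series(sample, "GPU_UTILIZATION")
avg_utilization = sum(v for _, v in points) / len(points)
```

Values are parsed to floats explicitly since JSON payloads often carry them as strings.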
## Advanced metrics

NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, please contact Run:ai customer support.

| Metric | Cluster | Node Pool | Workload | Pod |
|---|---|---|---|---|
| GPU_FP16_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP32_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP64_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | | | | TRUE |
| GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_SM_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_SM_OCCUPANCY_PER_GPU | | | | TRUE |
| GPU_TENSOR_ACTIVITY_PER_GPU | | | | TRUE |
## Supported telemetry

| Telemetry | Node | Workload |
|---|---|---|
| API | Node API | Workload API |
| WORKLOADS_COUNT | | TRUE |
| ALLOCATED_GPUS | TRUE | TRUE |
| READY_GPU_NODES | TRUE | |
| READY_GPUS | TRUE | |
| TOTAL_GPU_NODES | TRUE | |
| TOTAL_GPUS | TRUE | |
| IDLE_ALLOCATED_GPUS | TRUE | |
| FREE_GPUS | TRUE | |
| TOTAL_CPU_CORES | TRUE | |
| USED_CPU_CORES | TRUE | |
| ALLOCATED_CPU_CORES | TRUE | |
| TOTAL_GPU_MEMORY_BYTES | TRUE | |
| USED_GPU_MEMORY_BYTES | TRUE | |
| TOTAL_CPU_MEMORY_BYTES | TRUE | |
| USED_CPU_MEMORY_BYTES | TRUE | |
| ALLOCATED_CPU_MEMORY_BYTES | TRUE | |
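Unlike metrics, telemetry is a point-in-time value rather than a time series, so a telemetry response carries current values, optionally grouped (for example, per node). The payload shape, grouping key, and node names below are assumptions for illustration; the real schema is defined by the Run:ai API reference.

```python
# Assumed example payload: an ALLOCATED_GPUS telemetry snapshot grouped
# per node (hypothetical node names).
sample = {
    "values": [
        {"groupBy": {"nodeName": "node-a"}, "value": "4"},
        {"groupBy": {"nodeName": "node-b"}, "value": "2"},
    ]
}

def total_allocated_gpus(payload: dict) -> float:
    """Sum a grouped telemetry snapshot across all groups."""
    return sum(float(v["value"]) for v in payload["values"])

print(total_allocated_gpus(sample))
```

Summing across groups recovers the cluster-wide current value from a per-node breakdown.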