Metrics and telemetry
Telemetry
Telemetry is a numeric measurement emitted by the Run:ai cluster and recorded in real time.
Metrics
Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics involve utilization, allocation, time measurements and so on. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.
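The distinction between the two can be sketched as a small data model: a telemetry reading is a single current value, while a metric is a series of timestamped samples. The type names `TelemetryReading` and `MetricSeries` and the sample values are illustrative assumptions, not Run:ai schemas.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TelemetryReading:
    """A single point-in-time value (illustrative type, not a Run:ai schema)."""
    name: str
    value: float


@dataclass
class MetricSeries:
    """Timestamped samples of one measurement (illustrative type)."""
    name: str
    samples: List[Tuple[datetime, float]] = field(default_factory=list)

    def latest(self) -> Optional[float]:
        """Return the value of the most recent sample, or None if empty."""
        if not self.samples:
            return None
        return max(self.samples, key=lambda s: s[0])[1]

    def average(self) -> Optional[float]:
        """Return the mean of all sample values, or None if empty."""
        if not self.samples:
            return None
        return sum(v for _, v in self.samples) / len(self.samples)


# Made-up example values, for illustration only.
series = MetricSeries("GPU_UTILIZATION", [
    (datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), 40.0),
    (datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), 60.0),
])
reading = TelemetryReading("READY_GPUS", 8)
```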
The purpose of this document is to detail the structure and purpose of metrics emitted by Run:ai. This enables customers to create custom dashboards or integrate metric data into other monitoring systems.
Run:ai provides metrics via the Run:ai Control-plane API. Previously, Run:ai provided metrics through direct access to an internal metrics store. That method is deprecated but is still documented here.
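As a rough sketch of how a client might prepare a metrics request against the Control-plane API, the helper below only builds a URL and performs no network call. The resource path and the query parameter names (`metricType`, `start`, `end`, `numberOfSamples`) are assumptions made for illustration; take the actual endpoint, parameters, and authentication scheme from the Run:ai API reference.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_metrics_url(base_url, resource_path, metric_types, start, end, samples=20):
    """Build a metrics query URL.

    The query parameter names used here are assumptions, not confirmed
    by this document; check them against the Run:ai API reference.
    """
    query = urlencode(
        [("metricType", m) for m in metric_types]
        + [
            ("start", start.isoformat()),
            ("end", end.isoformat()),
            ("numberOfSamples", samples),
        ]
    )
    return f"{base_url.rstrip('/')}/{resource_path.strip('/')}?{query}"


url = build_metrics_url(
    "https://runai.example.com/",            # placeholder host
    "/api/v1/clusters/CLUSTER_ID/metrics",   # hypothetical path
    ["GPU_UTILIZATION", "ALLOCATED_GPU"],    # metric names from the tables in this document
    start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    end=datetime(2024, 1, 2, tzinfo=timezone.utc),
)
```

Keeping the path as a parameter means the same helper can target whichever scope-specific endpoint the API reference lists.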
Metric and telemetry scopes
The Run:ai Control-plane API supports and aggregates metrics at the following levels.
| Level | Description |
|---|---|
| Cluster | A cluster is a set of Node Pools and Nodes. Cluster metrics are aggregated at the Cluster level. |
| Node | Data is aggregated at the Node level. |
| Node Pool | Data is aggregated at the Node Pool level. |
| Workload | Data is aggregated at the Workload level. For some Workloads, such as distributed workloads, these metrics aggregate data from all worker pods. |
| Pod | The basic execution unit |
| Project | The basic organizational unit. Projects are the tool to implement resource allocation policies as well as the segregation between different initiatives. |
| Department | Departments are a grouping of projects. |
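To illustrate what aggregation across scopes means, the sketch below rolls pod-level samples up to the workload level, as described for distributed workloads. The function name, the sample data, and the choice of summation are assumptions; the source does not state which aggregation Run:ai applies to each metric.

```python
from collections import defaultdict


def aggregate_by_workload(pod_samples, combine=sum):
    """Combine pod-level values into one value per workload.

    pod_samples: iterable of (workload_name, pod_name, value) tuples.
    combine: function applied to each workload's list of values
             (sum by default; this default is an assumption).
    """
    grouped = defaultdict(list)
    for workload, _pod, value in pod_samples:
        grouped[workload].append(value)
    return {workload: combine(values) for workload, values in grouped.items()}


# Made-up pod samples: a distributed workload with two workers, plus a notebook.
pod_samples = [
    ("train-dist", "worker-0", 2.0),
    ("train-dist", "worker-1", 2.0),
    ("notebook", "notebook-0", 0.5),
]
totals = aggregate_by_workload(pod_samples)
```

Passing a different `combine` function, such as `max` or a mean, would suit metrics where a sum is not meaningful, for example utilization percentages.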
Supported Metrics
| Metric | Cluster | Node Pool | Node | Workload | Pod | Project | Department |
|---|---|---|---|---|---|---|---|
| API | Cluster API | Node Pool API | | Workload API | Pod API | | |
| ALLOCATED_GPU | TRUE | TRUE | TRUE | ||||
| AVG_WORKLOAD_WAIT_TIME | TRUE | TRUE | |||||
| CPU_LIMIT_CORES | TRUE | ||||||
| CPU_MEMORY_LIMIT_BYTES | TRUE | ||||||
| CPU_MEMORY_REQUEST_BYTES | TRUE | ||||||
| CPU_MEMORY_USAGE_BYTES | TRUE | TRUE | TRUE | ||||
| CPU_MEMORY_UTILIZATION | TRUE | TRUE | TRUE | ||||
| CPU_REQUEST_CORES | TRUE | ||||||
| CPU_USAGE_CORES | TRUE | TRUE | TRUE | ||||
| CPU_UTILIZATION | TRUE | TRUE | TRUE | ||||
| GPU_ALLOCATION | TRUE | TRUE | TRUE | ||||
| GPU_MEMORY_REQUEST_BYTES | TRUE | ||||||
| GPU_MEMORY_USAGE_BYTES | TRUE | TRUE | |||||
| GPU_MEMORY_USAGE_BYTES_PER_GPU | TRUE | TRUE | |||||
| GPU_MEMORY_UTILIZATION | TRUE | TRUE | |||||
| GPU_MEMORY_UTILIZATION_PER_GPU | TRUE | ||||||
| GPU_QUOTA | TRUE | TRUE | TRUE | TRUE | |||
| GPU_UTILIZATION | TRUE | TRUE | TRUE | TRUE | |||
| GPU_UTILIZATION_PER_GPU | TRUE | TRUE | |||||
| POD_COUNT | TRUE | ||||||
| RUNNING_POD_COUNT | TRUE | ||||||
| TOTAL_GPU | TRUE | TRUE | |||||
| TOTAL_GPU_NODES | TRUE | TRUE | |||||
| GPU_UTILIZATION_DISTRIBUTION | TRUE | TRUE | |||||
| UNALLOCATED_GPU | TRUE | TRUE | |||||
| CPU_QUOTA_MILLICORES | TRUE | TRUE | |||||
| CPU_MEMORY_QUOTA_MB | TRUE | TRUE | |||||
| CPU_ALLOCATION_MILLICORES | TRUE | TRUE | |||||
| CPU_MEMORY_ALLOCATION_MB | TRUE | TRUE | |||||
Advanced Metrics
NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, contact Run:ai customer support.
| Metric | Cluster | Node Pool | Workload | Pod |
|---|---|---|---|---|
| GPU_FP16_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP32_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP64_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | | | | TRUE |
| GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_SM_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_SM_OCCUPANCY_PER_GPU | | | | TRUE |
| GPU_TENSOR_ACTIVITY_PER_GPU | | | | TRUE |
Supported telemetry
| Telemetry | Node | Workload | Project | Department |
|---|---|---|---|---|
| API | Node API | Workload API | | |
| WORKLOADS_COUNT | TRUE | |||
| ALLOCATED_GPUS | TRUE | TRUE | TRUE | TRUE |
| READY_GPU_NODES | TRUE | |||
| READY_GPUS | TRUE | |||
| TOTAL_GPU_NODES | TRUE | |||
| TOTAL_GPUS | TRUE | |||
| IDLE_ALLOCATED_GPUS | TRUE | |||
| FREE_GPUS | TRUE | |||
| TOTAL_CPU_CORES | TRUE | |||
| USED_CPU_CORES | TRUE | |||
| ALLOCATED_CPU_CORES | TRUE | TRUE | TRUE | |
| TOTAL_GPU_MEMORY_BYTES | TRUE | |||
| USED_GPU_MEMORY_BYTES | TRUE | |||
| TOTAL_CPU_MEMORY_BYTES | TRUE | |||
| USED_CPU_MEMORY_BYTES | TRUE | |||
| ALLOCATED_CPU_MEMORY_BYTES | TRUE | TRUE | TRUE | |
| GPU_QUOTA | TRUE | TRUE | ||
| CPU_QUOTA | TRUE | TRUE | ||
| MEMORY_QUOTA | TRUE | TRUE | ||
| GPU_ALLOCATION_NON_PREEMPTIBLE | TRUE | TRUE | ||
| CPU_ALLOCATION_NON_PREEMPTIBLE | TRUE | TRUE | ||
| MEMORY_ALLOCATION_NON_PREEMPTIBLE | TRUE | TRUE |