Metrics via API
What are Metrics
Metrics are numeric measurements recorded over time and emitted from the Run:ai cluster. Typical metrics include utilization, allocation, and time measurements. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.
This document details the structure and purpose of the metrics emitted by Run:ai, so that customers can build custom dashboards or integrate metric data into other monitoring systems.
Run:ai provides metrics via the Run:ai Control-plane API. In the past, Run:ai provided metrics via direct access to an internal metrics store; that method is deprecated but is still documented here.
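For orientation, a metrics query is an ordinary authenticated HTTPS GET against the control plane. The following is a minimal sketch only: the environment variable names are invented for the example, the bearer token is assumed to have been created as described in the Run:ai API authentication documentation, and the exact endpoint paths and query parameters should be verified against the API reference for your control-plane version.

```python
# Minimal sketch of calling a Run:ai control-plane metrics endpoint.
# The environment variable names and this helper are illustrative only;
# endpoint paths and parameters must be checked against the API reference.
import os

import requests

BASE_URL = os.environ["RUNAI_BASE_URL"]    # e.g. https://<company>.run.ai
API_TOKEN = os.environ["RUNAI_API_TOKEN"]  # bearer token created per the API authentication docs


def get_metrics(path: str, params: dict) -> dict:
    """GET a metrics endpoint on the control plane and return the parsed JSON body."""
    response = requests.get(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Raising on non-2xx responses keeps metric-collection scripts from silently recording gaps in the time series.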
Metric Scopes
The Run:ai Control-plane API supports and aggregates metrics at the following levels (the corresponding endpoints are sketched after the table).
Level | Description |
---|---|
Cluster | A cluster is a set of node pools and nodes. Metrics are aggregated at the cluster level. |
Node Pool | Metrics are aggregated at the node pool level. |
Workload | Metrics are aggregated at the workload level. For some workloads (for example, distributed workloads), these metrics aggregate data from all worker pods. |
Pod | The basic execution unit. Metrics are reported for the individual pod. |
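Each scope corresponds to its own metrics endpoint. The path templates below are assumptions that mirror the pattern of the Run:ai API reference; the placeholder names such as {clusterUuid} are illustrative and should be checked against the reference for your control-plane version.

```python
# Assumed path templates, one per metric scope; confirm the exact paths and
# placeholder names against the Run:ai API reference for your version.
METRIC_ENDPOINT_TEMPLATES = {
    "Cluster":   "/api/v1/clusters/{clusterUuid}/metrics",
    "Node Pool": "/api/v1/clusters/{clusterUuid}/nodepools/{nodepoolName}/metrics",
    "Workload":  "/api/v1/workloads/{workloadId}/metrics",
    "Pod":       "/api/v1/workloads/{workloadId}/pods/{podId}/metrics",
}

# Example: build the workload-scope path for a given workload ID.
workload_metrics_path = METRIC_ENDPOINT_TEMPLATES["Workload"].format(
    workloadId="<workload-uuid>"  # hypothetical placeholder
)
```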
Supported Metrics
Metric | Cluster | Node Pool | Workload | Pod |
---|---|---|---|---|
API | Cluster API | Node Pool API | Workload API | Pod API |
ALLOCATED_GPU | TRUE | TRUE | | |
AVG_WORKLOAD_WAIT_TIME | TRUE | TRUE | | |
CPU_LIMIT_CORES | | | TRUE | |
CPU_MEMORY_LIMIT_BYTES | | | TRUE | |
CPU_MEMORY_REQUEST_BYTES | | | TRUE | |
CPU_MEMORY_USAGE_BYTES | | | TRUE | TRUE |
CPU_MEMORY_UTILIZATION | | | TRUE | TRUE |
CPU_REQUEST_CORES | | | TRUE | |
CPU_USAGE_CORES | | | TRUE | TRUE |
CPU_UTILIZATION | | | TRUE | TRUE |
GPU_ALLOCATION | | | TRUE | |
GPU_MEMORY_REQUEST_BYTES | | | TRUE | |
GPU_MEMORY_USAGE_BYTES | | | TRUE | TRUE |
GPU_MEMORY_USAGE_BYTES_PER_GPU | | | | TRUE |
GPU_MEMORY_UTILIZATION | | | TRUE | TRUE |
GPU_QUOTA | TRUE | TRUE | | |
GPU_UTILIZATION | TRUE | TRUE | TRUE | TRUE |
GPU_UTILIZATION_PER_GPU | | | | TRUE |
POD_COUNT | | | TRUE | |
RUNNING_POD_COUNT | | | TRUE | |
TOTAL_GPU | TRUE | TRUE | | |
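As an illustration of requesting the metric types above, the sketch below queries the Workload scope for GPU_UTILIZATION and CPU_USAGE_CORES over the last hour. It is a sketch only: the query parameter names (metricType, start, end, numberOfSamples) and the response layout shown in the comments are assumptions to confirm against the Workload API reference.

```python
# Illustrative query for two workload-scope metric types over the last hour.
# Parameter names and response layout are assumptions; verify against the
# Workload metrics API reference for your control-plane version.
import os
from datetime import datetime, timedelta, timezone

import requests

BASE_URL = os.environ["RUNAI_BASE_URL"]
API_TOKEN = os.environ["RUNAI_API_TOKEN"]
WORKLOAD_ID = "<workload-uuid>"  # hypothetical placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

response = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/metrics",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={
        "metricType": ["GPU_UTILIZATION", "CPU_USAGE_CORES"],  # sent as a repeated query parameter
        "start": start.isoformat(),
        "end": end.isoformat(),
        "numberOfSamples": 20,
    },
    timeout=30,
)
response.raise_for_status()

# Assumed response shape: a list of measurements, one per metric type,
# each carrying timestamped samples.
for measurement in response.json().get("measurements", []):
    print(measurement.get("type"))
    for sample in measurement.get("values", []):
        print(" ", sample.get("timestamp"), sample.get("value"))
```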
Advanced Metrics
NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, please contact Run:ai customer support.
Metric | Cluster | Node Pool | Workload | Pod |
---|---|---|---|---|
GPU_FP16_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_FP32_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_FP64_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | | | | TRUE |
GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
GPU_SM_ACTIVITY_PER_GPU | | | | TRUE |
GPU_SM_OCCUPANCY_PER_GPU | | | | TRUE |
GPU_TENSOR_ACTIVITY_PER_GPU | | | | TRUE |
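Once enabled, the advanced metrics are retrieved like any other Pod-scope metric. The sketch below requests GPU_SM_ACTIVITY_PER_GPU purely as an example, under the same assumptions about endpoint paths and parameter names as the earlier sketches.

```python
# Illustrative request for an advanced per-GPU metric at the Pod scope.
# The path and parameter names are assumptions; verify against the Pod API reference.
import os

import requests

BASE_URL = os.environ["RUNAI_BASE_URL"]
API_TOKEN = os.environ["RUNAI_API_TOKEN"]
WORKLOAD_ID = "<workload-uuid>"  # hypothetical placeholders
POD_ID = "<pod-uuid>"

response = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/pods/{POD_ID}/metrics",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={
        "metricType": "GPU_SM_ACTIVITY_PER_GPU",
        "start": "2024-01-01T00:00:00Z",  # example time window
        "end": "2024-01-01T01:00:00Z",
        "numberOfSamples": 10,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```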