
Metrics

What are Metrics

Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics include utilization, allocation, and time measurements. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.

This document details the structure and meaning of the metrics emitted by Run:ai, so that customers can create custom dashboards or integrate metric data into other monitoring systems.

Run:ai uses Prometheus for collecting and querying metrics.
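
This means any Prometheus client can read Run:ai metrics directly. Below is a minimal sketch in Python using the standard Prometheus HTTP API; the Prometheus URL is a placeholder for the endpoint of the Prometheus instance that scrapes your cluster.

```python
import requests

# Placeholder: replace with the URL of the Prometheus instance
# that scrapes your Run:ai cluster.
PROMETHEUS_URL = "http://prometheus.example.com:9090"

def query_instant(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Current CPU utilization (0 to 1) of the entire cluster.
for sample in query_instant("runai_cluster_cpu_utilization"):
    print(sample["metric"].get("clusterId"), sample["value"][1])
```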

Note

From cluster version 2.17 onward, Run:ai supports metrics via the Run:ai API, and direct metric queries (metrics queried directly from Prometheus) are deprecated.

Published Run:ai Metrics

The following table lists the published Run:ai metrics for this cluster version:

| Metric name | Labels | Measurement | Description |
|---|---|---|---|
| runai_active_job_cpu_requested_cores | {clusterId, job_name, job_uuid} | CPU cores | Workload's requested CPU cores |
| runai_active_job_memory_requested_bytes | {clusterId, job_name, job_uuid} | Bytes | Workload's requested CPU memory |
| runai_cluster_cpu_utilization | {clusterId} | 0 to 1 | CPU utilization of the entire cluster |
| runai_cluster_memory_used_bytes | {clusterId} | Bytes | Used CPU memory of the entire cluster |
| runai_cluster_memory_utilization | {clusterId} | 0 to 1 | CPU memory utilization of the entire cluster |
| runai_allocated_gpu_count_per_gpu | {gpu, clusterId, node} | 0/1 | Whether a GPU is hosting a pod |
| runai_last_gpu_utilization_time_per_gpu | {gpu, clusterId, node} | Unix time | Last time the GPU was not idle |
| runai_requested_gpu_memory_mb_per_workload | {clusterId, job_type, job_uuid, job_name, project, workload_id} | Megabytes | Requested GPU memory per workload (0 if not specified by the user) |
| runai_requested_gpus_per_workload | {clusterId, workload_type, workload_id, workload_name, project} | Double | Number of requested GPUs per workload |
| runai_run_time_seconds_per_workload | {clusterId, workload_id, workload_name} | Seconds | Total run time per workload |
| runai_wait_time_seconds_per_workload | {clusterId, workload_id, workload_name} | Seconds | Total wait time per workload |
| runai_node_cpu_requested_cores | {clusterId, node} | Double | Sum of the requested CPU cores of all workloads running on a node |
| runai_node_cpu_utilization | {clusterId, node} | 0 to 1 | CPU utilization per node |
| runai_node_memory_utilization | {clusterId, node} | 0 to 1 | CPU memory utilization per node |
| runai_node_requested_memory_bytes | {clusterId, node} | Bytes | Sum of the requested CPU memory of all workloads running on a node |
| runai_node_used_memory_bytes | {clusterId, node} | Bytes | Used CPU memory per node |
| runai_project_guaranteed_gpus | {clusterId, project} | Double | Guaranteed GPU quota per project |
| runai_project_info | {memory_quota, cpu_quota, gpu_guaranteed_quota, clusterId, project, department} | N/A | Information on CPU, CPU memory, and GPU quota per project |
| runai_queue_info | {memory_quota, cpu_quota, gpu_guaranteed_quota, clusterId, nodepool, queue_name, department} | N/A | Information on CPU, CPU memory, and GPU quota per project/department, per node pool |
| runai_cpu_limits_per_active_workload | {clusterId, job_name, job_uuid} | CPU cores | Workload's CPU limit (in number of cores) |
| runai_job_cpu_usage | {clusterId, workload_id, workload_name, project} | Double | Workload's CPU usage (in number of cores) |
| runai_memory_limits_per_active_workload | {clusterId, job_name, job_uuid} | Bytes | Workload's CPU memory limit |
| runai_job_memory_used_bytes | {clusterId, workload_id, workload_name, project} | Bytes | Workload's used CPU memory |
| runai_mig_mode_gpu_count | {clusterId, node} | Double | Number of GPUs on MIG nodes |
| runai_gpu_utilization_per_gpu | {clusterId, gpu, node} | % | GPU utilization per GPU |
| runai_gpu_utilization_per_node | {clusterId, node} | % | GPU utilization per node |
| runai_gpu_memory_used_mebibytes_per_gpu | {clusterId, gpu, node} | MiB | Used GPU memory per GPU |
| runai_gpu_memory_used_mebibytes_per_node | {clusterId, node} | MiB | Used GPU memory per node |
| runai_gpu_memory_total_mebibytes_per_gpu | {clusterId, gpu, node} | MiB | Total GPU memory per GPU |
| runai_gpu_memory_total_mebibytes_per_node | {clusterId, node} | MiB | Total GPU memory per node |
| runai_gpu_count_per_node | {clusterId, node, modelName, ready, schedulable} | Number | Number of GPUs per node |
| runai_allocated_gpu_count_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | Double | Number of allocated GPUs per workload |
| runai_allocated_gpu_count_per_project | {clusterId, project} | Double | Number of allocated GPUs per project |
| runai_gpu_memory_used_mebibytes_per_pod_per_gpu | {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} | MiB | Used GPU memory per pod, per GPU on which the workload is running |
| runai_gpu_memory_used_mebibytes_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | MiB | Used GPU memory per workload |
| runai_gpu_utilization_per_pod_per_gpu | {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} | % | GPU utilization per pod, per GPU |
| runai_gpu_utilization_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | % | Average GPU utilization per workload |
| runai_gpu_utilization_per_project | {clusterId, project} | % | Average GPU utilization per project |
| runai_last_gpu_utilization_time_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | Seconds (Unix timestamp) | The last time (Unix timestamp) that the workload utilized any of its allocated GPUs |
| runai_gpu_idle_seconds_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | Seconds | Seconds passed since the workload last utilized any of its allocated GPUs |
| runai_allocated_gpu_count_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Double | Number of allocated GPUs per pod |
| runai_allocated_gpu_count_per_node | {clusterId, node} | Double | Number of allocated GPUs per node |
| runai_allocated_millicpus_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Integer | Number of allocated millicpus per pod |
| runai_allocated_memory_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Bytes | Allocated memory per pod |
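
The labels on these metrics can be used in PromQL to filter and aggregate. The following sketch shows a few illustrative queries built only from the metric and label names in the table above; the Prometheus URL is a placeholder.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder

# Illustrative PromQL expressions over the published Run:ai metrics.
QUERIES = {
    # Total GPUs currently allocated in each project.
    "allocated_gpus_by_project": "sum by (project) (runai_allocated_gpu_count_per_project)",
    # Average GPU utilization per workload type (training, interactive, inference).
    "gpu_util_by_workload_type": "avg by (workload_type) (runai_gpu_utilization_per_workload)",
    # Fraction of each node's GPU memory currently in use.
    "gpu_mem_fraction_by_node": (
        "runai_gpu_memory_used_mebibytes_per_node"
        " / runai_gpu_memory_total_mebibytes_per_node"
    ),
}

for name, promql in QUERIES.items():
    result = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    ).json()["data"]["result"]
    print(name, result)
```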

The following table lists the labels that appear in Run:ai metrics:

| Label | Description |
|---|---|
| clusterId | Cluster identifier |
| department | Name of the Run:ai department |
| cpu_quota | CPU limit per project |
| gpu | GPU index |
| gpu_guaranteed_quota | Guaranteed GPU quota per project |
| image | Name of the Docker image |
| namespace_name | Namespace |
| deployment_name | Deployment name |
| job_name | Job name |
| job_type | Job type: training, interactive, or inference |
| job_uuid | Job identifier |
| workload_name | Workload name |
| workload_type | Workload type: training, interactive, or inference |
| workload_uuid | Workload identifier |
| pod_name | Pod name. A workload can contain many pods. |
| pod_namespace | Pod namespace |
| memory_quota | CPU memory limit per project |
| node | Node name |
| project | Name of the Run:ai project |
| status | Workload status (Running, Pending, etc.); see the workload statuses documentation for the full list |
| user | User identifier |

Other Metrics

Run:ai exports other metrics emitted by NVIDIA and Kubernetes packages, as follows:

| Metric name | Description |
|---|---|
| runai_gpu_utilization_per_gpu | GPU utilization |
| kube_node_status_capacity | The capacity for different resources of a node |
| kube_node_status_condition | The condition of a cluster node |
| kube_pod_container_resource_requests_cpu_cores | The number of CPU cores requested by a container |
| kube_pod_container_resource_requests_memory_bytes | Bytes of memory requested by a container |
| kube_pod_info | Information about a pod |

For additional information, see Kubernetes kube-state-metrics and the NVIDIA DCGM exporter.
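
Because these exported metrics share the node label with Run:ai metrics, they can be combined in a single PromQL expression. The sketch below relates Run:ai CPU requests to node capacity; it is illustrative only, and the resource label on kube_node_status_capacity follows recent kube-state-metrics versions and may differ in yours.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder

# Illustrative: fraction of each node's CPU capacity that Run:ai
# workloads have requested. Joins the Run:ai metric with the
# kube-state-metrics capacity metric on the shared 'node' label.
promql = (
    "runai_node_cpu_requested_cores"
    " / on(node) kube_node_status_capacity{resource='cpu'}"
)

result = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": promql},
    timeout=10,
).json()["data"]["result"]
print(result)
```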

Changed metrics and API mapping

Starting in cluster version 2.17, some metric names have changed. In addition, some Run:ai metrics are available through API endpoints. Using the API endpoints is more efficient and provides an easier way to retrieve metrics in any application. The following table lists the metrics that changed (see the sketch after the table for an example API call).

| Metric name in version 2.16 | Change in version 2.17 | 2.17 API endpoint |
|---|---|---|
| runai_active_job_cpu_requested_cores | Also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_REQUEST_CORES` metricType |
| runai_active_job_memory_requested_bytes | Also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_MEMORY_REQUEST_BYTES` metricType |
| runai_cluster_cpu_utilization | Also available via API | `https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics` with `CPU_UTILIZATION` metricType |
| runai_cluster_memory_utilization | Also available via API | `https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics` with `CPU_MEMORY_UTILIZATION` metricType |
| runai_gpu_utilization_non_fractional_jobs | No longer available | |
| runai_allocated_gpu_count_per_workload | Labels changed | |
| runai_gpu_utilization_per_pod_per_gpu | Also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics` with `GPU_UTILIZATION_PER_GPU` metricType |
| runai_gpu_utilization_per_workload | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `GPU_UTILIZATION` metricType |
| runai_job_image | No longer available | |
| runai_job_requested_gpu_memory | Renamed to runai_requested_gpu_memory_mb_per_workload with different labels; also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `GPU_MEMORY_REQUEST_BYTES` metricType |
| runai_job_requested_gpus | Renamed to runai_requested_gpus_per_workload with different labels | |
| runai_job_total_runtime | Renamed to runai_run_time_seconds_per_workload with different labels | |
| runai_job_total_wait_time | Renamed to runai_wait_time_seconds_per_workload with different labels | |
| runai_gpu_memory_used_mebibytes_per_workload | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `GPU_MEMORY_USAGE_BYTES` metricType |
| runai_gpu_memory_used_mebibytes_per_pod_per_gpu | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics` with `GPU_MEMORY_USAGE_BYTES_PER_GPU` metricType |
| runai_node_gpu_used_memory_bytes | Renamed to runai_gpu_memory_used_mebibytes_per_node with changed units | |
| runai_node_total_memory_bytes | Renamed to runai_gpu_memory_total_mebibytes_per_node with changed units | |
| runai_project_info | Labels changed | |
| runai_active_job_cpu_limits | Renamed to runai_cpu_limits_per_active_workload; also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_LIMIT_CORES` metricType |
| runai_job_cpu_usage | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_USAGE_CORES` metricType |
| runai_active_job_memory_limits | Renamed to runai_memory_limits_per_active_workload; also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_MEMORY_LIMIT_BYTES` metricType |
| runai_running_job_memory_requested_bytes | Removed; it duplicated runai_active_job_memory_requested_bytes (see above) | |
| runai_job_memory_used_bytes | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_MEMORY_USAGE_BYTES` metricType |
| runai_job_swap_memory_used_bytes | No longer available | |
| runai_gpu_count_per_node | Labels added | |
| runai_last_gpu_utilization_time_per_workload | Labels changed | |
| runai_gpu_idle_time_per_workload | Renamed to runai_gpu_idle_seconds_per_workload with different labels | |
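
As a minimal sketch, the endpoints in the table can be queried with any HTTP client. The endpoint URL and metricType value below come from the table; the bearer-token placeholder and the start/end time-range query parameters are assumptions based on common REST metric APIs, so consult the Run:ai API reference for the exact parameter names.

```python
import datetime
import requests

BASE_URL = "https://app.run.ai"   # endpoint base from the table above
WORKLOAD_ID = "<workload-uuid>"   # placeholder: the workload's identifier
API_TOKEN = "<bearer-token>"      # placeholder: obtain via your Run:ai credentials

# Assumed time-range parameters: last hour, ISO 8601 timestamps.
end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=1)

resp = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/metrics",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={
        "metricType": "GPU_UTILIZATION",  # a metricType from the table
        "start": start.isoformat(),
        "end": end.isoformat(),
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```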

Create custom dashboards

To create custom dashboards based on the above metrics, please contact Run:ai customer support.