Metrics

What are Metrics

Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics involve utilization, allocation, time measurements and so on. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.

The purpose of this document is to detail the structure and purpose of metrics emitted by Run:ai to enable customers to create custom dashboards or integrate metric data into other monitoring systems.

Run:ai uses Prometheus for collecting and querying metrics.
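
As a minimal illustration (not an official integration), the snippet below pulls one of the metrics listed in this document through the standard Prometheus HTTP API. The Prometheus address is a placeholder; point it at the Prometheus instance that scrapes your Run:ai cluster and add any authentication your environment requires.

    # Minimal sketch: query a Run:ai metric from Prometheus over its HTTP API.
    # The Prometheus URL below is a placeholder; substitute the address of the
    # Prometheus instance that scrapes your Run:ai cluster.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.local:9090"  # assumption: adjust to your setup

    def query(promql: str):
        """Run an instant PromQL query and return the list of result series."""
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": promql},
            timeout=10,
        )
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") != "success":
            raise RuntimeError(f"Prometheus query failed: {body}")
        return body["data"]["result"]

    # Cluster-wide CPU utilization (0 to 1), labeled by clusterId.
    for series in query("runai_cluster_cpu_utilization"):
        print(series["metric"].get("clusterId"), series["value"][1])
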

Published Run:ai Metrics

Following is the list of published Run:ai metrics:

Metric name Labels Measurement Description
runai_active_job_cpu_requested_cores {clusterId, job_name, job_uuid} CPU Cores Job's requested CPU cores
runai_active_job_memory_requested_bytes {clusterId, job_name, job_uuid} Bytes Job's requested CPU memory
runai_cluster_cpu_utilization  {clusterId} 0 to 1 CPU utilization of the entire cluster
runai_cluster_memory_used_bytes  {clusterId} Bytes Used CPU memory of the entire cluster
runai_cluster_memory_utilization  {clusterId} 0 to 1 CPU memory utilization of the entire cluster
runai_allocated_gpu_count_per_gpu  {gpu, clusterId, node} 0 or 1 Whether the GPU is hosting a pod (1) or not (0)
runai_last_gpu_utilization_time_per_gpu  {gpu, clusterId, node} Unix time Last time GPU was not idle
runai_gpu_utilization_non_fractional_jobs  {job_uuid, job_name, clusterId, gpu, node} 0 to 100 Utilization per GPU for jobs running on a full GPU
runai_gpu_utilization_per_pod_per_gpu  {pod_namespace, pod_name, clusterId, gpu, node} 0 to 100 GPU utilization per GPU
runai_allocated_gpu_count_per_workload  {job_type, job_uuid, job_name, clusterId, project} Double Number of GPUs allocated to the job
runai_gpu_utilization_per_workload  {job_uuid, clusterId, job_name, project} 0 to 100 Average GPU utilization per job
runai_job_image  {image, job_uuid, job_name, clusterId} N/A Image name per job
runai_job_requested_gpu_memory  {job_type, job_uuid, job_name, clusterId, project} Bytes Requested GPU memory per job (0 if not specified by the user)
runai_job_requested_gpus  {job_type, job_uuid, job_name, clusterId, project} Double Number of requested GPUs per job
runai_job_total_runtime  {clusterId, job_uuid} Seconds Total run time per job
runai_job_total_wait_time  {clusterId, job_uuid} Seconds Total wait time per job
runai_gpu_memory_used_mebibytes_per_workload  {clusterId, job_uuid} MiB Used GPU memory per job
runai_gpu_memory_used_mebibytes_per_pod_per_gpu  {job_uuid, job_name, clusterId, gpu, node} MiB Used GPU memory per job, per GPU on which the job is running
runai_node_cpu_requested_cores  {clusterId, node} Double Sum of the requested CPU cores of all jobs running in a node
runai_node_cpu_utilization  {clusterId, node} 0 to 1 CPU utilization per node
runai_node_gpu_used_memory_bytes  {clusterId, node} Bytes Used GPU memory per node
runai_node_memory_utilization  {clusterId, node} 0 to 1 CPU memory utilization per node
runai_node_requested_memory_bytes  {clusterId, node} Bytes Sum of the requested CPU memory of all jobs running in a node
runai_node_total_memory_bytes  {clusterId, node} Bytes Total CPU memory per node
runai_node_used_memory_bytes  {clusterId, node} Bytes Used CPU memory per node
runai_project_guaranteed_gpus  {clusterId, project} Double Guaranteed GPU quota per project
runai_project_info  {memory_quota, cpu_quota, gpu_guaranteed_quota, clusterId, project, department_name} N/A Information on CPU, CPU memory, GPU quota per project
runai_active_job_cpu_limits {clusterId, job_name, job_uuid} Double Job's CPU limit (in number of cores)
runai_active_job_cpu_requested_cores  {clusterId, job_name, job_uuid} Double Job's requested CPU cores
runai_job_cpu_usage  {job_uuid, clusterId, job_name, project} Double Job's CPU usage (in number of cores)
runai_active_job_memory_limits  {clusterId, job_name, job_uuid} Bytes Job's CPU memory limit
runai_running_job_memory_requested_bytes  {clusterId, job_name, job_uuid} Bytes Job's requested CPU memory
runai_job_memory_used_bytes  {job_uuid, clusterId, job_name, project} Bytes Job's used CPU memory
runai_mig_mode_gpu_count  {clusterId, node} Double Number of GPUs on MIG nodes
runai_job_swap_memory_used_bytes {clusterId, job_uuid, job_name, project, node} Bytes Used Swap CPU memory for the job
runai_deployment_request_rate {clusterId, namespace_name, deployment_name} Number Rate of received HTTP requests per second
runai_deployment_request_latencies {clusterId, namespace_name, deployment_name, le} Number Histogram of response time (bins are in milliseconds)
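
To illustrate how the labels above can be combined in PromQL, here are a few example aggregations, reusing the query() helper sketched earlier. These are sketches; the exact set of metrics available depends on your cluster version.

    # Example aggregations over the metrics listed above (a sketch).
    # Average GPU utilization (0 to 100) per project:
    util_per_project = query("avg by (project) (runai_gpu_utilization_per_workload)")
    # Total GPUs currently allocated per project:
    gpus_per_project = query("sum by (project) (runai_allocated_gpu_count_per_workload)")
    # GPUs hosting at least one pod on each node
    # (runai_allocated_gpu_count_per_gpu is 0 or 1 per GPU):
    busy_gpus_per_node = query("sum by (node) (runai_allocated_gpu_count_per_gpu)")
    for series in util_per_project:
        print(series["metric"].get("project"), series["value"][1])
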

Additional metrics for version 2.9 and above

Metric name Labels Measurement Description
runai_gpu_utilization_per_gpu {clusterId, gpu, node} % GPU Utilization per GPU
runai_gpu_utilization_per_node {clusterId, node} % GPU Utilization per Node
runai_gpu_memory_used_mebibytes_per_gpu {clusterId, gpu, node} MiB Used GPU memory per GPU
runai_gpu_memory_used_mebibytes_per_node {clusterId, node} MiB Used GPU memory per Node
runai_gpu_memory_total_mebibytes_per_gpu {clusterId, gpu, node} MiB Total GPU memory per GPU
runai_gpu_memory_total_mebibytes_per_node {clusterId, node} MiB Total GPU memory per Node
runai_gpu_count_per_node {clusterId, node} Number Number of GPUs per Node
runai_allocated_gpu_count_per_workload {clusterId, job_name, job_uuid, job_type, user} Double Number of allocated GPUs per Workload
runai_allocated_gpu_count_per_project {clusterId, project} Double Number of allocated GPUs per Project
runai_gpu_memory_used_mebibytes_per_pod_per_gpu {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} MiB Used GPU memory per Pod per GPU
runai_gpu_memory_used_mebibytes_per_workload {clusterId, job_name, job_uuid, job_type, user} MiB Used GPU Memory per Workload
runai_gpu_utilization_per_pod_per_gpu {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} % GPU Utilization per Pod per GPU
runai_gpu_utilization_per_workload {clusterId, job_name, job_uuid, job_type, user} % GPU Utilization per Workload
runai_gpu_utilization_per_project {clusterId, project} % GPU Utilization per Project
runai_last_gpu_utilization_time_per_workload {clusterId, job_name, job_uuid, job_type, user} Unix time Last time the workload utilized any of its allocated GPUs
runai_gpu_idle_time_per_workload {clusterId, job_name, job_uuid, job_type, user} Seconds Seconds elapsed since the workload last utilized any of its allocated GPUs
runai_allocated_gpu_count_per_pod {clusterId, pod_name, pod_uuid, pod_namespace, node} Double Number of allocated GPUs per Pod
runai_allocated_gpu_count_per_node {clusterId, node} Double Number of allocated GPUs per Node
runai_allocated_millicpus_per_pod {clusterId, pod_name, pod_uuid, pod_namespace, node} Integer Number of allocated millicpus per Pod
runai_allocated_memory_per_pod {clusterId, pod_name, pod_uuid, pod_namespace, node} Bytes Allocated memory per Pod
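
On version 2.9 and above, the per-project series can be compared against project quota directly. For example, the following sketch (again reusing the query() helper, and assuming both series exist in your Prometheus) computes the fraction of each project's guaranteed GPU quota that is currently allocated:

    # Fraction of each project's guaranteed GPU quota that is currently allocated.
    # Both series carry the clusterId and project labels, so they are matched
    # explicitly on those labels.
    quota_usage = query(
        "runai_allocated_gpu_count_per_project"
        " / on (clusterId, project) "
        "runai_project_guaranteed_gpus"
    )
    for series in quota_usage:
        print(series["metric"].get("project"), series["value"][1])
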

Following is a list of labels appearing in Run:ai metrics:

Label Description
clusterId Cluster Identifier
department_name Name of Run:ai Department
cpu_quota CPU limit per project
gpu GPU index
gpu_guaranteed_quota Guaranteed GPU quota per project
image Name of Docker image
namespace_name Namespace
deployment_name Deployment name
job_name Job name
job_type Job type: training, interactive or inference
job_uuid Job identifier
pod_name Pod name. A Job can contain many pods.
pod_namespace Pod namespace
memory_quota CPU memory limit per project
node Node name
project Name of Run:ai Project
status Job status: Running, Pending, etc.
user User identifier

Other Metrics

Run:ai exports other metrics emitted by NVIDIA and Kubernetes packages, as follows:

Metric name Description
runai_gpu_utilization_per_gpu GPU utilization
kube_node_status_capacity The capacity for different resources of a node
kube_node_status_condition The condition of a cluster node
kube_pod_container_resource_requests_cpu_cores The number of CPU cores requested by a container
kube_pod_container_resource_requests_memory_bytes Bytes of memory requested by a container
kube_pod_info Information about a pod

For additional information, see Kubernetes kube-state-metrics and NVIDIA dcgm exporter.
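
The Run:ai and kube-state-metrics series can also be combined in a single query. The sketch below divides the CPU cores requested on each node by the node's CPU capacity. Note that the resource="cpu" selector follows kube-state-metrics v2 naming and that label sets may differ between versions, so treat this as an illustration rather than a guaranteed query.

    # Requested CPU cores per node as a fraction of node CPU capacity (a sketch).
    # kube_node_status_capacity{resource="cpu"} follows kube-state-metrics v2 naming;
    # adjust the selector if your cluster runs an older version.
    cpu_pressure = query(
        'runai_node_cpu_requested_cores'
        ' / on (node) '
        'kube_node_status_capacity{resource="cpu"}'
    )
    for series in cpu_pressure:
        print(series["metric"].get("node"), series["value"][1])
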

Create custom dashboards

To create custom dashboards based on the above metrics, please contact Run:ai customer support.