
Metrics

What are Metrics

Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics include utilization, allocation, and time measurements. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.

This document details the structure and meaning of the metrics emitted by Run:ai, so that customers can create custom dashboards or integrate metric data into other monitoring systems.

Run:ai uses Prometheus for collecting and querying metrics.
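
This means any Prometheus client can read Run:ai metrics directly. Below is a minimal sketch in Python using the standard Prometheus HTTP API; the Prometheus URL is a placeholder for the endpoint of the Prometheus instance that scrapes your cluster.

```python
import requests

# Placeholder: replace with the URL of the Prometheus instance
# that scrapes your Run:ai cluster.
PROMETHEUS_URL = "http://prometheus.example.com:9090"

def query_instant(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Current CPU utilization (0 to 1) of the entire cluster.
for sample in query_instant("runai_cluster_cpu_utilization"):
    print(sample["metric"].get("clusterId"), sample["value"][1])
```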

Note

From cluster version 2.17 onward, Run:ai supports metrics via the Run:ai API, and direct metric queries (metrics queried directly from Prometheus) are deprecated.

Published Run:ai Metrics

The following table lists the published Run:ai metrics for this cluster version:

| Metric name | Labels | Measurement | Description |
|---|---|---|---|
| runai_active_job_cpu_requested_cores | {clusterId, job_name, job_uuid} | CPU cores | Workload's requested CPU cores |
| runai_active_job_memory_requested_bytes | {clusterId, job_name, job_uuid} | Bytes | Workload's requested CPU memory |
| runai_cluster_cpu_utilization | {clusterId} | 0 to 1 | CPU utilization of the entire cluster |
| runai_cluster_memory_used_bytes | {clusterId} | Bytes | Used CPU memory of the entire cluster |
| runai_cluster_memory_utilization | {clusterId} | 0 to 1 | CPU memory utilization of the entire cluster |
| runai_allocated_gpu_count_per_gpu | {gpu, clusterId, node} | 0/1 | Whether a GPU is hosting a pod |
| runai_last_gpu_utilization_time_per_gpu | {gpu, clusterId, node} | Unix time | Last time the GPU was not idle |
| runai_requested_gpu_memory_mb_per_workload | {clusterId, job_type, job_uuid, job_name, project, workload_id} | Megabytes | Requested GPU memory per workload (0 if not specified by the user) |
| runai_requested_gpus_per_workload | {clusterId, workload_type, workload_id, workload_name, project} | Double | Number of requested GPUs per workload |
| runai_run_time_seconds_per_workload | {clusterId, workload_id, workload_name} | Seconds | Total run time per workload |
| runai_wait_time_seconds_per_workload | {clusterId, workload_id, workload_name} | Seconds | Total wait time per workload |
| runai_node_cpu_requested_cores | {clusterId, node} | Double | Sum of the requested CPU cores of all workloads running on a node |
| runai_node_cpu_utilization | {clusterId, node} | 0 to 1 | CPU utilization per node |
| runai_node_memory_utilization | {clusterId, node} | 0 to 1 | CPU memory utilization per node |
| runai_node_requested_memory_bytes | {clusterId, node} | Bytes | Sum of the requested CPU memory of all workloads running on a node |
| runai_node_used_memory_bytes | {clusterId, node} | Bytes | Used CPU memory per node |
| runai_project_guaranteed_gpus | {clusterId, project} | Double | Guaranteed GPU quota per project |
| runai_project_info | {memory_quota, cpu_quota, gpu_guaranteed_quota, clusterId, project, department} | N/A | Information on CPU, CPU memory, and GPU quota per project |
| runai_queue_info | {memory_quota, cpu_quota, gpu_guaranteed_quota, clusterId, nodepool, queue_name, department} | N/A | Information on CPU, CPU memory, and GPU quota per project/department, per node pool |
| runai_cpu_limits_per_active_workload | {clusterId, job_name, job_uuid} | CPU cores | Workload's CPU limit (in number of cores) |
| runai_job_cpu_usage | {clusterId, workload_id, workload_name, project} | Double | Workload's CPU usage (in number of cores) |
| runai_memory_limits_per_active_workload | {clusterId, job_name, job_uuid} | Bytes | Workload's CPU memory limit |
| runai_job_memory_used_bytes | {clusterId, workload_id, workload_name, project} | Bytes | Workload's used CPU memory |
| runai_mig_mode_gpu_count | {clusterId, node} | Double | Number of GPUs on MIG nodes |
| runai_gpu_utilization_per_gpu | {clusterId, gpu, node} | % | GPU utilization per GPU |
| runai_gpu_utilization_per_node | {clusterId, node} | % | GPU utilization per node |
| runai_gpu_memory_used_mebibytes_per_gpu | {clusterId, gpu, node} | MiB | Used GPU memory per GPU |
| runai_gpu_memory_used_mebibytes_per_node | {clusterId, node} | MiB | Used GPU memory per node |
| runai_gpu_memory_total_mebibytes_per_gpu | {clusterId, gpu, node} | MiB | Total GPU memory per GPU |
| runai_gpu_memory_total_mebibytes_per_node | {clusterId, node} | MiB | Total GPU memory per node |
| runai_gpu_count_per_node | {clusterId, node, modelName, ready, schedulable} | Number | Number of GPUs per node |
| runai_allocated_gpu_count_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | Double | Number of allocated GPUs per workload |
| runai_allocated_gpu_count_per_project | {clusterId, project} | Double | Number of allocated GPUs per project |
| runai_gpu_memory_used_mebibytes_per_pod_per_gpu | {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} | MiB | Used GPU memory per pod, per GPU on which the workload is running |
| runai_gpu_memory_used_mebibytes_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | MiB | Used GPU memory per workload |
| runai_gpu_utilization_per_pod_per_gpu | {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} | % | GPU utilization per pod, per GPU |
| runai_gpu_utilization_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | % | Average GPU utilization per workload |
| runai_gpu_utilization_per_project | {clusterId, project} | % | Average GPU utilization per project |
| runai_last_gpu_utilization_time_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | Seconds (Unix timestamp) | The last time (Unix timestamp) that the workload utilized any of its allocated GPUs |
| runai_gpu_idle_seconds_per_workload | {clusterId, workload_id, workload_name, workload_type, user} | Seconds | Seconds passed since the workload last utilized any of its allocated GPUs |
| runai_allocated_gpu_count_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Double | Number of allocated GPUs per pod |
| runai_allocated_gpu_count_per_node | {clusterId, node} | Double | Number of allocated GPUs per node |
| runai_allocated_millicpus_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Integer | Number of allocated millicpus per pod |
| runai_allocated_memory_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Bytes | Allocated memory per pod |
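
The labels on these metrics can be used in PromQL to filter and aggregate. The following sketch shows a few illustrative queries built only from the metric and label names in the table above; the Prometheus URL is a placeholder.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder

# Illustrative PromQL expressions over the published Run:ai metrics.
QUERIES = {
    # Total GPUs currently allocated in each project.
    "allocated_gpus_by_project": "sum by (project) (runai_allocated_gpu_count_per_project)",
    # Average GPU utilization per workload type (training, interactive, inference).
    "gpu_util_by_workload_type": "avg by (workload_type) (runai_gpu_utilization_per_workload)",
    # Fraction of each node's GPU memory currently in use.
    "gpu_mem_fraction_by_node": (
        "runai_gpu_memory_used_mebibytes_per_node"
        " / runai_gpu_memory_total_mebibytes_per_node"
    ),
}

for name, promql in QUERIES.items():
    result = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    ).json()["data"]["result"]
    print(name, result)
```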

The following table lists the labels that appear in Run:ai metrics:

| Label | Description |
|---|---|
| clusterId | Cluster identifier |
| department | Name of the Run:ai department |
| cpu_quota | CPU limit per project |
| gpu | GPU index |
| gpu_guaranteed_quota | Guaranteed GPU quota per project |
| image | Name of the Docker image |
| namespace_name | Namespace |
| deployment_name | Deployment name |
| job_name | Job name |
| job_type | Job type: training, interactive, or inference |
| job_uuid | Job identifier |
| workload_name | Workload name |
| workload_type | Workload type: training, interactive, or inference |
| workload_uuid | Workload identifier |
| pod_name | Pod name. A workload can contain many pods. |
| pod_namespace | Pod namespace |
| memory_quota | CPU memory limit per project |
| node | Node name |
| project | Name of the Run:ai project |
| status | Workload status (Running, Pending, etc.); see the workload statuses documentation for the full list |
| user | User identifier |

Other Metrics

Run:ai exports other metrics emitted by NVIDIA and Kubernetes packages, as follows:

| Metric name | Description |
|---|---|
| runai_gpu_utilization_per_gpu | GPU utilization |
| kube_node_status_capacity | The capacity for different resources of a node |
| kube_node_status_condition | The condition of a cluster node |
| kube_pod_container_resource_requests_cpu_cores | The number of CPU cores requested by a container |
| kube_pod_container_resource_requests_memory_bytes | Bytes of memory requested by a container |
| kube_pod_info | Information about a pod |

For additional information, see Kubernetes kube-state-metrics and the NVIDIA DCGM exporter.
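
Because these exported metrics share the node label with Run:ai metrics, they can be combined in a single PromQL expression. The sketch below relates Run:ai CPU requests to node capacity; it is illustrative only, and the resource label on kube_node_status_capacity follows recent kube-state-metrics versions and may differ in yours.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder

# Illustrative: fraction of each node's CPU capacity that Run:ai
# workloads have requested. Joins the Run:ai metric with the
# kube-state-metrics capacity metric on the shared 'node' label.
promql = (
    "runai_node_cpu_requested_cores"
    " / on(node) kube_node_status_capacity{resource='cpu'}"
)

result = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": promql},
    timeout=10,
).json()["data"]["result"]
print(result)
```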

Changed metrics and API mapping

Starting in cluster version 2.17, some metric names have changed. In addition, some Run:ai metrics are available through API endpoints. Using the API endpoints is more efficient and provides an easier way to retrieve metrics in any application. The following table lists the metrics that changed (see the sketch after the table for an example API call).

| Metric name in version 2.16 | Change in version 2.17 | 2.17 API endpoint |
|---|---|---|
| runai_active_job_cpu_requested_cores | Also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_REQUEST_CORES` metricType |
| runai_active_job_memory_requested_bytes | Also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_MEMORY_REQUEST_BYTES` metricType |
| runai_cluster_cpu_utilization | Also available via API | `https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics` with `CPU_UTILIZATION` metricType |
| runai_cluster_memory_utilization | Also available via API | `https://app.run.ai/api/v2/clusters/{clusterUuid}/metrics` with `CPU_MEMORY_UTILIZATION` metricType |
| runai_gpu_utilization_non_fractional_jobs | No longer available | |
| runai_allocated_gpu_count_per_workload | Labels changed | |
| runai_gpu_utilization_per_pod_per_gpu | Also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics` with `GPU_UTILIZATION_PER_GPU` metricType |
| runai_gpu_utilization_per_workload | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `GPU_UTILIZATION` metricType |
| runai_job_image | No longer available | |
| runai_job_requested_gpu_memory | Renamed to runai_requested_gpu_memory_mb_per_workload with different labels; also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `GPU_MEMORY_REQUEST_BYTES` metricType |
| runai_job_requested_gpus | Renamed to runai_requested_gpus_per_workload with different labels | |
| runai_job_total_runtime | Renamed to runai_run_time_seconds_per_workload with different labels | |
| runai_job_total_wait_time | Renamed to runai_wait_time_seconds_per_workload with different labels | |
| runai_gpu_memory_used_mebibytes_per_workload | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `GPU_MEMORY_USAGE_BYTES` metricType |
| runai_gpu_memory_used_mebibytes_per_pod_per_gpu | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/pods/{podId}/metrics` with `GPU_MEMORY_USAGE_BYTES_PER_GPU` metricType |
| runai_node_gpu_used_memory_bytes | Renamed to runai_gpu_memory_used_mebibytes_per_node with changed units | |
| runai_node_total_memory_bytes | Renamed to runai_gpu_memory_total_mebibytes_per_node with changed units | |
| runai_project_info | Labels changed | |
| runai_active_job_cpu_limits | Renamed to runai_cpu_limits_per_active_workload; also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_LIMIT_CORES` metricType |
| runai_job_cpu_usage | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_USAGE_CORES` metricType |
| runai_active_job_memory_limits | Renamed to runai_memory_limits_per_active_workload; also available via API | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_MEMORY_LIMIT_BYTES` metricType |
| runai_running_job_memory_requested_bytes | Removed; it duplicated runai_active_job_memory_requested_bytes (see above) | |
| runai_job_memory_used_bytes | Also available via API; labels changed | `https://app.run.ai/api/v1/workloads/{workloadId}/metrics` with `CPU_MEMORY_USAGE_BYTES` metricType |
| runai_job_swap_memory_used_bytes | No longer available | |
| runai_gpu_count_per_node | Labels added | |
| runai_last_gpu_utilization_time_per_workload | Labels changed | |
| runai_gpu_idle_time_per_workload | Renamed to runai_gpu_idle_seconds_per_workload with different labels | |
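
As a minimal sketch, the endpoints in the table can be queried with any HTTP client. The endpoint URL and metricType value below come from the table; the bearer-token placeholder and the start/end time-range query parameters are assumptions based on common REST metric APIs, so consult the Run:ai API reference for the exact parameter names.

```python
import datetime
import requests

BASE_URL = "https://app.run.ai"   # endpoint base from the table above
WORKLOAD_ID = "<workload-uuid>"   # placeholder: the workload's identifier
API_TOKEN = "<bearer-token>"      # placeholder: obtain via your Run:ai credentials

# Assumed time-range parameters: last hour, ISO 8601 timestamps.
end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=1)

resp = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/metrics",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={
        "metricType": "GPU_UTILIZATION",  # a metricType from the table
        "start": start.isoformat(),
        "end": end.isoformat(),
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```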

Create custom dashboards

To create custom dashboards based on the above metrics, please contact Run:ai customer support.