Metrics
What are Metrics
Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics involve utilization, allocation, time measurements and so on. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.
This document details the structure and meaning of the metrics emitted by Run:ai, so that customers can create custom dashboards or integrate metric data into other monitoring systems.
Run:ai uses Prometheus for collecting and querying metrics.
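Because the metrics are served by Prometheus, they can also be read programmatically through the standard Prometheus HTTP API. The snippet below is a minimal sketch; the Prometheus endpoint URL is an assumption and should be replaced with the address of the Prometheus instance that scrapes the Run:ai cluster.

```python
import requests

# Assumed address of the Prometheus server that scrapes the Run:ai cluster;
# replace with your actual endpoint.
PROMETHEUS_URL = "http://prometheus.example.local:9090"

def query(promql: str):
    """Run an instant PromQL query against the Prometheus HTTP API."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Current CPU utilization of the entire cluster (a value between 0 and 1).
for sample in query("runai_cluster_cpu_utilization"):
    print(sample["metric"].get("clusterId"), sample["value"][1])
```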
Published Run:ai Metrics
Following is the list of published Run:ai metrics:
Metric name | Labels | Measurement | Description |
---|---|---|---|
runai_active_job_cpu_requested_cores | {clusterId, job_name, job_uuid} | CPU Cores | Job's requested CPU cores |
runai_active_job_memory_requested_bytes | {clusterId, job_name, job_uuid} | Bytes | Job's requested CPU memory |
runai_cluster_cpu_utilization | {clusterId} | 0 to 1 | CPU utilization of the entire cluster |
runai_cluster_memory_used_bytes | {clusterId} | Bytes | Used CPU memory of the entire cluster |
runai_cluster_memory_utilization | {clusterId} | 0 to 1 | CPU memory utilization of the entire cluster |
runai_allocated_gpu_count_per_gpu | {gpu, clusterId, node} | 0/1 | Whether the GPU is hosting a pod (1) or not (0) |
runai_last_gpu_utilization_time_per_gpu | {gpu, clusterId, node} | Unix time | Last time GPU was not idle |
runai_gpu_utilization_non_fractional_jobs | {job_uuid, job_name, clusterId, gpu, node} | 0 to 100 | GPU utilization, for jobs running on whole (non-fractional) GPUs |
runai_gpu_utilization_per_pod_per_gpu | {pod_namespace, pod_name, clusterId, gpu, node} | 0 to 100 | GPU utilization per pod, per GPU on which the pod is running |
runai_allocated_gpu_count_per_workload | {job_type, job_uuid, job_name, clusterId, project} | Double | GPUs allocated to Jobs |
runai_gpu_utilization_per_workload | {job_uuid, clusterId, job_name, project} | 0 to 100 | Average GPU utilization per job |
runai_job_image | {image, job_uuid, job_name, clusterId} | N/A | Image name per job |
runai_job_requested_gpu_memory | {job_type, job_uuid, job_name, clusterId, project} | Bytes | Requested GPU memory per job (0 if not specified by the user) |
runai_job_requested_gpus | {job_type, job_uuid, job_name, clusterId, project} | Double | Number of requested GPUs per job |
runai_job_total_runtime | {clusterId, job_uuid} | Seconds | Total run time per job |
runai_job_total_wait_time | {clusterId, job_uuid} | Seconds | Total wait time per job |
runai_gpu_memory_used_mebibytes_per_workload | {clusterId, job_uuid} | MiB | Used GPU memory per job |
runai_gpu_memory_used_mebibytes_per_pod_per_gpu | {job_uuid, job_name, clusterId, gpu, node} | MiB | Used GPU memory per job, per GPU on which the job is running |
runai_node_cpu_requested_cores | {clusterId, node} | Double | Sum of the requested CPU cores of all jobs running in a node |
runai_node_cpu_utilization | {clusterId, node} | 0 to 1 | CPU utilization per node |
runai_node_gpu_used_memory_bytes | {clusterId, node} | Bytes | Used GPU memory per node |
runai_node_memory_utilization | {clusterId, node} | 0 to 1 | CPU memory utilization per node |
runai_node_requested_memory_bytes | {clusterId, node} | Bytes | Sum of the requested CPU memory of all jobs running in a node |
runai_node_total_memory_bytes | {clusterId, node} | Bytes | Total CPU memory per node |
runai_node_used_memory_bytes | {clusterId, node} | Bytes | Used CPU memory per node |
runai_project_guaranteed_gpus | {clusterId, project} | Double | Guaranteed GPU quota per project |
runai_project_info | {memory_quota, cpu_quota, gpu_guaranteed_quota, clusterId, project, department_name} | N/A | Information on CPU, CPU memory, GPU quota per project |
runai_active_job_cpu_limits | {clusterId, job_name, job_uuid} | Double | Job's CPU limit (in number of cores) |
runai_active_job_cpu_requested_cores | {clusterId, job_name, job_uuid} | Double | Job's requested CPU cores |
runai_job_cpu_usage | {job_uuid, clusterId, job_name, project} | Double | Job's CPU usage (in number of cores) |
runai_active_job_memory_limits | {clusterId, job_name, job_uuid} | Bytes | Job's CPU memory limit |
runai_running_job_memory_requested_bytes | {clusterId, job_name, job_uuid} | Bytes | Job's requested CPU memory |
runai_job_memory_used_bytes | {job_uuid, clusterId, job_name, project} | Bytes | Job's used CPU memory |
runai_mig_mode_gpu_count | {clusterId, node} | Double | Number of GPUs on MIG nodes |
runai_job_swap_memory_used_bytes | {clusterId, job_uuid, job_name, project, node} | Bytes | Used Swap CPU memory for the job |
runai_deployment_request_rate | {clusterId, namespace_name, deployment_name} | Number | Rate of received HTTP requests per second |
runai_deployment_request_latencies | {clusterId, namespace_name, deployment_name, le} | Number | Histogram of response time (bins are in milliseconds) |
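Since runai_deployment_request_latencies carries the Prometheus histogram bucket label le, a response-time percentile can be derived with histogram_quantile. The sketch below is illustrative only; the Prometheus endpoint URL is an assumption, and the result is in the histogram's bucket units (milliseconds, per the table above).

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.local:9090"  # assumed endpoint

# 95th-percentile response time per deployment over the last 5 minutes,
# computed from the histogram buckets (label `le`) of
# runai_deployment_request_latencies.
PROMQL = (
    "histogram_quantile(0.95, "
    "sum by (le, deployment_name) "
    "(rate(runai_deployment_request_latencies[5m])))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": PROMQL})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(sample["metric"].get("deployment_name"), sample["value"][1])
```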
Additional metrics for version 2.9 and above
Metric name | Labels | Measurement | Description |
---|---|---|---|
runai_gpu_utilization_per_gpu | {clusterId, gpu, node} | % | GPU utilization per GPU |
runai_gpu_utilization_per_node | {clusterId, node} | % | GPU utilization per node |
runai_gpu_memory_used_mebibytes_per_gpu | {clusterId, gpu, node} | MiB | Used GPU memory per GPU |
runai_gpu_memory_used_mebibytes_per_node | {clusterId, node} | MiB | Used GPU memory per node |
runai_gpu_memory_total_mebibytes_per_gpu | {clusterId, gpu, node} | MiB | Total GPU memory per GPU |
runai_gpu_memory_total_mebibytes_per_node | {clusterId, node} | MiB | Total GPU memory per node |
runai_gpu_count_per_node | {clusterId, node} | Number | Number of GPUs per node |
runai_allocated_gpu_count_per_workload | {clusterId, job_name, job_uuid, job_type, user} | Double | Number of allocated GPUs per workload |
runai_allocated_gpu_count_per_project | {clusterId, project} | Double | Number of allocated GPUs per project |
runai_gpu_memory_used_mebibytes_per_pod_per_gpu | {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} | MiB | Used GPU memory per pod, per GPU |
runai_gpu_memory_used_mebibytes_per_workload | {clusterId, job_name, job_uuid, job_type, user} | MiB | Used GPU memory per workload |
runai_gpu_utilization_per_pod_per_gpu | {clusterId, pod_name, pod_uuid, pod_namespace, node, gpu} | % | GPU utilization per pod, per GPU |
runai_gpu_utilization_per_workload | {clusterId, job_name, job_uuid, job_type, user} | % | GPU utilization per workload |
runai_gpu_utilization_per_project | {clusterId, project} | % | GPU utilization per project |
runai_last_gpu_utilization_time_per_workload | {clusterId, job_name, job_uuid, job_type, user} | Seconds (Unix timestamp) | Last time (Unix timestamp) the workload utilized any of its allocated GPUs |
runai_gpu_idle_time_per_workload | {clusterId, job_name, job_uuid, job_type, user} | Seconds | Seconds passed since the workload last utilized any of its allocated GPUs |
runai_allocated_gpu_count_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Double | Number of allocated GPUs per pod |
runai_allocated_gpu_count_per_node | {clusterId, node} | Double | Number of allocated GPUs per node |
runai_allocated_millicpus_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Integer | Number of allocated millicpus per pod |
runai_allocated_memory_per_pod | {clusterId, pod_name, pod_uuid, pod_namespace, node} | Bytes | Allocated memory per pod |
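The per-node metrics above can also be combined in PromQL to derive values such as the fraction of each node's GPUs that is currently allocated. A minimal sketch, again assuming a reachable Prometheus endpoint:

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.local:9090"  # assumed endpoint

# Fraction of each node's GPUs that is currently allocated, derived from
# two of the per-node metrics listed above (matched on clusterId and node).
PROMQL = "runai_allocated_gpu_count_per_node / runai_gpu_count_per_node"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": PROMQL})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(sample["metric"].get("node"), sample["value"][1])
```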
Following is a list of labels appearing in Run:ai metrics:
Label | Description |
---|---|
clusterId | Cluster Identifier |
department_name | Name of Run:ai Department |
cpu_quota | CPU limit per project |
gpu | GPU index |
gpu_guaranteed_quota | Guaranteed GPU quota per project |
image | Name of Docker image |
namespace_name | Namespace |
deployment_name | Deployment name |
job_name | Job name |
job_type | Job type: training, interactive or inference |
job_uuid | Job identifier |
pod_name | Pod name. A Job can contain many pods. |
pod_namespace | Pod namespace |
memory_quota | CPU memory limit per project |
node | Node name |
project | Name of Run:ai Project |
status | Job status, such as Running or Pending |
user | User identifier |
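These labels can be used both to filter metrics (for example, selecting a single project) and to group them in PromQL. The sketch below sums allocated GPUs by project; it assumes a reachable Prometheus endpoint and the label set shown in the first table above (in version 2.9 and above, per-project allocation is also available directly as runai_allocated_gpu_count_per_project).

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.local:9090"  # assumed endpoint

# Total GPUs currently allocated in each project, grouped via the `project` label.
# (In version 2.9 and above this is also exposed directly as
# runai_allocated_gpu_count_per_project.)
PROMQL = "sum by (project) (runai_allocated_gpu_count_per_workload)"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": PROMQL})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(sample["metric"].get("project"), sample["value"][1])
```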
Other Metrics
Run:ai exports other metrics emitted by NVIDIA and Kubernetes packages, as follows:
Metric name | Description |
---|---|
runai_gpu_utilization_per_gpu | GPU utilization |
kube_node_status_capacity | The capacity for different resources of a node |
kube_node_status_condition | The condition of a cluster node |
kube_pod_container_resource_requests_cpu_cores | The number of CPU cores requested by a container |
kube_pod_container_resource_requests_memory_bytes | Bytes of memory requested by a container |
kube_pod_info | Information about a pod |
For additional information, see Kubernetes kube-state-metrics and the NVIDIA DCGM exporter.
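Because these series are scraped into the same Prometheus instance, they can be queried alongside the Run:ai metrics. For example, the number of cluster nodes currently in the Ready condition can be read from kube_node_status_condition; the sketch below again assumes a reachable Prometheus endpoint.

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.local:9090"  # assumed endpoint

# Number of nodes whose Ready condition is currently true,
# taken from the kube-state-metrics series listed above.
PROMQL = 'sum(kube_node_status_condition{condition="Ready", status="true"})'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": PROMQL})
resp.raise_for_status()
result = resp.json()["data"]["result"]
print("Ready nodes:", result[0]["value"][1] if result else 0)
```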
Create custom dashboards
To create custom dashboards based on the above metrics, please contact Run:ai customer support.