Dynamic GPU Fractions
Introduction¶
Many AI workloads use GPU resources intermittently, and sometimes do not use them at all. These workloads need GPU resources while they run AI applications or debug a model in development. Other workloads, such as inference, may use GPU resources at a lower utilization rate than requested and may suddenly need higher guaranteed resources at peak utilization times.
This pattern of resource request vs. actual resource utilization causes lower utilization of GPUs. This mainly happens if there are many workloads requesting resources to match their peak demand, even though the majority of the time they operate far below that peak.
Run:ai introduced Dynamic GPU fractions in v2.15 to address this gap between resource requests and actual resource utilization, enabling users to optimize GPU resource usage.
Dynamic GPU fractions is part of Run:ai's core capabilities to enable workloads to optimize the use of GPU resources. This works by providing the ability to specify and consume GPU memory and compute resources dynamically by leveraging Kubernetes Request and Limit notations.
Dynamic GPU fractions allow a workload to request a guaranteed fraction of GPU memory or GPU compute resource (similar to a Kubernetes request), and at the same time also request the ability to grow beyond that guaranteed request up to a specific limit (similar to a Kubernetes limit), if the resources are available.
For example, with Dynamic GPU Fractions, a user can specify a workload with a GPU fraction Request of 0.25 GPU, and add the parameter `gpu-fraction-limit` of up to 0.80 GPU. The cluster/node-pool scheduler schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy between 0 and 0.80 of the GPU memory (based on the parameter `gpu-fraction-limit`), where only 0.25 of the GPU memory is guaranteed to that workload. The rest of the memory (from 0.25 to 0.80) is “loaned” to the workload, as long as it is not needed by other workloads.
Run:ai automatically manages the state changes between `Request` and `Limit`, as well as the reverse (when the balance needs to be "returned"), updating the metrics and the workloads' states and graphs.
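The 0.25 / 0.80 example above can be made concrete with a little arithmetic. The following is a minimal sketch, not Run:ai code: `FractionSpec` and `memory_split` are hypothetical names, and the 80 GB GPU size is an assumption chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FractionSpec:
    """Hypothetical Request/Limit pair, as fractions of one GPU's memory."""
    request: float  # guaranteed share, e.g. 0.25
    limit: float    # burst ceiling, e.g. 0.80

    def __post_init__(self):
        # The Limit may never be smaller than the Request.
        if not 0 < self.request <= self.limit <= 1:
            raise ValueError("need 0 < request <= limit <= 1")


def memory_split(spec: FractionSpec, gpu_memory_gb: float) -> dict:
    """Return the guaranteed, loanable, and maximum memory in GB."""
    guaranteed = spec.request * gpu_memory_gb
    maximum = spec.limit * gpu_memory_gb
    return {
        "guaranteed_gb": guaranteed,           # always available to the workload
        "loanable_gb": maximum - guaranteed,   # usable only while nobody else needs it
        "max_gb": maximum,
    }


# Assumed 80 GB GPU: 20 GB guaranteed, up to 44 GB more on loan.
split = memory_split(FractionSpec(request=0.25, limit=0.80), gpu_memory_gb=80)
print(split)
```

Only the guaranteed part counts toward what the scheduler must find on a node; the loanable part is opportunistic and can be reclaimed.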
Setting Fractional GPU Memory Limit¶
With the fractional GPU memory limit, users can submit workloads using GPU fraction `Request` and `Limit`.
You can either:
- Use a GPU Fraction parameter (the `gpu-fraction` annotation), or
- Use an absolute GPU Memory parameter (the `gpu-memory` annotation)
When setting a GPU memory limit, either as a GPU fraction or as a GPU memory size, the `Limit` must be equal to or greater than the GPU fraction or memory request.
Both GPU fraction and GPU memory are translated into the actual requested memory size of the Request (guaranteed resources) and the Limit (burstable resources).
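The translation and the Limit-versus-Request rule described above can be sketched as follows. This is an illustrative helper under stated assumptions, not the Run:ai implementation: `resolve_gpu_memory` and its parameter names are hypothetical, sizes are in MB, and defaulting the Limit to the Request when no Limit is given is an assumption made for the sketch.

```python
def resolve_gpu_memory(gpu_total_mb, *, fraction=None, fraction_limit=None,
                       memory_mb=None, memory_limit_mb=None):
    """Translate a fraction pair or a memory pair into (request_mb, limit_mb)."""
    uses_fraction = fraction is not None
    uses_memory = memory_mb is not None
    # Exactly one style may be used: fraction-based or absolute memory.
    if uses_fraction == uses_memory:
        raise ValueError("specify exactly one of fraction or memory_mb")

    if uses_fraction:
        request = fraction * gpu_total_mb
        # Assumption: a missing Limit means "no bursting" (Limit == Request).
        limit = (fraction if fraction_limit is None else fraction_limit) * gpu_total_mb
    else:
        request = memory_mb
        limit = memory_mb if memory_limit_mb is None else memory_limit_mb

    if limit < request:
        raise ValueError("Limit must be equal to or greater than the Request")
    if limit > gpu_total_mb:
        raise ValueError("Limit cannot exceed the GPU's total memory")
    return request, limit


# Both styles resolve to concrete sizes on an assumed 40,000 MB GPU.
by_fraction = resolve_gpu_memory(40_000, fraction=0.25, fraction_limit=0.5)
by_memory = resolve_gpu_memory(40_000, memory_mb=5_000, memory_limit_mb=10_000)
print(by_fraction, by_memory)
```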
To guarantee fair quality of service between different workloads using the same GPU, Run:ai developed an extendable GPU `OOMKiller` (Out Of Memory Killer) component that guarantees the quality of service using Kubernetes semantics for resources Request and Limit.
The `OOMKiller` capability requires adding `CAP_KILL` capabilities to the Dynamic GPU fraction and to the Run:ai core scheduling module (toolkit daemon). This capability is disabled by default.
To change the state of Dynamic GPU Fraction in the cluster, edit the `runaiconfig` file and set:
To set the GPU memory limit per workload, add the `RUNAI_GPU_MEMORY_LIMIT` environment variable to the first container in the pod. This is the GPU-consuming container.
To use the `RUNAI_GPU_MEMORY_LIMIT` environment variable:
- Submit a workload YAML directly, and set the `RUNAI_GPU_MEMORY_LIMIT` environment variable.
- Create a policy, per Project or globally. For example, set all Interactive workloads of `Project=research_vision1` to always set the `RUNAI_GPU_MEMORY_LIMIT` environment variable to 1.
- Pass the environment variable through the CLI or the UI.
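As an illustration of the first option, the sketch below patches a pod manifest (held as a plain Python dictionary) so that the first container carries the environment variable. The helper `with_gpu_memory_limit`, the pod name, the image names, and the annotation value are all hypothetical; only the variable name, the `gpu-fraction` annotation key, and the "first container" rule come from the text above.

```python
import copy

ENV_NAME = "RUNAI_GPU_MEMORY_LIMIT"


def with_gpu_memory_limit(pod: dict, limit: str) -> dict:
    """Return a copy of a pod manifest with the limit set on the first container."""
    patched = copy.deepcopy(pod)  # leave the caller's manifest untouched
    first = patched["spec"]["containers"][0]  # the GPU-consuming container
    # Drop any previous value so the variable appears exactly once.
    env = [e for e in first.get("env", []) if e.get("name") != ENV_NAME]
    env.append({"name": ENV_NAME, "value": str(limit)})  # env values are strings
    first["env"] = env
    return patched


# Hypothetical two-container pod; only the first container is patched.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo", "annotations": {"gpu-fraction": "0.5"}},
    "spec": {"containers": [
        {"name": "main", "image": "example/image:latest"},
        {"name": "sidecar", "image": "example/sidecar:latest"},
    ]},
}
patched = with_gpu_memory_limit(pod, "1")
print(patched["spec"]["containers"][0]["env"])
```

The dictionary can then be serialized to YAML with whichever tool you already use to submit workloads.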
The supported values depend on the label used. You can use them in either the UI or the CLI. Use only one of the variables in the following table (they cannot be mixed):
| Variable | Input format |
|---|---|
| `gpu-fraction` | A fraction value (for example: 0.25, 0.75). |
| `gpu-memory` | A Kubernetes resource quantity, which must be larger than the `gpu-memory` request. For example: 500000000, 2500M, 4G. NOTE: The `gpu-memory` label values are always in MB, unlike the environment variable. |
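To compare values such as 500000000, 2500M, and 4G, it helps to normalize them to bytes. The following sketch handles a common subset of the Kubernetes quantity notation (plain integers plus the decimal suffixes k, M, G, T and the binary suffixes Ki, Mi, Gi, Ti); it does not cover the full grammar, and `quantity_to_bytes` is a hypothetical helper name.

```python
from decimal import Decimal

# Decimal (power-of-10) and binary (power-of-2) suffixes used by Kubernetes quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '2500M' or '4G' to a number of bytes."""
    text = quantity.strip()
    # Try two-letter binary suffixes before one-letter decimal ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return int(Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(Decimal(text))  # no suffix: the value is already in bytes


sizes = {q: quantity_to_bytes(q) for q in ("500000000", "2500M", "4G")}
print(sizes)
```

`Decimal` is used instead of `float` so that fractional inputs such as `1.5Gi` convert without rounding surprises.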
Compute Resources UI with Dynamic Fractions support¶
To enable the UI elements for Dynamic Fractions, go to Settings, General, then open the Resources pane and toggle GPU Resource Optimization. This enables all the UI features related to GPU Resource Optimization for the whole tenant. Other per-cluster or per-node-pool configurations must also be set in order to use the capabilities of ‘GPU Resource Optimization’. See the documentation for each of these features. Once the ‘GPU Resource Optimization’ feature is enabled, you can create Compute Resources with the GPU Portion (Fraction) Limit and GPU Memory Limit. In addition, you can view each workload's utilization vs. its Request and Limit parameters in the Metrics pane.
Note
When setting a workload with Dynamic Fractions (for example, when using it with GPU Request or GPU memory Limits), you effectively make the workload burstable. This means it can use memory that is not guaranteed for that workload and is susceptible to an ‘OOM Kill’ signal if the actual owner of that memory requires it back. This applies to non-preemptive workloads as well. For that reason, it is recommended that you use Dynamic Fractions with Interactive workloads running Notebooks. Notebook pods are not evicted when their GPU process is OOM Killed. This behavior is the same as standard Kubernetes burstable CPU workloads.
Multi-GPU Dynamic Fractions¶
Run:ai also supports workload submission using multi-GPU dynamic fractions. Multi-GPU dynamic fractions work similarly to dynamic fractions on a single-GPU workload. However, instead of a single GPU device, the Run:ai Scheduler allocates the same dynamic fraction pair (Request and Limit) on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, but may want to burst out and consume up to the full GPU memory, they can allocate 8×40GB with multi-GPU fractions and a limit of 80GB (e.g. H100 GPU) instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node. This is useful during model development, where memory requirements are usually lower due to experimentation with smaller models or configurations.
This approach significantly improves GPU utilization and availability, enabling more precise and often smaller quota requirements for the end user. Time sharing, where a single GPU can serve multiple workloads with dynamic fractions, remains unchanged; only now, each GPU serves multiple workloads that each use multiple GPUs.
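The 8×40GB example can be checked with a short calculation. This is a minimal sketch of the arithmetic only; `multi_gpu_plan` is a hypothetical helper, and it assumes every GPU on the node has the same memory size.

```python
def multi_gpu_plan(gpus: int, request_gb: float, limit_gb: float,
                   gpu_memory_gb: float) -> dict:
    """Summarize what a multi-GPU Request/Limit pair reserves and leaves free."""
    if gpus < 1:
        raise ValueError("need at least one GPU")
    if not 0 < request_gb <= limit_gb <= gpu_memory_gb:
        raise ValueError("need 0 < request <= limit <= GPU memory")
    free_per_gpu = gpu_memory_gb - request_gb
    return {
        "guaranteed_total_gb": gpus * request_gb,  # reserved across all devices
        "burst_total_gb": gpus * limit_gb,         # ceiling if every device bursts
        "free_per_gpu_gb": free_per_gpu,           # left for other workloads
        "free_total_gb": gpus * free_per_gpu,
    }


# 8 GPUs, 40 GB requested and 80 GB limit on each 80 GB device.
plan = multi_gpu_plan(gpus=8, request_gb=40, limit_gb=80, gpu_memory_gb=80)
print(plan)
```

With these numbers, 320 GB is guaranteed to the workload and 40 GB per GPU (320 GB in total) remains available to other workloads on the node.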
Configuring Multi-GPU Dynamic Fractions¶
You can configure multi-GPU dynamic fractions as follows:
- Using the compute resources asset, you can define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB), both with Request and Limit parameters:
- You can submit a workload with dynamic fractions using the CLI V2: