Compute Resources
This article explains what compute resources are and how to create and use them.
Compute resources are one type of workload asset. A compute resource is a template that AI practitioners can use to simplify workload submission.
A compute resource asset is a preconfigured building block that encapsulates the compute requirements of a workload, including:
- GPU devices and GPU memory
- CPU memory and CPU compute
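As a rough mental model, the requirements a compute resource bundles together can be sketched as a small Python data structure. The class name, field names, and value formats below are hypothetical and chosen for illustration only; they are not the Run:ai API schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComputeResourceSketch:
    """Illustrative model only; names are hypothetical, not Run:ai's schema."""

    name: str
    gpu_devices_per_pod: int = 0               # whole number of physical GPUs per pod
    gpu_memory_request: Optional[str] = None   # GPU memory per device, e.g. "50%" or "8GB"
    cpu_cores_request: Optional[float] = None  # CPU compute per pod, in cores
    cpu_memory_request: Optional[str] = None   # CPU memory per pod, e.g. "4GB"


# A reusable template that asks for half of one GPU plus modest CPU resources.
half_gpu = ComputeResourceSketch(
    name="half-gpu",
    gpu_devices_per_pod=1,
    gpu_memory_request="50%",
    cpu_cores_request=2,
    cpu_memory_request="4GB",
)
print(half_gpu)
```

Defining the values once and reusing them across submissions is the point of the template: every workload that references the same compute resource gets the same per-pod requests.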
Compute resource table
The Compute resource table can be found under Compute resources in the Run:ai UI.
The Compute resource table provides a list of all the compute resources defined in the platform and allows you to manage them.
The Compute resource table consists of the following columns:
Column | Description |
---|---|
Compute resource | The name of the compute resource |
Description | A short description of the compute resource |
GPU devices request per pod | The number of requested physical devices per pod of the workload that uses this compute resource |
GPU memory request per device | The amount of GPU memory per requested device that is granted to each pod of the workload that uses this compute resource |
CPU memory request | The minimum amount of CPU memory per pod of the workload that uses this compute resource |
CPU memory limit | The maximum amount of CPU memory per pod of the workload that uses this compute resource |
CPU compute request | The minimum number of CPU cores per pod of the workload that uses this compute resource |
CPU compute limit | The maximum number of CPU cores per pod of the workload that uses this compute resource |
Scope | The scope of this compute resource within the organizational tree. Click the name of the scope to view the organizational tree diagram |
Workload(s) | The list of workloads associated with the compute resource |
Template(s) | The list of workload templates that use this compute resource |
Created by | The name of the user who created the compute resource |
Creation time | The timestamp for when the compute resource was created |
Cluster | The cluster that the compute resource is associated with |
Workloads associated with the compute resource
Click one of the values in the Workload(s) column to view the list of workloads and their parameters.
Column | Description |
---|---|
Workload | The workload that uses the compute resource |
Type | The workload type (Workspace, Training, or Inference) |
Status | Represents the workload lifecycle. See the full list of workload statuses |
Customizing the table view
- Filter - Click ADD FILTER, select the column to filter by, and enter the filter values
- Search - Click SEARCH and type the value to search by
- Sort - Click a column header to sort by that column
- Column selection - Click COLUMNS and select the columns to display in the table
Adding a new compute resource
To add a new compute resource:
- Go to the Compute resource table
- Click +NEW COMPUTE RESOURCE
- Select under which cluster to create the compute resource
- Select a scope
- Enter a name for the compute resource. The name must be unique.
- Optional: Provide a short description of the compute resource
- Set the resource types needed within a single node
  (The Run:ai scheduler tries to match a single node that complies with the compute resource for each of the workload’s pods)
  - GPU
    - GPU devices per pod
      The number of devices (physical GPUs) per pod
      (for example, if you request 3 devices per pod and the running workload that uses this compute resource consists of 3 pods, 9 physical GPU devices are used in total)
      Note
      - You can enter a whole number of devices (0, 1, 2, 3, …)
      - When set to zero, the workload using this compute resource neither requests nor uses GPU resources while running
      - A fraction of GPU memory can be requested only when the number of devices is set to 1
      - When set to a number higher than 1, the entire GPU memory of the devices is used by the running workloads
    - GPU memory per device
      - Select the memory request format
        - % (of device) - Fraction of a GPU device’s memory
        - MB (memory size) - An explicit GPU memory unit
        - GB (memory size) - An explicit GPU memory unit
        - Multi-instance GPU (MIG) - MIG profile (Deprecated)
      - Set the memory Request - The minimum amount of GPU memory that is provisioned per device. Any pod of a running workload that uses this compute resource receives this amount of GPU memory for each device the pod utilizes
      - Optional: Set the memory Limit - The maximum amount of GPU memory that is provisioned per device. Any pod of a running workload that uses this compute resource receives at most this amount of GPU memory for each device the pod utilizes.
        To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request.
      Note
      - GPU memory limit is disabled by default. If you cannot see the Limit toggle in the compute resource form, your administrator must enable it under General Settings → Resources → GPU resource optimization
      - When a Limit is set and is higher than the Request, the scheduler allows each pod to reach the maximum amount of GPU memory in an opportunistic manner (only upon availability)
      - If the GPU memory Limit is higher than the Request, the pod is prone to be killed by the Run:ai toolkit (out of memory signal). The greater the difference between the GPU memory used and the request, the higher the risk of being killed
      - If GPU resource optimization is turned off, the minimum and maximum are in fact equal
  - CPU
    - CPU compute per pod
      - Select the units for the CPU compute (Cores / Millicores)
      - Set the CPU compute Request - The minimum amount of CPU compute that is provisioned per pod. Any pod of a running workload that uses this compute resource receives this amount of CPU compute.
      - Optional: Set the CPU compute Limit - The maximum amount of CPU compute that is provisioned per pod. Any pod of a running workload that uses this compute resource receives at most this amount of CPU compute.
        To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request.
        By default, the limit is set to “Unlimited”, which means that the pod may consume all of the node's free CPU compute resources.
    - CPU memory per pod
      - Select the units for the CPU memory (MB / GB)
      - Set the CPU memory Request - The minimum amount of CPU memory that is provisioned per pod. Any pod of a running workload that uses this compute resource receives this amount of CPU memory.
      - Optional: Set the CPU memory Limit - The maximum amount of CPU memory that is provisioned per pod. Any pod of a running workload that uses this compute resource receives at most this amount of CPU memory.
        To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request.
        By default, the limit is set to “Unlimited”, which means that the pod may consume all of the node's free CPU memory resources.
      Note
      If the CPU memory Limit is higher than the Request, the pod is prone to be killed by the operating system (out of memory signal). The greater the difference between the CPU memory used and the request, the higher the risk of being killed.
- Optional: More settings
  - Increase shared memory size
    When enabled, the shared memory size available to the pod is increased from the default 64MB to the node's total available memory or to the CPU memory limit, if one is set above.
  - Set extended resource(s)
    Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides
- Click CREATE COMPUTE RESOURCE
Note
You can also add compute resources directly when creating a specific workspace, training, or inference workload.
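The GPU and limit rules from the steps above can be sketched in a few lines of Python. This is a minimal illustration, not Run:ai's validation logic: the function names, parameter names, and messages are hypothetical, and `None` is used here as a stand-in for an unset or "Unlimited" limit.

```python
from typing import List, Optional


def total_gpu_devices(devices_per_pod: int, pods: int) -> int:
    """Physical GPUs a workload uses: devices per pod times number of pods."""
    return devices_per_pod * pods


def check_settings(
    gpu_devices_per_pod: int,
    gpu_memory_fraction: Optional[float] = None,  # e.g. 0.5 for 50% of a device
    request: float = 0.0,
    limit: Optional[float] = None,  # None stands in for "Unlimited" / no limit set
) -> List[str]:
    """Return rule violations; an empty list means the values are consistent."""
    problems: List[str] = []
    if gpu_devices_per_pod < 0:
        problems.append("GPU devices per pod must be a whole number, 0 or higher")
    if gpu_memory_fraction is not None and gpu_devices_per_pod != 1:
        problems.append("a fraction of GPU memory requires exactly 1 device per pod")
    if limit is not None and limit < request:
        problems.append("the limit must be equal to or higher than the request")
    return problems


# The example from the text: 3 devices per pod across 3 pods.
print(total_gpu_devices(3, 3))  # 9

# Requesting half a GPU's memory with 2 devices per pod breaks the fraction rule.
print(check_settings(gpu_devices_per_pod=2, gpu_memory_fraction=0.5))
```

The same request/limit check applies to GPU memory, CPU compute, and CPU memory, which is why the sketch takes a generic `request` and `limit` pair rather than one pair per resource type.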
Editing a compute resource
To edit a compute resource:
- Select the compute resource from the table
- Click RENAME to edit its name and description
Note
Additional fields can be edited using the API.
Copying & editing a compute resource
To copy & edit a compute resource:
- Select the compute resource you want to duplicate
- Click COPY & EDIT
- Update the compute resource and click CREATE COMPUTE RESOURCE
Deleting a compute resource
- Select the compute resource you want to delete
- Click DELETE
- In the dialog, click DELETE to confirm the deletion
Note
It is not possible to delete a compute resource that is being used by an existing workload or template.
Using API
Go to the Compute resources API reference to view the available actions