Deploy a custom inference workload¶
This article explains how to create a custom inference workload via the Run:ai UI.
An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.
The inference workload is assigned to a project and is affected by the project’s quota.
To learn more about the inference workload type in Run:ai and to determine whether it is the most suitable workload type for your goals, see Workload types.
Creating a custom inference workload¶
Before you start, make sure you have a project.
To add a new custom inference workload:
- Go to the Workload manager → Workloads
- Click +NEW WORKLOAD and select Inference
Within the new inference form:
- Select the cluster in which to create the inference workload
- Select the project in which your inference will run
- Select Custom inference as the inference type
- Enter a unique name for the inference workload (if the name already exists in the project, you will be requested to submit a different name)
- Click CONTINUE
In the next step:
- Select the environment for your inference workload
    - Select an environment or click +NEW ENVIRONMENT to add a new environment to the gallery. For a step-by-step guide on adding environments to the gallery, see Environments. Once created, the new environment will be automatically selected.
- Set an inference serving endpoint. The connection protocol and the container port are defined within the environment.
    - Optional: Modify who can access the endpoint (a sketch of calling a restricted endpoint follows this list):
        - Public (default): everyone within the network can access the endpoint with no authentication
        - All authenticated users: everyone within the organization's account that can log in (to Run:ai or via SSO)
        - Specific group(s):
            - Click +GROUP
            - Enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the endpoint.
        - Specific user(s):
            - Click +USER
            - Enter a valid email address or username. If you remove yourself, you will lose access to the endpoint.
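For illustration only, here is a minimal sketch of calling the serving endpoint once the workload is running, assuming a JSON-over-HTTP model server and an access-restricted endpoint. The URL, token, and payload shape are placeholders, not Run:ai specifics:

```python
import requests

# Placeholders -- use the endpoint URL shown for your workload and a token
# obtained through your identity provider. For a Public endpoint, omit the
# Authorization header entirely.
ENDPOINT_URL = "https://my-inference.example.com/v1/predict"
TOKEN = "<access-token>"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"inputs": [[0.1, 0.2, 0.3]]},  # payload format depends on your model server
    timeout=30,
)
response.raise_for_status()
print(response.json())
```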
- Set the connection for your tool(s). The tools are configured as part of the environment. (A sketch of reaching a node-port tool follows this list.)
    - External URL:
        - Custom URL: set the URL
        - Optional: Modify who can access the tool:
            - All authenticated users (default): everyone within the organization's account
            - Specific group(s):
                - Click +GROUP
                - Enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.
            - Specific user(s):
                - Click +USER
                - Enter a valid email address or username. If you remove yourself, you will lose access to the tool.
    - Node port:
        - Custom port: set the node port (enter a port between 30000 and 32767; if the node port is already in use, the workload will fail and display an error message)
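As a quick illustration of the node port option, a tool exposed this way is reachable on the chosen port of any node in the cluster; the address and port below are example values only:

```python
import requests

NODE_ADDRESS = "10.0.0.12"  # example: the address of any cluster node
NODE_PORT = 30080           # example: the custom port you set (30000-32767)

resp = requests.get(f"http://{NODE_ADDRESS}:{NODE_PORT}/", timeout=10)
print(resp.status_code, resp.headers.get("content-type"))
```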
- Optional: Set the command and arguments for the container running the workload. If no command is added, the container will use the image's default command (entry-point). A sketch of the resulting container invocation follows this list item.
    - Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.
    - Set multiple arguments separated by spaces, using the following format: --arg1=val1 --arg2=val2
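To make the command/arguments split concrete, here is a hypothetical example; the script name and flags are illustrative, not values Run:ai expects:

```python
# Setting in the form (illustrative values):
#   Command:   python3
#   Arguments: serve.py --model-path=/models/my-model --port=8000
# is conceptually equivalent to the container executing:
import subprocess

subprocess.run(
    ["python3", "serve.py", "--model-path=/models/my-model", "--port=8000"],
    check=True,
)
```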
- Set the environment variable(s). Inside the container these become ordinary environment variables (see the sketch below).
    - Modify the existing environment variable(s). The existing environment variables may include instructions to guide you with entering the correct values.
    - Optional: Add new variables
        - Click +ENVIRONMENT VARIABLE
        - Enter a name
        - Select the source for the environment variable:
            - Custom: enter a value according to the provided instructions
            - Credentials: select existing credentials as the environment variable
                - Select a credential name. To add new credentials to the credentials list, and for additional information, see Credentials.
                - Select a secret key
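As a brief sketch, a process inside the container reads these variables like any other environment variables; the variable names below are examples, not keys defined by Run:ai:

```python
import os

# Example names only -- use whatever variable names you set in the form.
model_path = os.environ.get("MODEL_PATH", "/models/default")
api_key = os.environ["SERVICE_API_KEY"]  # e.g., a value injected from Credentials
print(f"Serving model from {model_path}")
```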
- Select the compute resource for your inference workload
    - Select a compute resource or click +NEW COMPUTE RESOURCE to add a new compute resource to the gallery. For a step-by-step guide on adding compute resources to the gallery, see Compute resources. Once created, the new compute resource will be automatically selected.
    - Optional: Set the minimum and maximum number of replicas to be scaled up and down to meet the changing demands of inference services.
      If the minimum and maximum numbers of replicas differ, autoscaling is triggered and you'll need to set conditions for creating a new replica. A replica is created every time a condition is met; when a condition is no longer met after a replica was created, the replica is automatically deleted to save resources. (A conceptual sketch of this threshold behavior follows this list item.)
      - Select a variable. The variable's values are monitored via the container's port:
          - Latency (milliseconds)
          - Throughput (requests/sec)
          - Concurrency (requests)
      - Set a value. This value is the threshold at which autoscaling is triggered.
    - Optional: Set when the replicas should be automatically scaled down to zero. This frees up compute resources when the model is inactive (i.e., no requests are being sent). When automatic scaling to zero is enabled, the minimum number of replicas set in the previous step automatically changes to 0.
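The following is a conceptual sketch of the threshold behavior described above, not Run:ai's actual autoscaling implementation; with scale-to-zero enabled, the minimum is effectively 0:

```python
def desired_replicas(metric_value: float, threshold: float, current: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Conceptual threshold-based scaling: add a replica while the monitored
    variable (latency, throughput, or concurrency) exceeds the threshold,
    and remove one when it no longer does."""
    if metric_value > threshold and current < max_replicas:
        return current + 1  # condition met: create a replica
    if metric_value <= threshold and current > min_replicas:
        return current - 1  # condition no longer met: delete a replica
    return current

# Example: concurrency of 12 requests against a threshold of 10
print(desired_replicas(metric_value=12, threshold=10, current=2,
                       min_replicas=0, max_replicas=5))  # -> 3
```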
- Optional: Set the order of priority for the node pools on which the scheduler tries to run the workload. When a workload is created, the scheduler tries to run it on the first node pool on the list; if that node pool doesn't have free resources, the scheduler moves on to the next one until it finds one that is available.
    - Drag and drop the node pools to change the order, remove unwanted ones, or reset to the default order defined in the project.
    - Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.
- Select a node affinity to schedule the workload on a specific node type. If the administrator added a 'node type (affinity)' scheduling rule to the project/department, this field is mandatory; otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.
- Optional: Set toleration(s) to allow the workload to be scheduled on a node with a matching taint. (A sketch of the underlying Kubernetes toleration structure follows this list item.)
Note
Tolerations are disabled by default. If you cannot see Tolerations in the menu, ask your administrator to enable them under General settings → Workloads → Tolerations.
- Click +TOLERATION
- Enter a key
- Select the operator
- Exists - If the key exists on the node, the effect will be applied.
- Equals - If the key and the value set below match the key and value on the node, the effect will be applied.
- Enter a value matching the value on the node
- Select the effect for the toleration
- NoExecute - Pods that do not tolerate this taint are evicted immediately.
- NoSchedule - No new pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node will not be evicted.
- PreferNoSchedule - The control plane will try to avoid placing a pod that does not tolerate the taint on the node, but it is not guaranteed.
- Any - The toleration matches all of the effects above.
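For context, these fields map onto a standard Kubernetes toleration; here is a sketch of the resulting structure, where the key and value are example taint data:

```python
# The form fields in Kubernetes terms. The key/value pair is an example and
# must match a taint on the target node. Selecting "Any" corresponds to
# omitting the effect field, which matches all taint effects.
toleration = {
    "key": "dedicated",      # example taint key
    "operator": "Equal",     # or "Exists" (value is then omitted)
    "value": "inference",    # example taint value
    "effect": "NoSchedule",  # or "NoExecute" / "PreferNoSchedule"
}
```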
- Optional: Select data sources for your inference workload. Select a data source or click +NEW DATA SOURCE to add a new data source to the gallery. If there are issues with connectivity to the cluster, or issues while creating the data source, the data source won't be available for selection. For a step-by-step guide on adding data sources to the gallery, see Data sources. Once created, the new data source will be automatically selected.
    - Optional: Modify the data target location for the selected data source(s).
- Optional: General settings (a sketch of the resulting metadata follows this list item):
    - Set the timeframe for auto-deletion after workload completion or failure. This is the time after which a completed or failed workload is deleted; if this field is set to 0 seconds, the workload is deleted immediately after it completes or fails.
    - Set annotation(s). Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring, and automation.
        - Click +ANNOTATION
        - Enter a name
        - Enter a value
    - Set label(s). Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing to enable querying.
        - Enter a name
        - Enter a value
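For illustration, the annotations and labels you set correspond to standard Kubernetes metadata attached to the workload; all keys and values below are examples only:

```python
# Example metadata only -- choose keys and values that fit your conventions.
metadata = {
    "annotations": {"owner-contact": "ml-team@example.com"},  # descriptive metadata
    "labels": {"team": "research", "stage": "production"},    # used for querying
}
```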
- Click CREATE INFERENCE
Managing and monitoring¶
After the inference workload is created, it is added to the Workloads table, where it can be managed and monitored.
Using API¶
To view the available actions, see the Inferences API reference.
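As a hedged sketch of programmatic submission, the snippet below assumes a bearer-token-authenticated REST call; the endpoint path, field names, and values are placeholders, so consult the Inferences API reference for the actual schema:

```python
import requests

BASE_URL = "https://<company>.run.ai"  # placeholder control-plane URL
TOKEN = "<api-token>"                  # placeholder API token

# Field names below are illustrative; the real request body is defined
# in the Inferences API reference.
payload = {
    "name": "my-inference",
    "projectId": "<project-id>",
    "clusterId": "<cluster-id>",
    "spec": {"image": "<image>", "compute": {"gpuDevicesRequest": 1}},
}

resp = requests.post(
    f"{BASE_URL}/api/v1/workloads/inferences",  # assumed path -- verify in the reference
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```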