Quickstart: Launch an Inference Workload¶
Introduction¶
Machine learning (ML) inference is the process of running live data points into a machine-learning algorithm to calculate an output.
With inference, you take a trained model and deploy it into a production environment. The deployment must align with the organization's production standards, such as average and 95th-percentile response time as well as uptime.
Prerequisites¶
To complete this Quickstart you must have:
- Run:ai software installed on your Kubernetes cluster. See: Installing Run:ai on a Kubernetes Cluster. There are additional prerequisites for running inference. See cluster installation prerequisites for more information.
- Run:ai CLI installed on your machine. See: Installing the Run:ai Command-Line Interface
- ML Engineer access rights. See Adding, Updating and Deleting Users for more information.
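A quick sanity check that the CLI is installed and on your path (the exact output differs by CLI version) is to print its version:

runai version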
Step by Step Walkthrough¶
Setup¶
- Log in to the Projects area of the Run:ai user interface.
- Add a Project named "team-a".
- Allocate 2 GPUs to the Project.
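Once the project exists, you can optionally verify it from the command line and set it as the default project for subsequent CLI commands (command names may vary slightly between Run:ai CLI versions):

runai list projects
runai config project team-a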
Run an Inference Workload¶
- In the Run:ai user interface, go to Deployments. If you do not see the Deployments section, you may not have the required access control, or the inference module is disabled.
- Select New Deployment on the top right.
- Select team-a as the project and add an arbitrary name. Use the image gcr.io/run-ai-demo/example-triton-server.
- Under Resources, add 0.5 GPUs.
- Under Auto Scaling, select a minimum of 1 and a maximum of 2. Use the concurrency autoscaling threshold method and add a threshold of 3.
- Add a Container port of 8000.
This starts an inference workload for team-a with 0.5 GPUs allocated per replica (up to a single GPU at the maximum of two replicas). Follow the workload's progress using the Deployments list in the user interface or by running runai list jobs.
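You can also inspect the underlying pods directly with kubectl, assuming the default runai-<project> namespace naming used later in this guide:

kubectl get pods -n runai-team-a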
Query the Inference Server¶
The inference server we just created accepts queries over port 8000. You can use the Run:ai Triton demo client to send requests to the server:
- Find the service IP address by running kubectl get svc -n runai-team-a. Use the inference1-00001-private Cluster IP.
- Replace <IP> below and run:
runai submit inference-client -i gcr.io/run-ai-demo/example-triton-client \
-- perf_analyzer -m inception_graphdef -p 3600000 -u <IP>
- To see the result, inspect the client job's output, for example as sketched below.
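A minimal way to do this with the Run:ai CLI, assuming the job name inference-client from the submit command above (flags and output format depend on your CLI version):

runai logs inference-client

You can also query the server directly over HTTP. Assuming the demo image runs a recent Triton release that exposes the KServe v2 REST API on port 8000, a readiness check looks like:

curl http://<IP>:8000/v2/health/ready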
View status on the Run:ai User Interface¶
- Open the Run:ai user interface.
- Under Deployments you can view the new workload. Click the workload and note that the utilization graphs go up.
Stop Workload¶
Use the user interface to delete the workload.
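If you also want to remove the demo client job submitted earlier, you can delete it from the CLI (a sketch; the exact subcommand form depends on your Run:ai CLI version):

runai delete job inference-client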
See also¶
- You can also create Inference deployments via API. For more information see Submitting Workloads via YAML.
- See Deployment user interface.
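For the API/YAML route mentioned above, a rough, unverified sketch of an inference workload manifest follows. The kind, apiVersion, and field names here are assumptions based on the Run:ai v2alpha1 workload CRDs; treat Submitting Workloads via YAML as the authoritative schema.

apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: inference1
  namespace: runai-team-a
spec:
  name:
    value: inference1
  image:
    value: gcr.io/run-ai-demo/example-triton-server
  gpu:
    value: "0.5"
  minScale:
    value: 1
  maxScale:
    value: 2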