Quickstart: Launch an Inference Workload¶
Machine learning (ML) inference is the process of feeding live data points into a trained machine-learning model to calculate an output.
With inference, you take a trained model and deploy it into a production environment. The deployment must meet the organization's production standards, such as average and 95th-percentile response time, as well as uptime.
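Production standards of this kind are usually checked against measured response times. As a rough illustration, the sketch below computes an average and a 95th-percentile response time from a list of samples. It reads "95% response time" as the 95th percentile, which is an assumption. The function name latency_stats and the sample values are made up for the example and are not part of Run:ai.

```shell
#!/usr/bin/env bash
# latency_stats: read one response time (in ms) per line on stdin and print
# "avg=<mean> p95=<95th percentile>" using the nearest-rank method.
# Hypothetical helper for illustration only.
latency_stats() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      # Nearest rank: smallest value with at least 95% of samples at or below it.
      rank = int((NR * 95 + 99) / 100)
      printf "avg=%.1f p95=%s\n", sum / NR, v[rank]
    }'
}

# Example with made-up response times in milliseconds.
printf '%s\n' 12 15 11 14 13 90 12 13 14 12 | latency_stats
# prints: avg=20.6 p95=90
```

In this sample a single slow request sets the 95th percentile while the average stays much lower, which is why both figures are typically tracked.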
To complete this Quickstart you must have:
- Run:ai software installed on your Kubernetes cluster. See: Installing Run:ai on a Kubernetes Cluster. There are additional prerequisites for running inference. See cluster installation prerequisites for more information.
- Run:ai CLI installed on your machine. See: Installing the Run:ai Command-Line Interface
- ML Engineer access rights. See Adding, Updating and Deleting Users for more information.
Step by Step Walkthrough¶
- Log in to the Projects area of the Run:ai user interface.
- Add a Project named "team-a".
- Allocate 2 GPUs to the Project.
Run an Inference Workload¶
- In the Run:ai user interface go to Deployments. If you do not see the Deployments section you may not have the required access control, or the inference module is disabled.
- Select New Deployment on the top right.
- Select team-a as a project and add an arbitrary name. Use the image
- Under Resources add 0.5 GPUs.
- Under Auto Scaling select a minimum of 1, a maximum of 2. Use the concurrency autoscaling threshold method. Add a threshold of 3.
- Add a container port of 8000.
This starts an inference workload for team-a with an allocation of a single GPU. Follow the Job's progress using the Deployment list in the user interface or by running:
runai list jobs
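To make the autoscaling settings above more concrete (minimum 1, maximum 2, concurrency threshold 3, 0.5 GPUs per replica), here is a simplified model of how a concurrency threshold could translate into a replica count. It assumes one replica per threshold's worth of concurrent requests, rounded up and clamped to the minimum and maximum. The real autoscaler may average over time windows and behave differently. desired_replicas is a hypothetical helper, not part of the Run:ai CLI.

```shell
#!/usr/bin/env bash
# desired_replicas CONCURRENT THRESHOLD MIN MAX
# Simplified model (an assumption, not the actual autoscaler algorithm):
# one replica per THRESHOLD concurrent requests, rounded up, clamped to [MIN, MAX].
desired_replicas() {
  local concurrent=$1 threshold=$2 min=$3 max=$4
  local n=$(( (concurrent + threshold - 1) / threshold ))   # ceiling division
  if (( n < min )); then n=$min; fi
  if (( n > max )); then n=$max; fi
  echo "$n"
}

# Values from this walkthrough: threshold 3, minimum 1, maximum 2, 0.5 GPU per replica.
for load in 0 2 3 4 10; do
  r=$(desired_replicas "$load" 3 1 2)
  # Track the GPU total in tenths to stay in integer arithmetic (0.5 GPU = 5 tenths).
  tenths=$(( r * 5 ))
  echo "concurrent=$load replicas=$r gpus=$(( tenths / 10 )).$(( tenths % 10 ))"
done
```

Under this model the deployment uses between 0.5 and 1 GPU in total, depending on load.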
Query the Inference Server¶
The specific inference server we just created is accepting queries over port 8000. You can use the Run:ai Triton demo client to send requests to the server:
- Find an IP address by running kubectl get svc -n runai-team-a. Use that address in place of <IP> below and run:
runai submit inference-client -i gcr.io/run-ai-demo/example-triton-client \
  -- perf_analyzer -m inception_graphdef -p 3600000 -u <IP>
- To see the result, run the following:
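Going back to the first step in this list, the IP lookup can be scripted rather than copied by hand. The sketch below is one possible approach, not the documented procedure: it uses kubectl's jsonpath output to read a Service's cluster IP and then prints the client command with <IP> filled in. The helper names service_cluster_ip and client_command and the service name my-inference-svc are hypothetical. Take the real service name from the kubectl get svc output.

```shell
#!/usr/bin/env bash
# service_cluster_ip SERVICE NAMESPACE
# Print the ClusterIP of a Kubernetes Service using kubectl's jsonpath output.
service_cluster_ip() {
  kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.clusterIP}'
}

# client_command IP
# Print (without running) the submit command from this walkthrough with <IP> filled in.
client_command() {
  printf 'runai submit inference-client -i gcr.io/run-ai-demo/example-triton-client -- perf_analyzer -m inception_graphdef -p 3600000 -u %s\n' "$1"
}

# Usage (my-inference-svc is a hypothetical placeholder; take the real name
# from the output of: kubectl get svc -n runai-team-a):
#   ip=$(service_cluster_ip my-inference-svc runai-team-a)
#   client_command "$ip"
```

Printing the command instead of executing it lets you review the substituted address before submitting the client job.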
View status on the Run:ai User Interface¶
- Open the Run:ai user interface.
- Under Deployments you can view the new Workload. Click the workload and note that the utilization graphs go up.
Use the user interface to delete the workload.
- You can also create Inference deployments via API. For more information see Submitting Workloads via YAML.
- See Deployment user interface.