Quickstart: Launch an Inference Workload

Introduction

Machine learning (ML) inference is the process of feeding live data points into a trained machine-learning model to calculate an output.

With Inference, you take a trained Model and deploy it into a production environment. The deployment must meet the organization's production standards, such as average and 95th-percentile response time, as well as uptime.

Prerequisites

To complete this Quickstart you must have:

Run:ai software installed on your Kubernetes cluster. See: Installing Run:ai on a Kubernetes Cluster. There are additional prerequisites for running inference. See cluster installation prerequisites for more information.
Run:ai CLI installed on your machine. See: Installing the Run:ai Command-Line Interface
ML Engineer access rights. See Adding, Updating and Deleting Users for more information.

Step-by-Step Walkthrough

Setup

Log in to the Projects area of the Run:ai user interface.
Add a Project named "team-a".
Allocate 2 GPUs to the Project.

Run an Inference Workload

In the Run:ai user interface, go to Deployments. If you do not see the Deployments section, you may not have the required access rights, or the inference module may be disabled.
Select New Deployment on the top right.
Select team-a as the project and enter a name of your choice (the query step later in this Quickstart assumes the name inference1). Use the image gcr.io/run-ai-demo/example-triton-server.
Under Resources add 0.5 GPUs.
Under Auto Scaling, set a minimum of 1 and a maximum of 2 replicas. Use the concurrency autoscaling threshold method and set a threshold of 3.
Add a Container port of 8000.

This starts an inference workload for team-a with an allocation of 0.5 GPUs per replica. Follow the workload's progress using the Deployment list in the user interface or by running runai list jobs
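The Auto Scaling settings above (minimum 1, maximum 2, concurrency threshold 3) can be read as a simple rule of thumb. The sketch below assumes the autoscaler aims for roughly ceil(in-flight requests / threshold) replicas, clamped to the minimum and maximum. This is an illustrative assumption, not Run:ai's documented algorithm, and desired_replicas is a hypothetical helper name.

```shell
# Sketch only: ASSUMES replicas = ceil(in_flight / threshold), clamped to [min, max].
# desired_replicas is a hypothetical helper, not part of the Run:ai CLI.
# The function body runs in a subshell so its variables stay local.
desired_replicas() (
  in_flight=$1
  threshold=$2
  min=$3
  max=$4
  # Integer ceiling division.
  want=$(( (in_flight + threshold - 1) / threshold ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
)

# Settings from this Quickstart: threshold 3, minimum 1, maximum 2.
for in_flight in 0 2 3 4 10; do
  echo "in-flight=$in_flight -> replicas=$(desired_replicas "$in_flight" 3 1 2)"
done
```

Under this assumption, up to 3 concurrent requests are served by one replica, and anything above that scales out to the maximum of 2.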

Query the Inference Server

The inference server you just created accepts queries on port 8000. You can use the Run:ai Triton demo client to send requests to the server:

Find the server's IP address by running kubectl get svc -n runai-team-a. If you named the deployment inference1, use the Cluster IP of the inference1-00001-private service.
Replace <IP> below and run:

runai submit inference-client -i gcr.io/run-ai-demo/example-triton-client \
    -- perf_analyzer -m inception_graphdef -p 3600000 -u <IP>
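The lookup and substitution can also be scripted. The following is a minimal sketch, assuming the deployment was named inference1 (so the service is inference1-00001-private) and that kubectl and runai are already configured for the cluster. The function names cluster_ip and submit_client are hypothetical helpers introduced here for illustration.

```shell
# Sketch: look up the Cluster IP and pass it to the demo client.
# ASSUMPTION: the deployment is named "inference1"; adjust SERVICE to your own name.
NAMESPACE="runai-team-a"
SERVICE="inference1-00001-private"

# Print the Cluster IP of a service (arguments: namespace, service name).
cluster_ip() {
  kubectl get svc "$2" -n "$1" -o jsonpath='{.spec.clusterIP}'
}

# Submit the demo client against the given IP (same command as above).
submit_client() {
  runai submit inference-client -i gcr.io/run-ai-demo/example-triton-client \
    -- perf_analyzer -m inception_graphdef -p 3600000 -u "$1"
}

# Usage (requires access to the cluster):
#   submit_client "$(cluster_ip "$NAMESPACE" "$SERVICE")"
```

Keeping the two steps in separate functions lets you inspect the IP before submitting the client.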

To see the result, run the following:

runai logs inference-client

View Status in the Run:ai User Interface

Open the Run:ai user interface.
Under Deployments, you can view the new workload. Click the workload and note that the utilization graphs rise.

Stop Workload

Use the user interface to delete the workload.