Quickstart: Launch an Inference Workload

Introduction

Machine learning (ML) inference refers to the process of using a trained machine learning model to make predictions or generate outputs based on new, unseen data. After a model has been trained on a dataset, inference involves applying this model to new examples to produce results such as classifications, predictions, or other types of insights.

This quickstart launches an inference server that runs a model, then an inference client that queries it.

There are various ways to submit a Workload:

  • Run:ai command-line interface (CLI)
  • Run:ai user interface
  • Run:ai API

At this time, Inference services cannot be created via the CLI. The CLI can be used for creating a client to query the inference service.

Prerequisites

To complete this Quickstart, the Infrastructure Administrator will need to install some optional inference prerequisites as described here.

To complete this Quickstart, the Platform Administrator will need to provide you with:

  • ML Engineer access to a Run:ai Project named "team-a".
  • A quota of at least 1 GPU assigned to that Project.
  • The URL of the Run:ai Console, for example https://acme.run.ai.

As noted above, the inference client can be created via the CLI. To do so, you need the Run:ai CLI installed on your machine. Two CLI variants are available:

  • The older V1 CLI. See installation here
  • A newer V2 CLI, supported with clusters of version 2.18 and up. See installation here

Step by Step Walkthrough

Login

CLI (V1 and V2): Run runai login and enter your credentials.

User interface: Browse to the provided Run:ai user interface and log in with your credentials.

API: To use the API, you will need to obtain a token. Please follow the API authentication article.

Create an Inference Server Environment

To complete this Quickstart via the UI, you will need to create a new Inference Server Environment asset.

This is a one-time step for all Inference workloads using the same image.

Under Environments, select NEW ENVIRONMENT. Then:

  • Select the default (cluster) scope.
  • Enter inference-server as the environment name.
  • Enter gcr.io/run-ai-demo/example-triton-server as the image.
  • Under type of workload, select inference.
  • Under endpoint, set the container port to 8000, which is the port that the Triton server uses.

Run an Inference Workload

CLI (V1 and V2): Not available right now.

User interface:

  • In the Run:ai UI, select Workloads.
  • Select New Workload and then Inference.
  • You should already have Cluster and Project selected. Enter inference-server-1 as the name and press CONTINUE.
  • Under Environment, select inference-server.
  • Under Compute Resource, select half-gpu.
  • Under Replica autoscaling, select a minimum of 1 and a maximum of 2.
  • Under conditions for a new replica, select Concurrency and set the value to 3.
  • Set the scale to zero option to 5 minutes.
  • Select CREATE INFERENCE.
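
To get a feel for the autoscaling values above, the sketch below models how a concurrency target might translate into a replica count. It is a simplified illustration and not Run:ai's actual autoscaler: the function name `desired_replicas` and the ceiling-division rule are assumptions, and the scale to zero timer is deliberately left out.

```python
import math


def desired_replicas(concurrent_requests: int,
                     threshold: int = 3,
                     min_replicas: int = 1,
                     max_replicas: int = 2) -> int:
    """Hypothetical model: one replica per `threshold` concurrent requests,
    clamped to the configured minimum and maximum."""
    wanted = math.ceil(concurrent_requests / threshold)
    return max(min_replicas, min(max_replicas, wanted))


# With the quickstart values (threshold 3, min 1, max 2):
for load in (0, 3, 4, 10):
    print(load, desired_replicas(load))
```

Under this model, up to 3 concurrent requests are served by a single replica, and anything above that is capped at the maximum of 2 replicas.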

Note

For more information on submitting Workloads and creating Assets via the user interface, see Workload documentation.

API:

curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "inference-server-1",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "gcr.io/run-ai-demo/example-triton-server",
        "servingPort": {
            "protocol": "http",
            "container": 8000
        },
        "autoscaling": {
            "minReplicas": 1,
            "maxReplicas": 2,
            "metric": "concurrency",
            "metricThreshold": 3,
            "scaleToZeroRetentionSeconds": 300
        },
        "compute": {
            "cpuCoreRequest": 0.1,
            "gpuRequestType": "portion",
            "cpuMemoryRequest": "100M",
            "gpuDevicesRequest": 1,
            "gpuPortionRequest": 0.5
        }
    }
}'

  1. <COMPANY-URL> is the address of the Run:ai user interface, for example acme.run.ai.
  2. <TOKEN> is an API access token. See above for how to obtain a valid token.
  3. <PROJECT-ID> is the ID of the team-a Project. You can get the Project ID via the Get Projects API.
  4. <CLUSTER-UUID> is the unique identifier of the Cluster. You can get the Cluster UUID by adding the "Cluster ID" column to the Clusters view.
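
The same request can also be sketched in Python using only the standard library. The endpoint path and payload fields are copied from the curl example; the helper name `build_inference_request` and its parameters are hypothetical, introduced here for illustration only.

```python
import json
import urllib.request


def build_inference_request(company_url: str, token: str,
                            project_id: str, cluster_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates inference-server-1."""
    payload = {
        "name": "inference-server-1",
        "projectId": project_id,
        "clusterId": cluster_id,
        "spec": {
            "image": "gcr.io/run-ai-demo/example-triton-server",
            "servingPort": {"protocol": "http", "container": 8000},
            "autoscaling": {
                "minReplicas": 1,
                "maxReplicas": 2,
                "metric": "concurrency",
                "metricThreshold": 3,
                "scaleToZeroRetentionSeconds": 300,  # 5 minutes
            },
            "compute": {
                "cpuCoreRequest": 0.1,
                "gpuRequestType": "portion",
                "cpuMemoryRequest": "100M",
                "gpuDevicesRequest": 1,
                "gpuPortionRequest": 0.5,  # half a GPU per replica
            },
        },
    }
    return urllib.request.Request(
        url=f"https://{company_url}/api/v1/workloads/inferences",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

To submit the workload, pass the returned object to `urllib.request.urlopen` after substituting your own values for the four placeholders described above.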

Note

  • The above API snippet will only work with Run:ai clusters of version 2.18 and above. For older clusters, use the now-deprecated Cluster API.
  • For more information on the Inference Submit API, see the API Documentation.

This starts a Triton inference server with a maximum of 2 instances, each consuming half a GPU.
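
As a quick sanity check, the peak GPU demand of this workload can be compared against the Project quota from the prerequisites. The variable names below are illustrative, and the calculation assumes that quota is consumed as the simple sum of the requested GPU portions.

```python
max_replicas = 2               # "maxReplicas" in the autoscaling settings
gpu_portion_per_replica = 0.5  # half a GPU per instance
project_gpu_quota = 1.0        # prerequisite: a quota of at least 1 GPU

# Assumption: quota usage is the sum of the requested GPU portions.
peak_gpu_demand = max_replicas * gpu_portion_per_replica
fits_in_quota = peak_gpu_demand <= project_gpu_quota

print(f"peak demand: {peak_gpu_demand} GPU, fits in quota: {fits_in_quota}")
```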

Query the Inference Server

You can use the Run:ai Triton demo client to send requests to the server.

Find the Inference Server Endpoint

  • Under Workloads, select Columns on the top right. Add the column Connections.
  • See the connections of the inference-server-1 workload and copy the inference endpoint URL.
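
Before launching the client, you may want to confirm that the server answers at the copied URL. The sketch below assumes the endpoint is reachable from your machine and that the Triton server exposes the standard `/v2/health/ready` HTTP route; the helper names `readiness_url` and `is_ready` are hypothetical.

```python
import urllib.request


def readiness_url(inference_endpoint: str) -> str:
    """Append the readiness route to the endpoint URL (assumed route)."""
    return inference_endpoint.rstrip("/") + "/v2/health/ready"


def is_ready(inference_endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(readiness_url(inference_endpoint),
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers connection errors, timeouts and HTTP error statuses;
        # ValueError covers malformed URLs.
        return False
```

If the endpoint is only reachable from inside the cluster, this check will return False from a workstation even when the server is healthy.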

CLI V1: Open a terminal and run:

runai config project team-a
runai submit inference-client-1 -i gcr.io/run-ai-demo/example-triton-client \
-- perf_analyzer -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>

CLI V2: Open a terminal and run:

runai project set team-a
runai training submit inference-client-1 -i gcr.io/run-ai-demo/example-triton-client \
-- perf_analyzer -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT>

User interface:

  • In the Run:ai UI, select Workloads.
  • Select New Workload and then Training.
  • You should already have Cluster, Project and a start from scratch Template selected. Enter inference-client-1 as the name and press CONTINUE.
  • Select NEW ENVIRONMENT. Enter inference-client as the name and gcr.io/run-ai-demo/example-triton-client as the image. Select CREATE ENVIRONMENT.
  • When the previous screen comes up, select cpu-only under Compute resource.
  • Under runtime settings, enter perf_analyzer as the command and -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT> as the arguments, replacing <INFERENCE-ENDPOINT> with the URL copied above.
  • Select CREATE TRAINING.
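
The client command is the same in all three variants. The sketch below assembles it from its parts to make the flags explicit: as far as I understand perf_analyzer, `-m` names the model, `-p` is the measurement window in milliseconds, and `-u` is the server URL. The helper name and the example hostname are hypothetical.

```python
import shlex


def perf_analyzer_command(endpoint: str,
                          model: str = "inception_graphdef",
                          interval_ms: int = 3_600_000) -> list[str]:
    """Assemble the perf_analyzer invocation used by the inference client."""
    return ["perf_analyzer", "-m", model, "-p", str(interval_ms), "-u", endpoint]


# Hypothetical endpoint; substitute the URL copied from the Connections column.
cmd = perf_analyzer_command("inference-server-1.example.com")
print(shlex.join(cmd))

# 3,600,000 ms corresponds to a 60-minute window.
print(3_600_000 / 60_000, "minutes")
```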

In the user interface, under inference-server-1, go to the Metrics tab and watch as the various GPU and inference metrics graphs rise.

Stop Workload

CLI (V1 and V2): Not available right now.

User interface: Select the two workloads and press DELETE.