Quickstart: Launch an Inference Workload¶
Introduction¶
Machine learning (ML) inference refers to the process of using a trained machine learning model to make predictions or generate outputs based on new, unseen data. After a model has been trained on a dataset, inference involves applying this model to new examples to produce results such as classifications, predictions, or other types of insights.
The quickstart below shows an inference server running the model and an inference client.
There are various ways to submit a Workload:
- Run:ai command-line interface (CLI)
- Run:ai user interface
- Run:ai API
At this time, Inference services cannot be created via the CLI. The CLI can be used for creating a client to query the inference service.
Prerequisites¶
To complete this Quickstart, the Infrastructure Administrator will need to install some optional inference prerequisites as described here.
To complete this Quickstart, the Platform Administrator will need to provide you with:
- ML Engineer access to a Project in Run:ai named "team-a".
- The project should be assigned a quota of at least 1 GPU.
- The URL of the Run:ai Console. E.g. https://acme.run.ai.
As described above, the inference client can be created via the CLI. To do this, you will need the Run:ai CLI installed on your machine. There are two available CLI variants:
- The older V1 CLI. See installation here
- A newer V2 CLI, supported with clusters of version 2.18 and up. See installation here
Step by Step Walkthrough¶
Login¶
- CLI (V1 or V2): Run runai login and enter your credentials.
- User interface: Browse to the provided Run:ai user interface and log in with your credentials.
- API: To use the API, you will need to obtain a token. Please follow the API authentication article.
Create an Inference Server Environment¶
To complete this Quickstart via the UI, you will need to create a new Inference Server Environment asset.
This is a one-time step for all Inference workloads using the same image.
Under Environments, select NEW ENVIRONMENT. Then select:
- A default (cluster) scope.
- The environment name inference-server.
- The image gcr.io/run-ai-demo/example-triton-server.
- Under type of workload, select inference.
- Under endpoint, set the container port as 8000, which is the port that the Triton server is using.
Run an Inference Workload¶
Not available right now via the CLI (V1 or V2).
- In the Run:ai UI, select Workloads.
- Select New Workload and then Inference.
- You should already have Cluster and Project selected. Enter inference-server-1 as the name and press CONTINUE.
- Under Environment, select inference-server.
- Under Compute Resource, select half-gpu.
- Under Replica autoscaling, select a minimum of 1 and a maximum of 2.
- Under conditions for a new replica, select Concurrency and set the value as 3.
- Set the scale to zero option to 5 minutes.
- Select CREATE INFERENCE.
Note
For more information on submitting Workloads and creating Assets via the user interface, see Workload documentation.
curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"name": "inference-server-1",
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image": "gcr.io/run-ai-demo/example-triton-server",
"servingPort": {
"protocol": "http",
"container": 8000
},
"autoscaling": {
"minReplicas": 1,
"maxReplicas": 2,
"metric": "concurrency",
"metricThreshold": 3,
"scaleToZeroRetentionSeconds": 300
},
"compute": {
"cpuCoreRequest": 0.1,
"gpuRequestType": "portion",
"cpuMemoryRequest": "100M",
"gpuDevicesRequest": 1,
"gpuPortionRequest": 0.5
}
}
}'
- <COMPANY-URL> is the link to the Run:ai user interface. For example, acme.run.ai.
- <TOKEN> is an API access token. See above on how to obtain a valid token.
- <PROJECT-ID> is the ID of the team-a Project. You can get the Project ID via the Get Projects API.
- <CLUSTER-UUID> is the unique identifier of the Cluster. You can get the Cluster UUID by adding the "Cluster ID" column to the Clusters view.
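As a hedged sketch, the same request can be wrapped in small shell functions so that the placeholders become arguments. The function names build_inference_payload and submit_inference are hypothetical and not part of Run:ai; the endpoint path and JSON fields are copied unchanged from the curl command above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical. The request body fields
# and values are copied from the curl example in this Quickstart.

build_inference_payload() {
  # $1 = workload name, $2 = project ID, $3 = cluster UUID
  # Note: values are inserted as-is, so they must not contain double quotes.
  cat <<EOF
{
  "name": "$1",
  "projectId": "$2",
  "clusterId": "$3",
  "spec": {
    "image": "gcr.io/run-ai-demo/example-triton-server",
    "servingPort": { "protocol": "http", "container": 8000 },
    "autoscaling": {
      "minReplicas": 1,
      "maxReplicas": 2,
      "metric": "concurrency",
      "metricThreshold": 3,
      "scaleToZeroRetentionSeconds": 300
    },
    "compute": {
      "cpuCoreRequest": 0.1,
      "gpuRequestType": "portion",
      "cpuMemoryRequest": "100M",
      "gpuDevicesRequest": 1,
      "gpuPortionRequest": 0.5
    }
  }
}
EOF
}

submit_inference() {
  # $1 = console host (e.g. acme.run.ai), $2 = API token,
  # $3 = workload name, $4 = project ID, $5 = cluster UUID
  # "-d @-" tells curl to read the request body from standard input.
  build_inference_payload "$3" "$4" "$5" |
    curl -L "https://$1/api/v1/workloads/inferences" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $2" \
      -d @-
}
```

For example, calling submit_inference acme.run.ai "$TOKEN" inference-server-1 "$PROJECT_ID" "$CLUSTER_UUID" (with shell variables you have set yourself) would send the same request as the curl command. Running build_inference_payload on its own prints the body, which is a convenient way to inspect it before submitting.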
Note
- The above API snippet will only work with Run:ai clusters of version 2.18 and above. For older clusters, use the now-deprecated Cluster API.
- For more information on the Inference Submit API, see the API Documentation.
This starts a Triton inference server with a maximum of 2 instances, each consuming half a GPU.
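To make the autoscaling settings concrete (minimum 1, maximum 2, concurrency threshold 3), the following is a rough sketch of how a replica count could follow the load. It assumes the autoscaler aims for at most the threshold number of concurrent requests per replica and clamps the result to the configured range; the real autoscaler may smooth or delay its decisions, and the helper name estimate_replicas is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of Run:ai.
# ASSUMPTION: replicas = ceil(concurrent / threshold), clamped to [min, max].

estimate_replicas() {
  # $1 = concurrent requests, $2 = threshold, $3 = min replicas, $4 = max replicas
  n=$(( ($1 + $2 - 1) / $2 ))   # integer ceiling division
  if [ "$n" -lt "$3" ]; then n=$3; fi
  if [ "$n" -gt "$4" ]; then n=$4; fi
  echo "$n"
}

# With the values used in this Quickstart:
#   estimate_replicas 2 3 1 2   prints 1
#   estimate_replicas 4 3 1 2   prints 2
#   estimate_replicas 7 3 1 2   prints 2 (capped at the maximum)
```

At the maximum of 2 replicas with half a GPU each, the workload uses 1 GPU in total, which matches the Project quota of at least 1 GPU listed in the prerequisites.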
Query the Inference Server¶
You can use the Run:ai Triton demo client to send requests to the server.
Find the Inference Server Endpoint¶
- Under Workloads, select Columns on the top right. Add the column Connections.
- See the connections of the inference-server-1 workload.
- Copy the inference endpoint URL.
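Before starting the client, it can help to confirm that the server behind the copied URL is answering. Triton's HTTP API exposes a readiness route at /v2/health/ready. The sketch below builds that URL and probes it with curl. The helper names are hypothetical, and it assumes the endpoint is reachable from the machine where you run the check, which depends on your cluster's networking.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Run:ai or Triton tooling.

triton_ready_url() {
  # $1 = inference endpoint URL copied from the Connections column.
  # "${1%/}" drops one trailing slash, if present, to avoid "//" in the path.
  printf '%s/v2/health/ready\n' "${1%/}"
}

check_triton_ready() {
  # -s: silent, -f: non-zero exit status on HTTP errors, -o /dev/null: discard body.
  # ASSUMPTION: the endpoint is reachable from where this runs.
  curl -sf -o /dev/null "$(triton_ready_url "$1")"
}
```

For example, check_triton_ready "<INFERENCE-ENDPOINT>" && echo ready prints ready only when the server reports that it can accept requests.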
Open a terminal and run:
- In the Run:ai UI, select Workloads.
- Select New Workload and then Training.
- You should already have Cluster, Project and a start from scratch Template selected. Enter inference-client-1 as the name and press CONTINUE.
- Select NEW ENVIRONMENT. Enter inference-client as the name and gcr.io/run-ai-demo/example-triton-client as the image. Select CREATE ENVIRONMENT.
- When the previous screen comes up, select cpu-only under the Compute resource.
- Under runtime settings, enter the command as perf_analyzer and arguments -m inception_graphdef -p 3600000 -u <INFERENCE-ENDPOINT> (replace <INFERENCE-ENDPOINT> with the above URL).
- Select CREATE TRAINING.
In the user interface, under inference-server-1, go to the Metrics tab and watch as the various GPU and inference metrics graphs rise.
Stop Workload¶
Run the following:
Not available right now
Select the two workloads and press DELETE.