Quickstart: Launch an Inference Workload¶
Introduction¶
Machine learning (ML) inference is the process of running live data through a trained machine-learning model to calculate an output.
With inference, you take a trained model and deploy it into a production environment. The deployment must meet the organization's production standards, such as average and 95th-percentile response time as well as uptime.
For further information on Inference at Run:AI, see Inference overview.
Prerequisites¶
To complete this Quickstart you must have:
- Run:AI software installed on your Kubernetes cluster. See: Installing Run:AI on an on-premise Kubernetes Cluster
- Run:AI CLI installed on your machine. See: Installing the Run:AI Command-Line Interface
Step by Step Walkthrough¶
Setup¶
- Log in to the Projects area of the Run:AI Administration user interface at https://app.run.ai/projects
- Add a Project named "team-a".
- Allocate 2 GPUs to the Project.
Run an Inference Workload - Single Replica¶
- At the command line, run:
runai config project team-a
runai submit --name inference1 --service-type nodeport --port 8888 --inference \
-i gcr.io/run-ai-demo/quickstart-inference-marian -g 1
This starts an inference workload for team-a with an allocation of a single GPU. The workload is based on the sample Docker image gcr.io/run-ai-demo/quickstart-inference-marian. The inference engine used in this Quickstart is Marian.
- Follow up on the Job's progress by running:
runai list jobs
The output shows the service URL with which to connect to the service.
Query the Inference Server¶
The Marian server accepts queries over the WebSocket protocol. You can use the Run:AI Marian sample client.
In the following command, replace <HOSTNAME> with the service URL displayed by the previous list command:
runai submit inference-client -i gcr.io/run-ai-demo/quickstart-inference-marian-client \
-- --hostname <HOSTNAME> --port 32717 --processes 1
To see the result, run the following:
runai logs inference-client -f
You should see a log of the inference call.
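The sample client container sends the query for you. If you prefer to call the service directly, the following is a minimal Python sketch using the third-party websockets package. It assumes the server exposes marian-server's default /translate WebSocket endpoint and accepts plain-text sentences; the hostname and port placeholders stand in for the values reported by runai list jobs.

import asyncio
import websockets

# Placeholders -- use the service URL and port shown by `runai list jobs`.
HOSTNAME = "<HOSTNAME>"
PORT = 32717

async def translate(sentence: str) -> str:
    # Assumes marian-server's default /translate WebSocket endpoint
    # and a plain-text request/response format.
    uri = f"ws://{HOSTNAME}:{PORT}/translate"
    async with websockets.connect(uri) as ws:
        await ws.send(sentence)   # send one source sentence
        return await ws.recv()    # receive the translated text

if __name__ == "__main__":
    print(asyncio.run(translate("Hello world")))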
View status on the Run:AI User Interface¶
- Go to https://app.run.ai/jobs
- Under "Jobs" you can view the new Workload:
Stop Workload¶
Run the following:
runai delete inference1
This stops the inference workload. Verify this by running runai list jobs again.
Run an Inference Workload - Multiple Replicas¶
At the command line, run:
runai submit inference2 --service-type nodeport --port 8888 --inference \
-i gcr.io/run-ai-demo/quickstart-inference-marian --replicas 4 -g 0.25
This creates 4 replicas of the same service, each running with 25% of the GPU memory, for a total of 1 GPU across all replicas.