
Quickstart: Launch an Inference Workload


Machine learning (ML) inference is the process of feeding live data points into a trained machine-learning model to calculate an output.

With Inference, you take a trained Model and deploy it into a production environment. The deployment must meet the organization's production standards, such as average and 95th-percentile response times and uptime.

For further information on Inference at Run:AI, see Inference overview.


To complete this Quickstart you must have:

Step by Step Walkthrough


  • Log in to the Projects area of the Run:AI Administration user interface at
  • Add a Project named "team-a".
  • Allocate 2 GPUs to the Project.

Run an Inference Workload - Single Replica

  • At the command-line run:
runai config project team-a
runai submit --name inference1 --service-type nodeport --port 8888 --inference \
    -i  -g 1

This starts an inference workload for team-a with an allocation of a single GPU. The inference workload is based on a sample Docker image. The inference engine used in this Quickstart is Marian.

  • Follow up on the Job's progress by running:
    runai list jobs

The result:


The output shows the service URL with which to connect to the service.
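
The client command later in this Quickstart needs the host name and the port separately. A minimal sketch of splitting them out of a service URL with shell parameter expansion follows. The URL value and the variable names SERVICE_URL, SERVICE_HOST and SERVICE_PORT are hypothetical, and the sketch assumes the URL has the form scheme://host:port.

```shell
# Hypothetical service URL; substitute the one shown by `runai list jobs`.
SERVICE_URL="http://192.168.1.10:30123"

# Strip the scheme ("http://") if present, then any trailing path.
hostport="${SERVICE_URL#*://}"
hostport="${hostport%%/*}"

# Everything before the last colon is the host; everything after is the port.
SERVICE_HOST="${hostport%:*}"
SERVICE_PORT="${hostport##*:}"

echo "host=${SERVICE_HOST} port=${SERVICE_PORT}"
```

The two values can then be used wherever <HOSTNAME> and <PORT> appear.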

Query the Inference Server

This Marian server accepts queries over the WebSocket protocol. To query it, you can use the Run:AI Marian sample client.

In the following command, replace <HOSTNAME> and <PORT> with the host name and port from the service URL displayed by the previous runai list jobs command:

runai submit inference-client -i \
    -- --hostname <HOSTNAME> --port <PORT> --processes 1 

To see the result, run the following:

runai logs inference-client -f

You should see a log of the inference call:


View status on the Run:AI User Interface


Stop Workload

Run the following:

runai delete inference1

This stops the inference workload. Verify this by running runai list jobs again.
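
The verification can also be scripted. The sketch below defines a hypothetical helper, job_listed, that checks whether a job name appears in a listing read from standard input. It assumes that the output of runai list jobs contains the job name as a whole word; the exact output format is not shown in this Quickstart.

```shell
# Succeeds (exit status 0) if the job name given as $1 appears as a whole
# word in the listing read from standard input.
job_listed() {
    grep -qwF -- "$1"
}

# Usage against a live cluster (requires the runai CLI):
#   if runai list jobs | job_listed inference1; then
#       echo "inference1 is still listed"
#   else
#       echo "inference1 is gone"
#   fi
```

Because deletion may take a moment to complete, the job can still be listed immediately after the delete command returns; re-running the check a few seconds later is a reasonable approach.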

Run an Inference Workload - Multiple Replicas

At the command-line run:

runai submit inference2 --service-type nodeport --port 8888 --inference \
    -i  --replicas 4 -g 0.25 

This creates 4 replicas of the same service. Each replica runs with 25% of a GPU's memory, for a total of 1 GPU across all replicas.
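
The arithmetic behind that total can be checked with a short script. This is only a sketch: the variable names are hypothetical, and the comparison against the 2 GPUs allocated to team-a is a plain numeric check, not a statement of how Run:AI enforces Project allocations.

```shell
REPLICAS=4
GPU_PER_REPLICA=0.25
PROJECT_QUOTA=2   # GPUs allocated to team-a earlier in this Quickstart

# The shell has no floating-point arithmetic, so delegate to awk.
TOTAL_GPUS=$(awk -v r="$REPLICAS" -v g="$GPU_PER_REPLICA" 'BEGIN { print r * g }')
echo "total GPUs requested: ${TOTAL_GPUS}"

# Compare the total against the Project allocation (numeric comparison in awk).
if awk -v t="$TOTAL_GPUS" -v q="$PROJECT_QUOTA" 'BEGIN { exit !(t <= q) }'; then
    echo "fits within the allocation of ${PROJECT_QUOTA} GPUs"
else
    echo "exceeds the allocation of ${PROJECT_QUOTA} GPUs"
fi
```

With 4 replicas at 0.25 GPU each, the total is 1 GPU, which is within the 2 GPUs allocated to the Project.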

Last update: August 6, 2021