
Submit an Inference Workload via YAML

Parameters:

  • <WORKLOAD-NAME>. The name of the Workload. The name must comply with Kubernetes naming conventions for DNS Label names. For fractional GPU Workloads, the name is limited to 18 characters.
  • <IMAGE-NAME>. The name of the Docker image to use. Example: gcr.io/run-ai-demo/quickstart-inference-marian
  • <USER-NAME>. The name of the user submitting the Workload. The name is used for display purposes only when Run:AI is installed in an unauthenticated mode.
  • <REQUESTED-GPUs>. The number of GPUs (an integer) to allocate to the Workload. Examples: 1, 2
  • <TARGET-PORT>. The port on which the inference server inside the container listens.

Submit Inference Workloads Allocating Full GPUs

Copy the following into a file while substituting the parameters:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <WORKLOAD-NAME>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <WORKLOAD-NAME>
  template:
    metadata:
      labels:
        app: <WORKLOAD-NAME>
      annotations:
        user: <USER-NAME>
    spec:
      schedulerName: runai-scheduler
      containers:
        - image: <IMAGE-NAME>
          name: <WORKLOAD-NAME>
          ports:
            - containerPort: <TARGET-PORT>
          resources:
            limits:
              nvidia.com/gpu: <REQUESTED-GPUs>
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: <WORKLOAD-NAME>
  name: <WORKLOAD-NAME>
spec:
  type: NodePort
  ports:
    - port: <TARGET-PORT>
      targetPort: <TARGET-PORT>
  selector:
    app: <WORKLOAD-NAME>

Note

This example also creates a Service, which is used to connect to the inference server. Creating it is not mandatory, but most inference use cases need one.
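Once the Workload is running, you can reach the inference server through the NodePort Service. A minimal sketch, assuming the server answers plain HTTP (the actual request path and payload depend on the inference image):

kubectl get service <WORKLOAD-NAME>

The PORT(S) column shows the node port that Kubernetes assigned, for example <TARGET-PORT>:30080/TCP. You can then send requests to that port on any cluster node:

curl http://<NODE-IP>:30080/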

To submit the Workload, run:

kubectl apply -f <FILE-NAME>
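To verify that the Deployment was created and its pod was scheduled, you can use standard kubectl commands (the -l selector matches the app label set in the pod template above):

kubectl get deployment <WORKLOAD-NAME>
kubectl get pods -l app=<WORKLOAD-NAME>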

Submit Inference Workloads Allocating Fractions of a GPU

Workloads using GPU fractions require a change to the above YAML. Specifically, the limits section:

limits:
  nvidia.com/gpu: <REQUESTED-GPUs>

should be omitted from the container, and a gpu-fraction annotation added to the pod template metadata instead:

spec:
  template: 
    metadata:
      annotations:
        gpu-fraction: "0.5"
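For context, within the full Deployment above the annotation sits in the pod template metadata, next to the existing user annotation. A sketch, with 0.5 as an example value requesting half of a GPU:

spec:
  template:
    metadata:
      labels:
        app: <WORKLOAD-NAME>
      annotations:
        user: <USER-NAME>
        gpu-fraction: "0.5"

Note that the value must be quoted, as Kubernetes annotation values are strings.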

Workloads using NVIDIA MPS similarly require an annotation in the pod template metadata:

spec:
  template: 
    metadata:
      annotations:
        mps: "true"
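This annotation lands in the same place as gpu-fraction above. A sketch:

spec:
  template:
    metadata:
      annotations:
        user: <USER-NAME>
        mps: "true"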

Important

To use MPS, your administrator must first enable it. See the setup document.

Delete Workloads

To delete a Run:AI inference Workload, delete the Deployment and the Service that were created above. Since both were created from the same file, the simplest way is:

kubectl delete -f <FILE-NAME>
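Alternatively, delete the objects by name:

kubectl delete deployment <WORKLOAD-NAME>
kubectl delete service <WORKLOAD-NAME>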
