
Submit a Run:AI Job via YAML

The easiest way to submit Jobs to the Run:AI GPU cluster is via the Run:AI Command-line Interface (CLI). However, the CLI is not mandatory: it is only a wrapper for a more detailed Kubernetes API syntax expressed in YAML.

There are cases where you may want to forgo the CLI and submit Jobs directly in YAML. A frequent scenario is integration: researchers may already be working with an existing system that submits Jobs and want to keep using it. Though it is possible to call the Run:AI CLI from such an integration, that is sometimes not enough.

Terminology

We differentiate between three types of Workloads:

  • Train workloads. Train workloads are characterized by a deep learning session that has a start and an end. A Training session can take anywhere from a few minutes to a couple of weeks. It can be interrupted in the middle and later restored. Training workloads typically utilize large percentages of GPU computing power and memory.
  • Build workloads. Build workloads are interactive. They are used by data scientists to write machine learning code and test it against subsets of the data. Build workloads typically do not maximize usage of the GPU.
  • Inference workloads. Inference workloads are used for serving models in production. For details on how to submit Inference workloads via YAML, see the Run:AI inference documentation.

The internal Kubernetes implementation of Train is a CRD (Custom Resource Definition) named RunaiJob, which is similar to a Kubernetes Job. The internal implementation of Build is a Kubernetes StatefulSet.

  • A Kubernetes Job is used for Train workloads. A Job has a distinctive "end", at which time the Job is either "Completed" or "Failed".
  • A Kubernetes StatefulSet is used for Build workloads. Build workloads are interactive sessions. StatefulSets do not end on their own; instead, they must be stopped manually.
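
As a quick sanity check, you can list both implementations with standard kubectl commands (a sketch; it assumes the Run:AI cluster components are installed and the relevant namespace is set):

kubectl get runaijobs      # Train workloads (RunaiJob custom resources)
kubectl get statefulsets   # Build workloads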

Run:AI extends the Kubernetes Scheduler. A Kubernetes Scheduler is the software that determines which workload to start on which node. Run:AI provides a custom scheduler named runai-scheduler.

The Run:AI scheduler schedules computing resources by associating Workloads with Run:AI Projects:

  • A Project is assigned a GPU quota through the Run:AI Administrator user interface.
  • A Workload must be associated with a Project name, and it receives resources according to the Project's defined quota and the currently running Workloads.

Internally, Run:AI Projects are implemented as Kubernetes namespaces. The scripts below assume that the code is being run after the relevant namespace has been set.
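
For example, a minimal sketch of pointing kubectl at a Project's namespace (the Project name team-a and the runai-<project-name> naming convention are assumptions; verify the actual namespace name in your cluster):

kubectl get namespaces | grep runai
kubectl config set-context --current --namespace=runai-team-a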

Submit Workloads

The YAML templates below use the following parameters. Replace them with your own values:

  • <JOB-NAME>. The name of the Job.

  • <IMAGE-NAME>. The name of the docker image to use. Example: gcr.io/run-ai-demo/quickstart

  • <USER-NAME>. The name of the user submitting the Job. The name is used for display purposes only when Run:AI is installed in an unauthenticated mode.

  • <REQUESTED-GPUs>. An integer number of GPUs you request to be allocated for the Job. Examples: 1, 2

Regular Jobs

Copy the following into a file and change the parameters:

apiVersion: run.ai/v1
kind: RunaiJob # see note below
metadata:
  name: <JOB-NAME>
  labels:
    priorityClassName: "build" # see note below
spec:
  template:
    metadata:
      labels:
        user: <USER-NAME>
    spec:
      containers:
      - name: <JOB-NAME>
        image: <IMAGE-NAME>
        resources:
          limits:
            nvidia.com/gpu: <REQUESTED-GPUs>
      restartPolicy: Never
      schedulerName: runai-scheduler
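
For reference, here is a filled-in sketch of the template for a Train workload. All names are hypothetical placeholders, and the image is the example one from this page; the priorityClassName label is omitted because this is a Train job:

apiVersion: run.ai/v1
kind: RunaiJob
metadata:
  name: train-example                        # hypothetical Job name
spec:
  template:
    metadata:
      labels:
        user: jane                           # hypothetical user name
    spec:
      containers:
      - name: train-example
        image: gcr.io/run-ai-demo/quickstart # example image from this page
        resources:
          limits:
            nvidia.com/gpu: 1                # request one whole GPU
      restartPolicy: Never
      schedulerName: runai-scheduler         # use the Run:AI scheduler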

To submit the job, run:

kubectl apply -f <FILE-NAME>
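
After submitting, you can check the Job's status and its pods with standard kubectl commands (a sketch; the output columns vary by Kubernetes version):

kubectl get runaijob <JOB-NAME>
kubectl get pods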

Note

  • You can use either a regular Kubernetes Job (apiVersion: batch/v1) or a RunaiJob. The latter is a Run:AI object that works around various Kubernetes bugs and provides better naming for multiple pods in Hyper-Parameter Optimization scenarios.
  • Using build in the priorityClassName field is equivalent to running a job via the CLI with the --interactive flag. To run a Train job, delete this line.
  • The runai submit CLI command includes many more flags. These flags can be correlated with Kubernetes API functions and added to the YAML above.
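
For example, a sketch of how additional resource requests can be expressed directly in the container spec using standard Kubernetes fields (whether a particular CLI flag maps exactly to these fields is an assumption to verify against the CLI reference):

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    cpu: "2"        # CPU cores requested
    memory: 4Gi     # memory requested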

Using Fractional GPUs

Jobs that use fractional GPUs require a change to the YAML above. Specifically, the limits section:

limits:
  nvidia.com/gpu: <REQUESTED-GPUs>

should be omitted and replaced with:

spec:
  template:
    metadata:
      annotations:
        gpu-fraction: "0.5"

where "0.5" is the requested GPU fraction.

Delete Workloads

To delete a Run:AI workload, delete the Job:

kubectl delete runaijob <JOB-NAME>
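
Since Build workloads are implemented as StatefulSets rather than RunaiJobs, a sketch of the equivalent removal for an interactive workload (assuming the StatefulSet carries the same name as the workload; verify with kubectl get statefulsets):

kubectl delete statefulset <JOB-NAME>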
