Submit a Run:AI Job via YAML¶
The easiest way to submit Jobs to the Run:AI GPU cluster is via the Run:AI Command-line interface (CLI). The CLI is not required, however. It is a wrapper around the more detailed Kubernetes API syntax, which is expressed in YAML.
There are cases where you want to forgo the CLI and use direct YAML calls. A frequent scenario for using the Kubernetes YAML syntax to submit Jobs is integrations. Researchers may already be working with an existing system that submits Jobs, and want to continue working with the same system. Though it is possible to call the Run:AI CLI from the customer's integration, it is sometimes not enough.
We differentiate between three types of Workloads:
- Train workloads. Train workloads are characterized by a deep learning session that has a start and an end. A Training session can take anywhere from a few minutes to a couple of weeks. It can be interrupted in the middle and later restored. Training workloads typically utilize large percentages of GPU computing power and memory.
- Build workloads. Build workloads are interactive. They are used by data scientists to write machine learning code and test it against subsets of the data. Build workloads typically do not maximize usage of the GPU.
- Inference workloads. Inference workloads are used for serving models in production. For details on how to submit Inference workloads via YAML see here.
Train and Build workloads map to Kubernetes objects as follows:
- A Kubernetes Job is used for Train workloads. A Job has a distinct "end", at which point the Job is either "Completed" or "Failed".
- A Kubernetes StatefulSet is used for Build workloads. Build workloads are interactive sessions. StatefulSets do not end on their own. Instead, they must be stopped manually.
Run:AI extends the Kubernetes Scheduler. A Kubernetes Scheduler is the software that determines which workload to start on which node. Run:AI provides a custom scheduler named runai-scheduler.
The Run:AI scheduler schedules computing resources by associating Workloads with Run:AI Projects:
- A Project is assigned a GPU quota through the Run:AI Administrator user interface.
- A workload must be associated with a Project name and will receive resources according to the quota defined for the Project and the Workloads currently running.
Internally, Run:AI Projects are implemented as Kubernetes namespaces. The scripts below assume that the code is being run after the relevant namespace has been set.
<JOB-NAME>. The name of the Job.
<IMAGE-NAME>. The name of the docker image to use. Example:
<USER-NAME>. The name of the user submitting the Job. The name is used for display purposes only when Run:AI is installed in an unauthenticated mode.
<REQUESTED-GPUs>. An integer number of GPUs you request to be allocated for the Job. Examples: 1, 2
<NAMESPACE>. The name of the Project's namespace. This is usually
Copy the following into a file and change the parameters:
apiVersion: run.ai/v1
kind: RunaiJob                    # see note below
metadata:
  name: <JOB-NAME>
  namespace: <NAMESPACE>
  labels:
    priorityClassName: "build"    # see note below
spec:
  template:
    metadata:
      labels:
        user: <USER-NAME>
    spec:
      containers:
      - name: <JOB-NAME>
        image: <IMAGE-NAME>
        resources:
          limits:
            nvidia.com/gpu: <REQUESTED-GPUs>
      restartPolicy: Never
      schedulerName: runai-scheduler
To submit the job, run:
kubectl apply -f <FILE-NAME>
- You can use either a regular Job or a RunaiJob. The latter is a Run:AI object that works around various Kubernetes bugs and provides better naming for multiple pods in Hyper-Parameter Optimization scenarios.
- The priorityClassName field is equivalent to running a job via the CLI with the '--interactive' flag. To run a Train job, delete this line.
- The runai submit CLI command includes many more flags. These flags can be correlated with Kubernetes API functions and added to the YAML above.
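For integrations that generate manifests programmatically, the template above can be sketched as a small Python function. This is a minimal sketch and not part of the Run:AI tooling: the function name build_runai_job and its parameters are hypothetical, and the placeholder values in the usage section are illustrative only. It prints JSON rather than YAML, which kubectl apply -f also accepts. Passing interactive=False omits the priorityClassName label, as the note above describes for Train jobs.

```python
import json


def build_runai_job(job_name, image, namespace, user, gpus, interactive=True):
    """Return the RunaiJob manifest from the template above as a dict.

    Hypothetical helper for illustration; the field layout mirrors the
    YAML template in this document.
    """
    metadata = {"name": job_name, "namespace": namespace}
    if interactive:
        # Equivalent to the CLI '--interactive' flag; left out for Train jobs.
        metadata["labels"] = {"priorityClassName": "build"}

    return {
        "apiVersion": "run.ai/v1",
        "kind": "RunaiJob",
        "metadata": metadata,
        "spec": {
            "template": {
                "metadata": {"labels": {"user": user}},
                "spec": {
                    "containers": [
                        {
                            "name": job_name,
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                    "restartPolicy": "Never",
                    "schedulerName": "runai-scheduler",
                },
            }
        },
    }


if __name__ == "__main__":
    # Placeholder values (assumptions); replace them before submitting.
    manifest = build_runai_job("job1", "ubuntu", "my-namespace", "alice", 1)
    print(json.dumps(manifest, indent=2))
```

Redirect the output to a file and submit it with the kubectl apply command shown above.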
Using Fractional GPUs¶
To submit a Job with a fraction of a GPU, replace <REQUESTED-GPUs> with a fraction in quotes, e.g.
limits:
  nvidia.com/gpu: "0.5"
where "0.5" is the requested GPU fraction.
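When manifests are generated by code, the distinction comes down to the type of the value: whole GPUs are written as an integer, fractions as a quoted string. The following is a minimal sketch of that rule. The helper name gpu_limit is hypothetical, and the check that the request is positive is an assumption added for illustration, not a documented Run:AI constraint.

```python
import json


def gpu_limit(gpus):
    """Format a GPU request for the nvidia.com/gpu limit.

    Hypothetical helper: whole numbers stay integers, anything else
    becomes a string so that it is quoted in the manifest.
    """
    value = float(gpus)
    if value <= 0:
        # Assumed sanity check, not taken from the Run:AI documentation.
        raise ValueError("GPU request must be positive")
    if value.is_integer():
        return int(value)   # e.g. 1, 2
    return str(value)       # e.g. "0.5"


if __name__ == "__main__":
    print(json.dumps({"nvidia.com/gpu": gpu_limit(0.5)}))  # {"nvidia.com/gpu": "0.5"}
    print(json.dumps({"nvidia.com/gpu": gpu_limit(2)}))    # {"nvidia.com/gpu": 2}
```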
Mapping Additional Flags¶
runai submit has a significant number of flags. The easiest way to find the mapping from a flag to the correct YAML attribute is to use the --dry-run flag, which writes the generated YAML to a file instead of submitting the Job.
For example, to find the location of the --large-shm flag, run:
> runai submit -i ubuntu --large-shm --dry-run
Template YAML file can be found at: /var/folders/xb/rnf9b1bx2jg45c7jprv71d9m0000gn/T/job.yaml185826190
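An integration that automates this lookup needs the path of the generated file. The sketch below extracts it from the command output. It assumes the output contains a line in exactly the form shown above and that the path contains no spaces. The function name template_path is hypothetical.

```python
import re

# Matches the line printed by the dry run, as shown in the sample output above.
_TEMPLATE_LINE = re.compile(r"Template YAML file can be found at:\s*(\S+)")


def template_path(dry_run_output):
    """Return the template file path reported by a dry run, or None."""
    match = _TEMPLATE_LINE.search(dry_run_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "Template YAML file can be found at: "
        "/var/folders/xb/rnf9b1bx2jg45c7jprv71d9m0000gn/T/job.yaml185826190"
    )
    print(template_path(sample))
```

The returned file can then be opened and searched for the attribute that the flag added.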
To delete a Run:AI workload, delete the Job:
kubectl delete runaijob <JOB-NAME>
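An existing system that drives the cluster from code can wrap the submit and delete commands from this page. The sketch below only builds and runs the two kubectl invocations shown in this document. The function names are hypothetical, and it assumes kubectl is on the PATH and already points at the relevant namespace.

```python
import subprocess


def apply_cmd(manifest_file):
    """Build the command that submits a manifest file."""
    return ["kubectl", "apply", "-f", manifest_file]


def delete_cmd(job_name):
    """Build the command that deletes a RunaiJob by name."""
    return ["kubectl", "delete", "runaijob", job_name]


def run(cmd):
    """Run a command, raise CalledProcessError on failure, return its stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(" ".join(apply_cmd("job.yaml")))
    print(" ".join(delete_cmd("job1")))
```

Building the argument list separately from running it keeps the commands easy to log and to test without a cluster.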
- See how to use the above YAML syntax with Kubernetes API
- Use the Researcher REST API to submit, list and delete Jobs.