
Quickstart: Launch Unattended Training Workloads

Introduction

The purpose of this article is to provide a quick ramp-up to running an unattended training Workload. Training Workloads are containers that execute a program on start and shut down automatically when the task is done.

With this Quickstart you will learn how to:

  • Start a deep learning training workload.
  • View training workload status and resource consumption using the Run:ai user interface and the Run:ai CLI.
  • View training workload logs.
  • Stop the training workload.

There are various ways to submit a training Workload:

  • Run:ai command-line interface (CLI)
  • Run:ai user interface
  • Run:ai API

Prerequisites

To complete this Quickstart, the Platform Administrator will need to provide you with:

  • Researcher access to a Run:ai Project named team-a.
  • The Project should be assigned a quota of at least 1 GPU.
  • The URL of the Run:ai user interface, for example https://acme.run.ai.

To complete this Quickstart via the CLI, you will need to have the Run:ai CLI installed on your machine. There are two available CLI variants:

  • The older V1 CLI. See the V1 CLI installation instructions.
  • The newer V2 CLI, supported with clusters of version 2.18 and up. See the V2 CLI installation instructions.
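
To confirm that the CLI is installed, you can print its version; this is a minimal sanity check, and the output format differs between the two CLI variants.

runai version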

Step by Step Walkthrough

Login

Using the CLI (V1 or V2), run runai login and enter your credentials.

Using the user interface, browse to the provided Run:ai user interface URL and log in with your credentials.

Using the API, you will need to obtain a token. Please follow the API authentication article.
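
Once you have a token, a common pattern is to keep it in an environment variable and pass it as a Bearer header on every API call. The sketch below is illustrative only: the token value is a placeholder, and the listing endpoint shown is an assumption that may differ between cluster versions (see the API reference for the exact paths).

# Store the token so later curl commands can reference it (placeholder value).
export RUNAI_TOKEN="<TOKEN>"

# Authenticated calls pass the token as a Bearer header, for example listing Workloads
# (the endpoint path is an assumption; consult the API reference for your cluster version):
curl -L -s 'https://<COMPANY-URL>/api/v1/workloads' \
  -H "Authorization: Bearer ${RUNAI_TOKEN}"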

Run Workload

Using the CLI V1, open a terminal and run:

runai config project team-a   
runai submit train1 -i runai.jfrog.io/demo/quickstart -g 1

Note

For more information on the workload submit command, see the CLI documentation.

Using the CLI V2, open a terminal and run:

runai project set team-a
runai training submit train1 -i runai.jfrog.io/demo/quickstart -g 1

Note

For more information on the training submit command, see the CLI documentation.

Using the user interface:

  • In the Run:ai UI, select Workloads.
  • Select New Workload and then Training
  • You should already have Cluster, Project and a start from scratch Template selected. Enter train1 as the name and press CONTINUE.
  • Select NEW ENVIRONMENT. Enter quickstart as the name and runai.jfrog.io/demo/quickstart as the image. Then select CREATE ENVIRONMENT.
  • When the previous screen comes up, select one-gpu under the Compute resource.
  • Select CREATE TRAINING.

Note

For more information on submitting Workloads and creating Assets via the user interface, see Workload documentation.

Using the API, run:

curl -L 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "train1",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "runai.jfrog.io/demo/quickstart",
        "compute": {
            "gpuDevicesRequest": 1
        }
    }
  }'
In the command above:

  1. <COMPANY-URL> is the link to the Run:ai user interface, for example acme.run.ai.
  2. <TOKEN> is an API access token. See the Login step above on how to obtain a valid token.
  3. <PROJECT-ID> is the ID of the team-a Project. You can get the Project ID via the Get Projects API.
  4. <CLUSTER-UUID> is the unique identifier of the Cluster. You can get the Cluster UUID by adding the "Cluster ID" column to the Clusters view.
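
The notes above mention the Get Projects API. As an illustration only, and assuming a recent cluster that exposes the org-unit endpoints (the exact path is an assumption and may differ in your version), the Project list can be retrieved with the same Bearer token so you can look up the ID of team-a in the JSON response:

# List Projects; the path below is an assumption for newer clusters - check the API reference.
curl -L -s 'https://<COMPANY-URL>/api/v1/org-unit/projects' \
  -H 'Authorization: Bearer <TOKEN>'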

Note

  • The above API snippet will only work with Run:ai clusters of version 2.18 and above. For older clusters, use the now deprecated Cluster API.
  • For more information on the Training Submit API, see the API documentation.

This starts an unattended training Workload for team-a with an allocation of a single GPU. The Workload is based on the sample Docker image runai.jfrog.io/demo/quickstart and is named train1.
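
When submitting via the API, it can help to pretty-print the JSON response so you can see what the server returned (the created Workload's details on success, or an error payload otherwise). A minimal sketch, assuming python3 is available locally and reusing the placeholders described above:

# Pipe the submission response through a JSON pretty-printer for inspection.
curl -L -s 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{"name": "train1", "projectId": "<PROJECT-ID>", "clusterId": "<CLUSTER-UUID>",
       "spec": {"image": "runai.jfrog.io/demo/quickstart", "compute": {"gpuDevicesRequest": 1}}}' \
  | python3 -m json.tool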

List Workloads

Follow up on the Workload's progress as follows.

Using the CLI V1, run:

runai list jobs

The result lists the Workload together with its current status.

Using the CLI V2, run:

runai training list

The result:

Workload               Type        Status      Project     Preemptible      Running/Requested Pods     GPU Allocation
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
train1                 Training    Running     team-a      Yes              1/1                        0.00
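
To keep an eye on the status until the Workload completes, you can simply re-run the list command at an interval, for example with the standard watch utility (assuming it is installed on your machine):

# Refresh the Workload list every 10 seconds.
watch -n 10 runai training list
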
Using the user interface:

  • Open the Run:ai user interface.
  • Under Workloads you can view the new training Workload.


Select the Workload and press Show Details to see the Workload details.


Under Metrics you can see utilization graphs.

Typical statuses you may see:

  • ContainerCreating - the container image is being downloaded from the registry
  • Pending - the Workload is waiting to be scheduled
  • Running - the Workload is running
  • Succeeded - the Workload has ended

A full list of Workload statuses can be found here
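
If a script needs to wait for the Workload to finish, it can poll the list output until the Succeeded status appears. A minimal sketch using only the CLI V2 list command and standard shell tools (note that a Workload that fails would make this loop run indefinitely; this is a sketch, not a robust wait):

# Poll every 30 seconds until train1 reports the Succeeded status.
until runai training list | grep train1 | grep -q Succeeded; do
  sleep 30
done
echo "train1 has completed"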

Describe Workload

To get additional status information on your Workload, run one of the following.

Using the CLI V1:

runai describe job train1

Using the CLI V2:

runai training describe train1

Using the user interface, Workload parameters can be viewed by adding more columns to the Workloads list and by reviewing the Event History tab for the specific Workload.

View Logs

Using the CLI V1, run:

runai logs train1

Using the CLI V2, run:

runai training logs train1

In both cases you should see the log of the running container.

Using the user interface, select the Workload and press Show Details. Under Logs you can see the logs emitted from the container.
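
Logs are often easier to inspect once saved locally. A minimal sketch that reuses the CLI V1 command shown above together with standard shell redirection (use the CLI V2 equivalent if that is the variant you installed):

# Save the container log to a local file, then search it with standard tools.
runai logs train1 > train1.log
grep -i error train1.log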

Stop Workload

Using the CLI V1, run:

runai delete job train1

Using the CLI V2, run:

runai training delete train1

Using the user interface, select the Workload and press DELETE.

This stops the training Workload. You can verify this by listing the training Workloads again, as sketched below.
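
For example, with the CLI V2, a quick check that train1 is gone (a minimal sketch using standard shell tools):

# Print the matching line if train1 is still listed; otherwise report that it is gone.
runai training list | grep train1 || echo "train1 is no longer listed"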

Next Steps