Quickstart: Launch Unattended Training Workloads¶
Introduction¶
This article provides a quick ramp-up to running an unattended training Workload. Training Workloads are containers that execute a program on start and shut down automatically when the task is done.
With this Quickstart you will learn how to:
- Start a deep learning training workload.
- View training workload status and resource consumption using the Run:ai user interface and the Run:ai CLI.
- View training workload logs.
- Stop the training workload.
There are various ways to submit a training Workload:
- Run:ai command-line interface (CLI)
- Run:ai user interface
- Run:ai API
Prerequisites¶
To complete this Quickstart, the Platform Administrator will need to provide you with:
- Researcher access to a Run:ai Project named "team-a". The Project should be assigned a quota of at least 1 GPU.
- The URL of the Run:ai Console, e.g. https://acme.run.ai.
To complete this Quickstart via the CLI, you will need to have the Run:ai CLI installed on your machine. There are two available CLI variants:
- The older V1 CLI. See installation here
- A newer V2 CLI, supported with clusters of version 2.18 and up. See installation here
Step by Step Walkthrough¶
Login¶
- CLI (V1 or V2): Run runai login and enter your credentials.
- User interface: Browse to the provided Run:ai user interface and log in with your credentials.
- API: To use the API, you will need to obtain a token. Please follow the API authentication article.
Run Workload¶
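CLI V1: Open a terminal and submit the Workload with the workload submit command.
Note
For more information on the workload submit command, see the CLI documentation.
CLI V2: Open a terminal and submit the Workload with the training submit command.
Note
For more information on the training submit command, see the CLI documentation.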
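A minimal sketch of the CLI submission, under stated assumptions: the V1 form is taken to be `runai config project` followed by `runai submit`, and the V2 form `runai project set` followed by `runai training submit`, with `-i` for the image and `-g` for the GPU count. The wrapper function names are hypothetical; confirm the exact syntax with the CLI help for your version.

```shell
# Sketch only: the command forms below are assumptions, not verified syntax.

# V1 CLI: select the project, then submit a one-GPU training Workload.
submit_train1_v1() {
  runai config project team-a
  runai submit train1 -i runai.jfrog.io/demo/quickstart -g 1
}

# V2 CLI: same steps with the newer subcommands.
submit_train1_v2() {
  runai project set team-a
  runai training submit train1 -i runai.jfrog.io/demo/quickstart -g 1
}
```

Call `submit_train1_v1` or `submit_train1_v2` depending on which CLI variant is installed.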
- In the Run:ai UI select Workloads
- Select New Workload and then Training
- You should already have Cluster, Project and a start from scratch Template selected. Enter train1 as the name and press CONTINUE.
- Select NEW ENVIRONMENT. Enter quickstart as the name and runai.jfrog.io/demo/quickstart as the image. Then select CREATE ENVIRONMENT.
- When the previous screen comes up, select one-gpu under the Compute resource.
- Select CREATE TRAINING.
Note
For more information on submitting Workloads and creating Assets via the user interface, see Workload documentation.
curl -L 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "train1",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
      "image": "runai.jfrog.io/demo/quickstart",
      "compute": {
        "gpuDevicesRequest": 1
      }
    }
  }'
- <COMPANY-URL> is the link to the Run:ai user interface. For example, acme.run.ai.
- <TOKEN> is an API access token. See above on how to obtain a valid token.
- <PROJECT-ID> is the ID of the team-a Project. You can get the Project ID via the Get Projects API.
- <CLUSTER-UUID> is the unique identifier of the Cluster. You can get the Cluster UUID by adding the "Cluster ID" column to the Clusters view.
Note
- The above API snippet will only work with Run:ai clusters of version 2.18 and above. For older clusters, use the now-deprecated Cluster API.
- For more information on the Training Submit API, see the API Documentation.
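The request above can also be scripted so that the placeholders come from variables rather than being edited into the command. The following is a sketch, not an official client: `RUNAI_URL`, `RUNAI_TOKEN`, `PROJECT_ID` and `CLUSTER_UUID` are hypothetical variable names, and `training_payload` and `submit_training` are hypothetical helpers. The endpoint and JSON fields are the ones shown in the snippet above.

```shell
# Sketch only: variable and function names are illustrative assumptions.

# Print the JSON request body for a one-GPU training Workload.
# Usage: training_payload <name> <project-id> <cluster-uuid> <image>
training_payload() {
  cat <<EOF
{
  "name": "$1",
  "projectId": "$2",
  "clusterId": "$3",
  "spec": {
    "image": "$4",
    "compute": { "gpuDevicesRequest": 1 }
  }
}
EOF
}

# Send the request; --fail makes curl exit non-zero on an HTTP error,
# and "-d @-" reads the request body from standard input.
# Usage: submit_training <name> <image>
submit_training() {
  training_payload "$1" "$PROJECT_ID" "$CLUSTER_UUID" "$2" |
    curl -L --fail "https://$RUNAI_URL/api/v1/workloads/trainings" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer $RUNAI_TOKEN" \
      -d @-
}
```

With the four variables exported, `submit_training train1 runai.jfrog.io/demo/quickstart` sends the same request as the curl command above.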
This starts an unattended training Workload for team-a with an allocation of a single GPU. The Workload is based on a sample docker image, runai.jfrog.io/demo/quickstart. We named the Workload train1.
List Workloads¶
Follow up on the Workload's progress by running:
The result:
Typical statuses you may see:
- ContainerCreating - the docker container image is being downloaded from the cloud repository
- Pending - the Workload is waiting to be scheduled
- Running - the Workload is running
- Succeeded - the Workload has ended
A full list of Workload statuses can be found here
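A minimal sketch of listing Workloads and waiting for a status, under stated assumptions: the V1 list command is taken to be `runai list jobs` and the V2 command `runai training list`. The helper `wait_for_status` is hypothetical; it takes any command that prints a single status word, so it does not depend on the exact layout of the list output.

```shell
# Sketch only: the command forms are assumptions; check the CLI help.

# List Workloads with the V1 CLI.
list_workloads_v1() { runai list jobs; }

# List training Workloads with the V2 CLI.
list_workloads_v2() { runai training list; }

# Hypothetical helper: run a status-printing command repeatedly until it
# prints the wanted status, or give up after a number of attempts.
# Usage: wait_for_status <wanted-status> <max-attempts> <command...>
wait_for_status() {
  wfs_wanted="$1"
  wfs_attempts="$2"
  shift 2
  while [ "$wfs_attempts" -gt 0 ]; do
    wfs_status="$("$@")"
    echo "status: $wfs_status"
    if [ "$wfs_status" = "$wfs_wanted" ]; then
      return 0
    fi
    wfs_attempts=$((wfs_attempts - 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  return 1
}
```

For example, `wait_for_status Succeeded 60 my_status_command` polls every 5 seconds by default, where `my_status_command` stands for whatever command extracts the status of train1 in your environment.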
Describe Workload¶
To get additional status on your Workload run:
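A minimal sketch of this step, assuming the V1 form `runai describe job <name>` and the V2 form `runai training describe <name>`. The wrapper `describe_workload` is a hypothetical helper, not part of the CLI.

```shell
# Sketch only: the command forms are assumptions; check the CLI help.
# Usage: describe_workload <v1|v2> <name>
describe_workload() {
  case "$1" in
    v1) runai describe job "$2" ;;
    v2) runai training describe "$2" ;;
    *)  echo "usage: describe_workload <v1|v2> <name>" >&2; return 2 ;;
  esac
}
```

For the Workload in this Quickstart, that would be `describe_workload v2 train1` (or `v1` for the older CLI).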
View Logs¶
Run the following:
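A minimal sketch of viewing the logs, assuming the V1 form `runai logs <name>` and the V2 form `runai training logs <name>`. The wrapper `workload_logs` is a hypothetical helper.

```shell
# Sketch only: the command forms are assumptions; check the CLI help.
# Usage: workload_logs <v1|v2> <name>
workload_logs() {
  case "$1" in
    v1) runai logs "$2" ;;
    v2) runai training logs "$2" ;;
    *)  echo "usage: workload_logs <v1|v2> <name>" >&2; return 2 ;;
  esac
}
```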
Stop Workload¶
Run the following:
This stops the training workload. You can verify this by listing training workloads again.
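A minimal sketch of stopping the Workload and then listing again to verify. It assumes that stopping is done with the delete subcommand (`runai delete job <name>` in V1, `runai training delete <name>` in V2) and reuses the assumed list commands `runai list jobs` and `runai training list`. The wrapper `stop_workload` is a hypothetical helper.

```shell
# Sketch only: the command forms are assumptions; check the CLI help.
# Deletes the Workload, then lists Workloads so the removal can be checked.
# Usage: stop_workload <v1|v2> <name>
stop_workload() {
  case "$1" in
    v1) runai delete job "$2" && runai list jobs ;;
    v2) runai training delete "$2" && runai training list ;;
    *)  echo "usage: stop_workload <v1|v2> <name>" >&2; return 2 ;;
  esac
}
```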
Next Steps¶
- Follow the Quickstart document: Launch Interactive Workloads.
- Use your container to run an unattended training workload.