# Workloads Overview
Run:ai is an open platform that supports three types of workloads, each with a different set of features:
- Run:ai native workloads.
- Third party integrations.
- Typical Kubernetes workloads.
## Run:ai native workloads
Run:ai native workloads are workloads (trainings, workspaces, deployments) that are fully controlled by Run:ai. They are the most comprehensive workload type and include the capabilities of the Third party integrations and Typical Kubernetes workload types. Specific characteristics of Run:ai native workloads include:
- Submitting workloads via the UI or CLI (see the example after this list).
- Workload control (delete/stop/connect).
- Workload policies (default rules for all workloads, policies for specific workload types, and enforcement of those rules).
- Scheduling rules.
- Role-based access control.
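As a sketch of programmatic submission, the snippet below posts a training workload to the Run:ai control plane. It is a minimal illustration only: the base URL, endpoint path, payload fields, and token handling are assumptions, so check the platform API documentation for the exact contract.

```python
# Hedged sketch: submitting a Run:ai native training workload through the REST API.
# The endpoint path, payload fields, and identifiers below are assumptions for
# illustration; consult the platform API documentation for the real contract.
import requests

BASE_URL = "https://myorg.run.ai"   # assumed control-plane URL
TOKEN = "<bearer-token>"            # obtained separately (for example, from an application token)

payload = {
    "name": "resnet-train",                      # example workload name
    "projectId": "<project-id>",                 # assumed field
    "clusterId": "<cluster-uuid>",               # assumed field
    "spec": {
        "image": "nvcr.io/nvidia/pytorch:24.01-py3",
        "compute": {"gpuDevicesRequest": 1},     # assumed field for requesting one GPU
    },
}

resp = requests.post(
    f"{BASE_URL}/api/v1/workloads/trainings",    # assumed endpoint
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Submitted workload:", resp.json())
```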
## Third party integrations
Third party integrations are tools that Run:ai supports and manages, and that are typically used to build workloads for specific purposes. Third party integrations also include the capabilities of Typical Kubernetes workloads. Specific characteristics of third party tool support include:
- Smart gang scheduling (workload aware).
- Specific workload aware visibility so that different kinds of pods are identified as a single workload (for example, GPU Utilization, workload view, dashboards).
For more information, see Supported integrations.
## Typical Kubernetes workloads
Typical Kubernetes workloads are any kind of workload built for Kubernetes. The Run:ai platform allows you to submit them using standard Kubernetes APIs and CRDs (a short example follows the list below). Specific characteristics that Run:ai can manage for Typical Kubernetes workloads include:
- Fairness
- Nodepools
- Bin packing/spread
- Fractions
- Overprovisioning
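A minimal sketch of a Typical Kubernetes workload is shown below: a plain batch/v1 Job submitted with the official Kubernetes Python client and pointed at the Run:ai scheduler. The scheduler name and namespace are assumptions (they depend on your installation), so adjust them to your cluster.

```python
# Sketch: submitting a plain Kubernetes Job and letting the Run:ai scheduler place it.
# The scheduler name and namespace below are assumptions; verify them for your cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

container = client.V1Container(
    name="trainer",
    image="python:3.11-slim",
    command=["python", "-c", "print('hello from a typical Kubernetes workload')"],
)

pod_spec = client.V1PodSpec(
    restart_policy="Never",
    scheduler_name="runai-scheduler",   # assumed Run:ai scheduler name
    containers=[container],
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="typical-k8s-job", namespace="runai-team-a"),  # assumed namespace
    spec=client.V1JobSpec(template=client.V1PodTemplateSpec(spec=pod_spec)),
)

client.BatchV1Api().create_namespaced_job(namespace="runai-team-a", body=job)
```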
## Workloads View
Run:ai makes it easy to run machine learning workloads effectively on Kubernetes. It provides both a UI and an API that offer a simpler, more efficient way to manage machine learning workloads, which appeals to data scientists and engineers alike.
The Workloads view replaces the Jobs, Trainings, and Workspaces views. The new table format lets you:
- Change the layout of the Workloads table by pressing Columns to add or remove columns.
- Download the table to a CSV file by pressing More, then pressing Download as CSV.
- Search for a workload by pressing Search and entering the name of the workload.
- Use advanced workload management.
- Track the workload flow more closely with the added workload statuses.
To enable the Workloads view, press Try Workloads. To return to the Jobs view, press Go Back To Jobs View.
To create new workloads, press New Workload.
## API Documentation
Access the platform API documentation for more information on using the API to manage workloads.
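As one illustration of what the API can be used for, the sketch below lists workloads and prints their phase. The endpoint path and response fields are assumptions for illustration; the API documentation is the authoritative reference.

```python
# Hedged sketch: listing workloads via the Run:ai REST API and printing their phase.
# The endpoint and response fields are assumptions; see the API documentation.
import requests

BASE_URL = "https://myorg.run.ai"   # assumed control-plane URL
TOKEN = "<bearer-token>"

resp = requests.get(
    f"{BASE_URL}/api/v1/workloads",                  # assumed endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for wl in resp.json().get("workloads", []):          # assumed response field
    print(wl.get("name"), "-", wl.get("phase"))
```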
## Managing Workloads
You can manage a workload by selecting one from the view (an API-based example follows this list). Once selected, you can:
- Delete a workload
- Connect to a workload
- Stop a workload
- Activate a workload
- Show details—provides in-depth information about the selected workload including:
- Event history—workload status over time. Use the filter to search through the history for specific events.
- Metrics—metrics for GPU utilization, CPU usage, GPU memory usage, and CPU memory usage. Use the date selector to choose the time period for the metrics.
- Logs—logs of the current status. Use the Download button to download the logs.
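The same management actions can be driven programmatically. The sketch below stops, activates, and deletes a workload through the API; the endpoint paths and verbs are assumptions for illustration only and should be checked against the API documentation.

```python
# Hedged sketch: stop, activate, and delete a selected workload via the REST API.
# All endpoint paths and verbs below are assumptions; check the API documentation.
import requests

BASE_URL = "https://myorg.run.ai"     # assumed control-plane URL
HEADERS = {"Authorization": "Bearer <bearer-token>"}
WORKLOAD_ID = "<workload-uuid>"       # taken from the Workloads view or a list call

# Stop: keep the workload definition but release its resources (assumed endpoint).
requests.post(f"{BASE_URL}/api/v1/workloads/trainings/{WORKLOAD_ID}/suspend",
              headers=HEADERS, timeout=30).raise_for_status()

# Activate: resume a stopped workload (assumed endpoint).
requests.post(f"{BASE_URL}/api/v1/workloads/trainings/{WORKLOAD_ID}/resume",
              headers=HEADERS, timeout=30).raise_for_status()

# Delete: decommission the workload and its resources (assumed endpoint).
requests.delete(f"{BASE_URL}/api/v1/workloads/trainings/{WORKLOAD_ID}",
                headers=HEADERS, timeout=30).raise_for_status()
```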
## Workloads Status
The Status column shows the current status of the workload. The following table describes the statuses presented:
| Phase Name | Description | Entry Condition | Exit Condition |
|---|---|---|---|
| Creating | Workload setup is initiated in the cluster. Resources and pods are now provisioning. | A workload is submitted. | A multi-pod group is created. |
| Pending | Workload is queued and awaiting resource allocation. | A pod group exists. | All pods are scheduled. |
| Initializing | Workload is retrieving images, starting containers, and preparing pods. | All pods are scheduled (handling of multi-pod groups TBD). | All pods are initialized or a failure to initialize is detected. |
| Running | Workload is currently in progress with all pods operational. | All pods are initialized (all containers in the pods are ready). | Job completion or failure. |
| Degraded | Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details. | Pending: all pods are running but with issues. Running: all pods are running but with issues. | Running: all resources are OK. Completed: job finished with fewer resources. Failed: job failure or user-defined rules. |
| Deleting | Workload and its associated resources are being decommissioned from the cluster. | Deleting of the workload. | Resources are fully deleted. |
| Stopped | Workload is on hold; resources are intact but inactive. | Stopping the workload without deleting resources. | Transitioning back to the Initializing phase or proceeding to delete the workload. |
| Failed | Image retrieval failed or containers experienced a crash. Check your logs for specific details. | An error occurs that prevents the successful completion of the job. | Terminal state. |
| Completed | Workload has successfully finished its execution. | The job has finished processing without errors. | Terminal state. |
## Successful flow
A successful workload follows this flow chart:
```mermaid
flowchart LR
    A(Creating) --> B(Pending)
    B --> C(Initializing)
    C --> D(Running)
    D --> E(Completed)
```
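A client that submits a workload typically waits for it to move through these phases. The minimal polling sketch below checks the phase until a terminal state is reached; the endpoint and the phase field name are assumptions, so verify them against the API documentation.

```python
# Minimal polling sketch: wait until a workload reaches a terminal phase.
# The endpoint and the "phase" field are assumptions; see the API documentation.
import time
import requests

BASE_URL = "https://myorg.run.ai"   # assumed control-plane URL
TOKEN = "<bearer-token>"
WORKLOAD_ID = "<workload-uuid>"

TERMINAL_PHASES = {"Completed", "Failed"}

while True:
    resp = requests.get(
        f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}",     # assumed endpoint
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    phase = resp.json().get("phase", "Unknown")           # assumed response field
    print("Current phase:", phase)
    if phase in TERMINAL_PHASES:
        break
    time.sleep(30)
```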
To get the full experience of Run:ai's environment and platform, use the following types of workloads:
- Workspaces
- Trainings (only available when using the Jobs view)
- Distributed trainings
- Deployments
## Supported integrations
To assist you with other platforms and other types of workloads, use the integrations listed below.