Run:ai System Components¶

Components¶

Run:ai is made up of two components:

The Run:ai cluster provides scheduling services and workload management.
The Run:ai control plane provides resource management, Workload submission and cluster monitoring.

Technology-wise, both are installed over a Kubernetes Cluster.

Run:ai users:

Researchers submit Machine Learning workloads via the Run:ai Console, the Run:ai Command-Line Interface (CLI), or directly by sending YAML files to Kubernetes.
Administrators monitor and set priorities via the Run:ai User Interface

Run:ai Cluster¶

Run:ai comes with its own Scheduler. The Run:ai scheduler extends the Kubernetes scheduler. It uses business rules to schedule workloads sent by Researchers.
Run:ai schedules Workloads. Workloads include the actual researcher code running as a Kubernetes container, together with all the system resources required to run the code, such as user storage, network endpoints to access the container etc.
The cluster uses an outbound-only, secure connection to synchronize with the Run:ai control plane. Information includes meta-data sync and various metrics on Workloads, Nodes etc.
The Run:ai cluster is installed as a Kubernetes Operator
Run:ai is installed in its own Kubernetes namespace named runai
Workloads are run in the context of Run:ai Projects. Each Project is mapped to a Kubernetes namespace with its own settings and access control.

Run:ai Control Plane on the cloud¶

The Run:ai control plane is used by multiple customers (tenants) to manage resources (such as Projects & Departments), submit Workloads and monitor multiple clusters.

A single Run:ai customer (tenant) defined in the control-plane, can manage multiple Run:ai clusters. So a single customer, can manage mutltiple GPU clusters in multiple locations/subnets from a single interface.

Self-hosted Control-Plane¶

The Run:ai control plane can also be locally installed. To understand the various installation options see the installation types document.