Cluster Install
Below are instructions on how to install a Run:ai cluster.
Prerequisites¶
Before installing, please review the installation prerequisites here: Run:ai GPU Cluster Prerequisites.
Important
We strongly recommend running the Run:ai pre-install script to verify that all prerequisites are met.
Install Run:ai¶
Log in to the Run:ai user interface at <company-name>.run.ai using the credentials provided by Run:ai Customer Support:
- If no clusters are currently configured, you will see the cluster installation wizard.
- If a cluster has already been configured, open the menu on the top left, select Clusters, and then click New Cluster on the top left.
Using the cluster wizard:
- Choose a name for your cluster.
- Choose the Run:ai version for the cluster.
- Choose a target Kubernetes distribution (see table for supported distributions).
- (SaaS and remote self-hosted cluster only) Enter a URL for the Kubernetes cluster. The URL need only be accessible within the organization's network. For more information, see here.
- Press Continue.
On the next page:
- (SaaS and remote self-hosted cluster only) Install a trusted certificate to the domain entered above.
- Run the Helm command provided in the wizard.
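The wizard generates the exact Helm command for your cluster, including a cluster-specific token. As a hedged sketch only, the flow usually resembles a standard helm upgrade --install invocation; the chart name and --set keys below are assumptions, not the real values, so the snippet builds and prints the command rather than running it:

```shell
# Sketch only -- copy the exact command from the wizard.
# Chart name and --set keys below are assumptions, not the real values.
CLUSTER_NAME="runai-cluster"
CONTROL_PLANE_URL="https://<company-name>.run.ai"

HELM_CMD="helm upgrade --install ${CLUSTER_NAME} runai/cluster \
  --namespace runai --create-namespace \
  --set controlPlane.url=${CONTROL_PLANE_URL} \
  --set controlPlane.clientSecret=<token-from-wizard>"

# Print the command instead of executing it, since the real token
# and chart coordinates come only from the wizard.
echo "${HELM_CMD}"
```

Against a real cluster, run the command copied verbatim from the wizard; it will also tell you whether you first need to add the Run:ai Helm repository (helm repo add / helm repo update).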
Verify your Installation¶
- Go to <company-name>.run.ai/dashboards/now.
- Verify that the number of GPUs on the top right reflects the GPU resources in your cluster, and that the list of machines with GPU resources appears at the bottom.
- Run the following command (assumes that yq is installed):

```shell
kubectl get cm runai-public -n runai -o jsonpath='{.data}' | yq -P
```
Example output:
```yaml
cluster-version: 2.9.0
runai-public:
  version: 2.9.0
  runaiConfigStatus: # (1)
    conditions:
      - type: DependenciesFulfilled
        status: "True"
        reason: dependencies_fulfilled
        message: Dependencies are fulfilled
      - type: Deployed
        status: "True"
        reason: deployed
        message: Resources Deployed
      - type: Available
        status: "True"
        reason: available
        message: System Available
      - type: Reconciled
        status: "True"
        reason: reconciled
        message: Reconciliation completed successfully
  optional: # (2)
    knative: # (3)
      components:
        hpa:
          available: true
        knative:
          available: true
        kourier:
          available: true
    mpi: # (4)
      available: true
```
- Verifies that all mandatory dependencies are met: NVIDIA GPU Operator, Prometheus, and the NGINX ingress controller.
- Checks whether optional product dependencies have been met.
- See Inference prerequisites.
- See distributed training prerequisites.
For a more extensive verification of cluster health, see Determining the health of a cluster.
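As a quick scripted follow-up, the runaiConfigStatus conditions can be scanned for any status that is not "True". The snippet below inlines a condensed copy of the example output so it is self-contained; on a live cluster you would pipe the kubectl command from the previous step into the same awk filter instead:

```shell
# Count conditions whose status is not "True".
# Here the example output is inlined; on a live cluster, replace the
# heredoc with:  kubectl get cm runai-public -n runai -o jsonpath='{.data}'
failed=$(cat <<'EOF' | awk '/status:/ && $2 != "\"True\"" {count++} END {print count+0}'
conditions:
  - type: DependenciesFulfilled
    status: "True"
  - type: Deployed
    status: "True"
  - type: Available
    status: "True"
  - type: Reconciled
    status: "True"
EOF
)
echo "conditions not True: ${failed}"
```

A result of 0 means every condition reported "True"; any other number means at least one condition needs investigation.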
Researcher Authentication¶
If you will be using the Run:ai command-line interface or submitting YAML files directly to Kubernetes, you must now set up Researcher Access Control.
Customize your installation¶
To customize specific aspects of the cluster installation see customize cluster installation.
(Optional) Set Node Roles¶
When installing a production cluster you may want to:
- Set one or more Run:ai system nodes. These are nodes dedicated to Run:ai software.
- Machine learning workloads frequently include jobs that require CPUs but no GPUs. You may want to direct these jobs to dedicated nodes without GPUs, so that they do not overload the GPU machines.
- Limit Run:ai to specific nodes in the cluster.
To perform these tasks, see Set Node Roles.
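Applying a node role typically comes down to labeling the node with kubectl. The sketch below only builds and prints the command, and the label key is an assumption; use the exact labels documented under Set Node Roles:

```shell
# Sketch only -- the label key below is an assumption; take the real
# one from the Set Node Roles documentation.
NODE_NAME="gpu-node-1"                                         # hypothetical node
SYSTEM_ROLE_LABEL="node-role.kubernetes.io/runai-system=true"  # assumed key

LABEL_CMD="kubectl label node ${NODE_NAME} ${SYSTEM_ROLE_LABEL}"
echo "${LABEL_CMD}"
```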
Next Steps¶
- Set up Run:ai users. See Working with Users.
- Set up Projects for Researchers. See Working with Projects.
- Set up Researchers to work with the Run:ai command-line interface (CLI). See Installing the Run:ai Command-line Interface for how to install the CLI for users.
- Review advanced setup and maintenance scenarios.