
Cluster Install

Below are instructions on how to install the Run:AI cluster. Before installing, please review the installation prerequisites here: Run:AI GPU Cluster Prerequisites.

Step 1: Kubernetes

Run:AI has been tested with the following certified Kubernetes distributions:

Target Platform | Description | Notes
----------------|-------------|------
on-prem | Kubernetes is installed by the customer and not managed by a service | Examples: native installation, Kubespray
EKS | Amazon Elastic Kubernetes Service |
AKS | Azure Kubernetes Service |
GKE | Google Kubernetes Engine |
OCP | OpenShift Container Platform | The Run:AI operator is certified for OpenShift by Red Hat. Note: Run:AI can only be deployed on OpenShift using the Self-Hosted installation.
RKE | Rancher Kubernetes Engine | When installing Run:AI, select On Premise. You must also perform the mandatory extra step here.
Ezmeral | HPE Ezmeral Container Platform | See Run:AI at the Ezmeral marketplace.

A full list of Kubernetes partners can be found here: https://kubernetes.io/docs/setup/. In addition, Run:AI provides instructions for a simple (non-production-ready) Kubernetes installation.
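Whichever distribution you use, it can help to confirm that kubectl is pointed at the target cluster and that all nodes are Ready before continuing. This is a generic sanity check, not a Run:AI requirement:

# All nodes should report STATUS "Ready"
kubectl get nodes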

Step 2: NVIDIA

There are two alternatives for installing NVIDIA prerequisites:

  1. Install the NVIDIA CUDA Toolkit and NVIDIA Docker on each node with GPUs. See NVIDIA Drivers installation for details.
  2. Install the NVIDIA GPU Operator on Kubernetes once. To install, use the Getting Started guide. Note that the document contains a separate section in the case where the NVIDIA CUDA Toolkit is already installed on the nodes.
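As a rough sketch of option 2, the Getting Started guide installs the operator via Helm. Exact flags vary between operator versions, so treat the commands below as illustrative and follow NVIDIA's guide for the authoritative steps:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Installs the GPU Operator, including the NVIDIA driver, on GPU nodes
helm install --wait --generate-name nvidia/gpu-operator
# If the NVIDIA CUDA Toolkit is already installed on the nodes, NVIDIA's
# guide covers installing with the bundled driver disabled, for example:
# helm install --wait --generate-name nvidia/gpu-operator --set driver.enabled=false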

Important

  • If you are using DGX OS, then the NVIDIA prerequisites are already installed and you may skip to the next step.
  • The combination of NVIDIA A100 hardware and the CoreOS operating system (which is popular when using OpenShift) will only work with option 2: the NVIDIA GPU Operator, version 1.8 or higher.

NVIDIA Device Plugin

Run:AI has customized the NVIDIA device plugin for Kubernetes. If you have installed the NVIDIA GPU Operator, or have previously installed this plugin yourself, run the following to disable the existing plugin:

# Point the stock device-plugin daemonset at a node selector that matches no
# node, so that Run:AI's customized device plugin can take over
kubectl -n gpu-operator-resources patch daemonset nvidia-device-plugin-daemonset \
   -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'

Step 3: Install Run:AI

Log in to the Run:AI Admin UI at https://app.run.ai, using the credentials provided by Run:AI Customer Support:

  • If no clusters are currently configured, you will see a cluster installation wizard.
  • If a cluster has already been configured, open the menu at the top left, select "Clusters", and then click "Add New Cluster" at the top right.

Using the Wizard:

  1. Choose a target Kubernetes platform (see table above)
  2. Download the Helm values file runai-<cluster-name>.yaml
  3. (Optional) Customize the values file. See Customize Cluster Installation
  4. Install Helm
  5. For RKE only, perform the steps here
  6. Run:
helm repo add runai https://run-ai-charts.storage.googleapis.com
helm repo update

helm install runai-cluster runai/runai-cluster -n runai --create-namespace \
    -f runai-<cluster-name>.yaml
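After the chart installs, a quick way to confirm the release is healthy is the following generic Helm/Kubernetes check (not a Run:AI-specific step):

# The release should report "deployed" and all pods should reach Running
helm list -n runai
kubectl get pods -n runai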

Step 4: Verify your Installation

  • Go to https://app.run.ai/dashboards/now.
  • Verify that the number of GPUs shown at the top right matches the GPU resources in your cluster, and that the list of machines with GPU resources appears at the bottom.

For a more extensive verification of cluster health, see Determining the health of a cluster.
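To cross-check the dashboard numbers against the cluster itself, you can list per-node GPU capacity. This is a quick sketch, assuming the standard nvidia.com/gpu resource name exposed by the NVIDIA device plugin:

# Each GPU node should report its expected GPU count
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'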

Step 5: (Optional) Set Node Roles

When installing a production cluster you may want to:

  • Set one or more Run:AI system nodes. These are nodes dedicated to Run:AI software.
  • Machine learning workloads frequently include jobs that need CPUs but not GPUs. You may want to direct these jobs to dedicated CPU-only nodes so as not to overload the GPU machines.
  • Limit Run:AI to specific nodes in the cluster.

To perform these tasks, see Set Node Roles.
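Node roles are typically applied as node labels. The sketch below is illustrative only, with a hypothetical label key; the actual keys and tooling are defined in the Set Node Roles document:

# Hypothetical label key; replace with the label documented in Set Node Roles
kubectl label node <node-name> node-role.kubernetes.io/runai-system=true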

Next Steps

