Cluster Install

Below are instructions on how to install the Run:AI cluster. Before installing, review the installation prerequisites here: Run:AI GPU Cluster Prerequisites.

Step 1: Kubernetes

Run:AI has been tested with the following certified Kubernetes distributions:

Target Platform | Description | Notes
On-Premise | Kubernetes is installed by the customer and not managed by a service | Example: native installation, Kubespray
EKS | Amazon Elastic Kubernetes Service |
AKS | Azure Kubernetes Service |
GKE | Google Kubernetes Engine |
OCP | OpenShift Container Platform | The Run:AI operator is certified for OpenShift by Red Hat.
RKE | Rancher Kubernetes Engine | When installing Run:AI, select On-Premise. You must perform the mandatory extra step here.
Ezmeral | HPE Ezmeral Container Platform | See Run:AI at the Ezmeral marketplace.

A full list of Kubernetes partners can be found here: https://kubernetes.io/docs/setup/. In addition, Run:AI provides instructions for a simple (non production-ready) Kubernetes Installation.

Step 2: NVIDIA

There are two alternatives for installing NVIDIA prerequisites:

  1. Use the NVIDIA GPU Operator on Kubernetes. To install the NVIDIA GPU Operator use the Getting Started guide.
  2. Install the NVIDIA CUDA Toolkit and NVIDIA Docker on each node with GPUs. See NVIDIA Drivers installation for details.

Important

  • The options are mutually exclusive. If the NVIDIA CUDA toolkit is installed, you will not be able to install the NVIDIA GPU Operator.
  • NVIDIA GPU Operator does not currently work with DGX. If you are using DGX OS then NVIDIA prerequisites are already installed and you may skip to the next step.
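If you choose the GPU Operator route (option 1), the Helm-based installation from NVIDIA's Getting Started guide looks roughly like the sketch below. The chart location and release name follow NVIDIA's documented defaults at the time of writing; depending on the operator version, its resources may land in the gpu-operator or gpu-operator-resources namespace, so verify against the current guide:

```shell
# Add NVIDIA's Helm repository and refresh the local chart index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace. It deploys the
# driver, container toolkit, and device plugin as DaemonSets on GPU nodes.
helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace
```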

NVIDIA Device Plugin

Run:AI has customized the NVIDIA device plugin for Kubernetes. If you have installed the NVIDIA GPU Operator or have previously installed this plugin, run the following command to disable the existing plugin:

kubectl -n gpu-operator-resources patch daemonset nvidia-device-plugin-daemonset \
   -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
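To confirm the patch took effect, you can inspect the DaemonSet; because the nodeSelector now matches no nodes, its desired pod count should drop to 0:

```shell
# DESIRED should read 0 after the nodeSelector patch above
kubectl -n gpu-operator-resources get daemonset nvidia-device-plugin-daemonset
```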

Step 3: Install Run:AI

Log in to Run:AI Admin UI at https://app.run.ai. Use credentials provided by Run:AI Customer Support:

  • If no clusters are currently configured, you will see a Cluster installation wizard
  • If a cluster has already been configured, use the menu on the top left and select "Clusters". On the top right, click "Add New Cluster".

Using the Wizard:

  1. Choose a target Kubernetes platform (see table above)
  2. Download a Helm values YAML file runai-<cluster-name>.yaml
  3. (Optional) customize the values file. See Customize Cluster Installation
  4. Install Helm
  5. Run:
helm repo add runai https://run-ai-charts.storage.googleapis.com
helm repo update

helm install runai-cluster runai/runai-cluster -n runai --create-namespace \
    -f runai-<cluster-name>.yaml
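After the helm install completes, it can help to confirm that the cluster components started successfully. A minimal check, using the runai namespace and release name from the command above:

```shell
# Check the Helm release was deployed
helm status runai-cluster -n runai

# All Run:AI pods should eventually reach Running status
kubectl get pods -n runai
```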

Step 4: Verify your Installation

  • Go to https://app.run.ai/dashboards/now.
  • Verify that the number of GPUs shown at the top right reflects the GPU resources in your cluster, and that the list of machines with GPU resources appears at the bottom.
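To cross-check the dashboard's GPU count against what Kubernetes itself reports, one option is the command below (the nvidia.com/gpu resource name assumes the NVIDIA device plugin from Step 2 is advertising GPUs):

```shell
# List each node with the GPU capacity it advertises to the scheduler
kubectl get nodes -o custom-columns=NODE:.metadata.name,GPUS:".status.capacity.nvidia\.com/gpu"
```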

For a more extensive verification of cluster health, see Determining the health of a cluster.

Step 5: (Optional) Set Node Roles

When installing a production cluster you may want to:

  • Set one or more Run:AI system nodes. These are nodes dedicated to Run:AI software.
  • Direct CPU-only jobs to dedicated nodes. Machine learning workflows frequently include jobs that require CPUs but no GPUs; routing them to nodes without GPUs avoids overloading the GPU machines.
  • Limit Run:AI to specific nodes in the cluster.

To perform these tasks, see Set Node Roles.
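As a sketch, node roles are typically assigned by labeling nodes. The exact label keys are defined in the Set Node Roles guide; the keys below are assumptions for illustration and should be verified there:

```shell
# Hypothetical label keys -- confirm the exact names in the Set Node Roles guide.
# Dedicate a node to Run:AI system components:
kubectl label node <node-name> node-role.kubernetes.io/runai-system=true

# Mark a node as a CPU-only worker for non-GPU jobs:
kubectl label node <node-name> node-role.kubernetes.io/runai-cpu-worker=true
```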

Next Steps


Last update: July 22, 2021