Cluster Install

Below are instructions for installing a Run:AI cluster. Before installing, please review the installation prerequisites in Run AI GPU Cluster Prerequisites.

Step 1: NVIDIA

On each machine with GPUs, run steps 1.1 through 1.4. If you are using DGX OS 4.0 or later, you may skip to Step 2.

Step 1.1: Install the CUDA Toolkit

Run:

nvidia-smi

If the command fails, you must install the CUDA Toolkit. Follow the instructions here to install it. When the installation finishes, reboot the machine.
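The check in this step can also be scripted. The following is a minimal sketch, not part of the Run:AI or NVIDIA tooling. The function name needs_cuda_install is hypothetical, and it treats a missing or failing nvidia-smi as the signal that the CUDA Toolkit still has to be installed, as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Run:AI or NVIDIA tooling).
# Succeeds (exit 0) when the given command is missing or fails, which
# this guide treats as "the CUDA Toolkit still has to be installed".
needs_cuda_install() {
    cmd="${1:-nvidia-smi}"
    if command -v "$cmd" >/dev/null 2>&1 && "$cmd" >/dev/null 2>&1; then
        return 1    # command exists and ran successfully
    fi
    return 0
}

if needs_cuda_install; then
    echo "nvidia-smi is missing or failed: install the CUDA Toolkit, then reboot."
else
    echo "nvidia-smi succeeded: continue with Step 1.2."
fi
```

The optional argument exists only so the function can be exercised with another command; in normal use, call it without arguments.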

Step 1.2: Install Docker

Install Docker by following the steps at https://docs.docker.com/engine/install/. For example, you can use the convenience script described on that page:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

Warning

Kubernetes does not currently support the latest Docker version 20.
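To see whether a machine is affected by this warning, you can compare the installed Docker version against major version 20. The sketch below is illustrative only. The helper names docker_major and is_supported_docker are hypothetical, and the script assumes that docker --version prints a line of the form "Docker version X.Y.Z, build ...".

```shell
#!/bin/sh
# Hypothetical helpers for checking the installed Docker version against
# the warning above (major version 20 and later is treated as unsupported).

# Print the major component of a version string such as "19.03.15".
docker_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeed only when the major version is a number below 20.
is_supported_docker() {
    major=$(docker_major "$1")
    case "$major" in
        ''|*[!0-9]*) return 1 ;;    # empty or not a number
    esac
    [ "$major" -lt 20 ]
}

# Assumption: "docker --version" prints "Docker version X.Y.Z, build ...".
if command -v docker >/dev/null 2>&1; then
    installed=$(docker --version | sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p')
    if is_supported_docker "$installed"; then
        echo "Docker $installed is below version 20."
    else
        echo "Docker '$installed' is version 20 or later, or could not be parsed."
    fi
fi
```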

Step 1.3: Install NVIDIA Container Toolkit (previously named NVIDIA Docker)

To install NVIDIA Docker on Debian-based distributions (such as Ubuntu), run the following:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

For RHEL-based distributions, run:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

For a detailed review of the above instructions, see the NVIDIA Container Toolkit installation instructions.
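Both command sequences above build the repository URL from the distribution variable, which joins the ID and VERSION_ID fields of /etc/os-release (for example, ubuntu18.04). The following sketch prints that value without installing anything, which can help confirm the URL before running the commands. The function name distribution_id is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print ID followed by VERSION_ID from an os-release
# style file, the same value the commands above store in $distribution.
distribution_id() {
    file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak out.
    ( . "$file" && printf '%s\n' "$ID$VERSION_ID" )
}

# Print the repository list URL that the Debian-based commands would fetch.
if [ -r /etc/os-release ]; then
    echo "https://nvidia.github.io/nvidia-docker/$(distribution_id)/nvidia-docker.list"
fi
```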

Warning

Kubernetes does not currently support the NVIDIA container runtime, which is the successor of NVIDIA Docker/NVIDIA container toolkit.

Step 1.4: Make NVIDIA Docker the default Docker runtime

Set the NVIDIA runtime as the default Docker runtime on your node. Edit the docker daemon config file at /etc/docker/daemon.json and add the default-runtime key as follows:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Then run the following again:

sudo pkill -SIGHUP dockerd
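After editing the file, it can be useful to confirm that the setting is in place. The sketch below is one possible check, not an official procedure. The function name has_nvidia_default is hypothetical, and it uses a plain text match rather than a JSON parser. The final step assumes that the docker info template field DefaultRuntime is available in your Docker version.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the given Docker daemon config file
# sets "default-runtime" to "nvidia". This is a plain text match, not a
# full JSON parse.
has_nvidia_default() {
    file="${1:-/etc/docker/daemon.json}"
    [ -r "$file" ] &&
        grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$file"
}

if has_nvidia_default; then
    echo "daemon.json sets nvidia as the default runtime."
else
    echo "daemon.json is missing or does not set default-runtime to nvidia."
fi

# When Docker is installed, ask the daemon which runtime it actually uses.
if command -v docker >/dev/null 2>&1; then
    docker info --format '{{.DefaultRuntime}}' 2>/dev/null ||
        echo "Could not query the Docker daemon."
fi
```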

Step 2: Install Kubernetes

There are several good ways to install Kubernetes. A full list can be found here: https://kubernetes.io/docs/setup/. Two good alternatives:

  1. Native installation. For a simple installation, the easiest and fastest way to set up Kubernetes is through a Native Kubernetes Installation.
  2. Kubespray (https://kubespray.io/). Kubespray uses Ansible scripts. Download the latest stable version of Kubespray from https://github.com/kubernetes-sigs/kubespray.

Specific Kubernetes Variants

  • If you are installing Kubernetes using Rancher, please review the extra step here.
  • Run:AI works with OpenShift. For specific installation instructions for OpenShift, please contact Run:AI customer support.

Warning

Run:AI uses a customized version of the NVIDIA Kubernetes device plugin. Do not install the plugin yourself, because Run:AI installs it.

Step 3: Install Run:AI

The following steps assume that you have the Kubernetes command-line tool kubectl on your laptop and that it is configured to point to a functioning Kubernetes cluster.
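One way to confirm this assumption before continuing is a short pre-flight check. The sketch below is illustrative. The function name check_kubectl is hypothetical, and listing nodes is only one possible test of cluster access.

```shell
#!/bin/sh
# Hypothetical pre-flight check: confirm that kubectl is installed and
# can list the nodes of the cluster it is configured for.
check_kubectl() {
    if ! command -v kubectl >/dev/null 2>&1; then
        echo "kubectl not found on PATH"
        return 1
    fi
    if ! kubectl get nodes --request-timeout=10s >/dev/null 2>&1; then
        echo "kubectl cannot reach the cluster"
        return 1
    fi
    echo "kubectl is configured and the cluster responds"
}

check_kubectl || echo "Fix kubectl access before continuing with Step 3.1."
```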

Step 3.1: Install Run:AI

  • Log in to Run:AI Admin UI at https://app.run.ai. Use credentials provided by Run:AI Customer Support.
  • If no clusters are configured, you will see a dialog with instructions on how to install a Run:AI cluster.
  • If a cluster has already been configured, open the menu on the top left and select "Clusters". On the top right, click "Add New Cluster".

Please read the next section before proceeding.

Step 3.2: Customize Installation

The Run:AI Admin UI cluster creation wizard asks you to download a YAML file runai-operator-<cluster-name>.yaml. You must then apply the file to Kubernetes. Before applying to Kubernetes, you may need to edit this file. Examples:
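The apply step itself can be sketched as follows, assuming the downloaded file sits in the current directory and any edits have already been made. The wrapper name apply_operator and the CLUSTER_NAME variable are hypothetical. The file name pattern is the one given above.

```shell
#!/bin/sh
# Hypothetical wrapper around the apply step. CLUSTER_NAME is a
# placeholder; use the name chosen in the Admin UI wizard.
apply_operator() {
    file="runai-operator-$1.yaml"
    if [ ! -f "$file" ]; then
        echo "$file not found in $(pwd)"
        return 1
    fi
    kubectl apply -f "$file"
}

# Run only when a cluster name is supplied, e.g. CLUSTER_NAME=my-cluster.
if [ -n "${CLUSTER_NAME:-}" ]; then
    apply_operator "$CLUSTER_NAME"
fi
```

Checking for the file first gives a clearer message than the kubectl error when the download landed in a different directory.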

Step 4: Verify your Installation

  • Go to https://app.run.ai/dashboards/now.
  • Verify that the number of GPUs shown on the top right matches the GPU resources in your cluster, and that the list of machines with GPU resources appears on the bottom line.

For a more extensive verification of cluster health, see Determining the health of a cluster.
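To cross-check the dashboard figure from the command line, you can total the GPUs that Kubernetes reports per node. This is a rough sketch, not a Run:AI tool. It assumes GPUs are advertised under the nvidia.com/gpu resource name, and the helper name sum_gpus is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: sum the second column of "NAME COUNT" lines,
# skipping non-numeric entries such as "<none>" for nodes without GPUs.
sum_gpus() {
    awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Assumption: GPUs are advertised under the nvidia.com/gpu resource name.
# Prints 0 when the cluster cannot be reached or reports no GPUs.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes --no-headers --request-timeout=10s \
        -o 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
        2>/dev/null | sum_gpus
fi
```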

Step 5: (Optional) Set Node Roles

When installing a production cluster you may want to:

  • Set one or more Run:AI system nodes. These are nodes dedicated to Run:AI software.
  • Direct CPU-only jobs to dedicated nodes. Machine learning frequently involves jobs that need CPU but not GPU. You may want to run these jobs on dedicated nodes that do not have GPUs, so as not to overload the GPU machines.
  • Limit Run:AI to specific nodes in the cluster.

To perform these tasks, see Set Node Roles.

Last update: February 28, 2021