Cluster Install
Below are instructions on how to install a Run:AI cluster. Before installing, please review the installation prerequisites here: Run AI GPU Cluster Prerequisites.
Step 1: NVIDIA
On each machine with GPUs, perform steps 1.1 through 1.4 below. If you are using DGX OS 4.0 or later, you may skip to Step 2.
Step 1.1: Install the CUDA Toolkit
Run:
nvidia-smi
If the command is not successful, you must install the CUDA Toolkit. Follow the instructions here to install it. When the installation is finished, you must reboot your computer.
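As a quick sanity check before moving on, the following sketch (an illustrative convenience, not part of the official steps) confirms the driver is loaded:
# Verify that the NVIDIA driver is loaded and GPUs are visible.
if nvidia-smi > /dev/null 2>&1; then
    echo "NVIDIA driver OK"
else
    echo "Driver missing - install the CUDA Toolkit and reboot" >&2
fi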
Step 1.2: Install Docker
Install Docker by following the steps here: https://docs.docker.com/engine/install/. Specifically, you can use a convenience script provided in the document:
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
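After the script completes, you can verify that Docker is installed and able to run containers (a standard check, not specific to Run:AI):
sudo docker --version
sudo docker run --rm hello-world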
Warning
Kubernetes does not currently support the latest Docker version 20.
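Given this limitation, you may want to pin an earlier Docker release. Recent versions of the convenience script honor a VERSION environment variable; the following is a sketch under that assumption:
# Assumption: the get.docker.com script supports the VERSION variable.
sudo VERSION=19.03 sh get-docker.sh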
Step 1.3: Install NVIDIA Container Toolkit (previously named NVIDIA Docker)
To install NVIDIA Docker on Debian-based distributions (such as Ubuntu), run the following:
# Detect the distribution (e.g. ubuntu18.04) for the repository URL.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# Add NVIDIA's GPG key and package repository.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install nvidia-docker2 and reload the Docker daemon configuration.
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
For RHEL-based distributions, run:
# Detect the distribution (e.g. centos7) for the repository URL.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# Add NVIDIA's package repository, then install and reload Docker.
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
For a detailed review of the above instructions, see the NVIDIA Container Toolkit installation instructions.
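To confirm the toolkit works end to end, you can run nvidia-smi inside a CUDA base image with the NVIDIA runtime (the image tag below is only an example; use any CUDA image available to you):
sudo docker run --rm --runtime=nvidia nvidia/cuda:10.2-base nvidia-smi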
Warning
Kubernetes does not currently support the newer, native Docker GPU integration (docker run --gpus) introduced with the NVIDIA Container Toolkit. Install nvidia-docker2 as shown above.
Step 1.4: Make NVIDIA Docker the default Docker runtime
Set the NVIDIA runtime as the default Docker runtime on your node. Edit the Docker daemon config file at /etc/docker/daemon.json and add the default-runtime key as follows:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then reload the Docker daemon configuration:
sudo pkill -SIGHUP dockerd
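You can confirm that the change took effect by inspecting the Docker daemon configuration:
sudo docker info | grep -i 'default runtime'
# Expected output: Default Runtime: nvidia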
Step 2: Install Kubernetes
There are several good ways to install Kubernetes. A full list can be found here: https://kubernetes.io/docs/setup/. Two good alternatives:
- Native installation. For a simple Kubernetes installation, the easiest and fastest way to set up Kubernetes is through a Native Kubernetes Installation (a minimal kubeadm sketch appears after this list).
- Kubespray https://kubespray.io/. Kubespray uses Ansible scripts. Download the latest stable version of Kubespray from: https://github.com/kubernetes-sigs/kubespray.
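For illustration, here is a minimal single-node sketch using kubeadm. It assumes kubeadm, kubelet, and kubectl are already installed on the node; the pod network CIDR and the Flannel add-on are assumptions, and any supported network add-on works:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# Make the cluster reachable by kubectl for the current user.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Install a pod network add-on (Flannel shown as one example).
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
# For a single-node cluster, allow workloads on the control-plane node.
kubectl taint nodes --all node-role.kubernetes.io/master-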
Specific Kubernetes Variants
- If you are installing Kubernetes using Rancher, please review the extra step here.
- Run:AI works with OpenShift. For specific installation instructions for OpenShift, please contact Run:AI customer support.
Warning
Run:AI customizes the NVIDIA Kubernetes device plugin. Do not install this plugin yourself; Run:AI installs it for you.
Step 3: Install Run:AI
The following steps assume that the Kubernetes command-line tool, kubectl, is installed on your machine and configured to point to a functioning Kubernetes cluster.
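Before proceeding, it is worth confirming that kubectl can reach the cluster and that your GPU nodes are registered:
kubectl get nodes
kubectl version --short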
Step 3.1: Install Run:AI
- Log in to Run:AI Admin UI at https://app.run.ai. Use credentials provided by Run:AI Customer Support.
- If no clusters are configured, you will see a dialog with instructions on how to install a Run:AI cluster.
- If a cluster has already been configured, open the menu at the top left and select "Clusters". At the top right, click "Add New Cluster".
Please read the next section before proceeding.
Step 3.2: Customize Installation
The Run:AI Admin UI cluster creation wizard asks you to download a YAML file named runai-operator-<cluster-name>.yaml, which you must then apply to Kubernetes (as shown below). Before applying it, you may need to edit the file. Examples:
- Set aside an IP address for ingress access to containers (e.g. for Jupyter Notebooks, PyCharm, Visual Studio Code). See: Allow external access to Containers. Note that you can access containers via port forwarding without requiring an ingress point.
- Allow outbound internet connectivity in a proxied environment. See: Installing Run:AI with an Internet Proxy Server.
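When the file is ready, apply it to the cluster with kubectl (the file name follows the wizard's convention shown above):
kubectl apply -f runai-operator-<cluster-name>.yaml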
Step 4: Verify your Installation
- Go to https://app.run.ai/dashboards/now.
- Verify that the number of GPUs at the top right reflects the GPU resources in your cluster, and that the list of machines with GPU resources appears at the bottom of the page.
For a more extensive verification of cluster health, see Determining the health of a cluster.
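You can also check from the command line that the Run:AI components started successfully. The runai namespace below is an assumption based on the default installation; verify it against your configuration:
kubectl get pods -n runai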
Step 5: (Optional) Set Node Roles
When installing a production cluster you may want to:
- Set one or more Run:AI system nodes. These are nodes dedicated to Run:AI software.
- Machine learning workloads frequently include jobs that need CPUs but not GPUs. You may want to direct these jobs to dedicated nodes without GPUs, so that they do not overload the GPU machines.
- Limit Run:AI to specific nodes in the cluster.
To perform these tasks, see Set Node Roles.
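Node roles are typically assigned by labeling nodes. The label key below is hypothetical and only illustrates the kubectl mechanics; the actual keys are listed in Set Node Roles:
# Hypothetical label key - see Set Node Roles for the real values.
kubectl label node <node-name> example.run.ai/node-role=runai-system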
Next Steps
- Set up Admin UI users: see Working with Admin UI Users.
- Set up projects: see Working with Projects.
- Set up Researchers to work with the Run:AI command-line interface (CLI). See Installing the Run AI Command-line Interface for how to install the CLI for users.
- Set up Project-based Researcher Access Control.