Before proceeding with this document, please review the installation types documentation to understand the difference between air-gapped and connected installations.
(Production only) Run:AI System Nodes: To reduce downtime and avoid spending CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker machines designated for Run:AI software. The nodes do not have to be dedicated to Run:AI, but for Run:AI purposes each would need:
- 4 CPUs
- 8GB of RAM
- 120GB of disk space
The backend installation of Run:AI will require the configuration of Kubernetes Persistent Volumes of a total size of 110GB.
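As a minimal sketch of how that storage could be provisioned, the following generates a single hostPath PersistentVolume manifest covering the full 110GB. The volume name, path, and single-volume layout are illustrative assumptions; production deployments typically split this across several networked volumes:

```shell
# Sketch only: a single hostPath PersistentVolume sized for the Run:AI
# backend requirement. Name and path are assumptions, not Run:AI defaults.
cat > runai-backend-pv.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: runai-backend-pv
spec:
  capacity:
    storage: 110Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /opt/runai-backend
EOF
# Review the manifest, then: kubectl apply -f runai-backend-pv.yaml
```
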
Run:AI Software Prerequisites
- You should receive a single file `runai-<version>.tar` from Run:AI Customer Support.
- You should receive a file `runai-gcr-secret.yaml` from Run:AI Customer Support. The file provides access to the Run:AI container registry.
Run:AI requires Kubernetes 1.19 or above. Kubernetes 1.21 is recommended (as of September 2021). Kubernetes 1.22 is not supported.
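A quick way to confirm your cluster falls inside the supported range (1.19–1.21). The hardcoded version string below is an assumption for illustration; on a live cluster it would come from `kubectl version --short`:

```shell
# Sketch: check that the Kubernetes minor version is within the supported
# 1.19-1.21 range. K8S_VERSION is hardcoded for illustration; on a real
# cluster, obtain it with: kubectl version --short
K8S_VERSION="v1.21.4"
MINOR=$(echo "$K8S_VERSION" | cut -d. -f2)
if [ "$MINOR" -ge 19 ] && [ "$MINOR" -le 21 ]; then
  SUPPORTED=yes
  echo "Kubernetes $K8S_VERSION is supported"
else
  SUPPORTED=no
  echo "Kubernetes $K8S_VERSION is not supported"
fi
```
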
If you are using OpenShift, please refer to our OpenShift installation instructions.
Run:AI supports Kubernetes Pod Security Policy, if used.
Run:AI requires the installation of NVIDIA software. This can be done in one of two ways:
- (Recommended) Use the NVIDIA GPU Operator on Kubernetes. To install the NVIDIA GPU Operator use the Getting Started guide. Follow the Helm based Installation.
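For reference, the Helm-based installation from NVIDIA's Getting Started guide looks roughly like the following. It is written to a script here so it can be reviewed before running; check the current guide for the exact chart name and options:

```shell
# Sketch of the Helm-based GPU Operator installation, following NVIDIA's
# Getting Started guide. Verify against the current guide before running.
cat > install-gpu-operator.sh <<'EOF'
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace
EOF
# Review, then run: sh install-gpu-operator.sh
```
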
- For each GPU node in the cluster, install the NVIDIA CUDA drivers as well as the software stack described here
The Run:AI Cluster installation installs Prometheus. However, it can also connect to an existing Prometheus installed by the organization. In the latter case, it's important to:
- Verify that both Prometheus Node Exporter and kube-state-metrics are installed. Both are part of the default Prometheus installation.
- Understand how Prometheus was installed: directly, or using the Prometheus Operator. The distinction will become important during the Run:AI cluster installation.
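One way to make the direct-vs-Operator distinction is to look for the Operator's CRDs. In this sketch, `CRD_LIST` is a hardcoded sample; on a live cluster it would come from `kubectl get crd -o name`:

```shell
# Sketch: detect a Prometheus Operator installation by its CRDs.
# CRD_LIST is a sample value; on a real cluster use:
#   CRD_LIST=$(kubectl get crd -o name)
CRD_LIST="customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com"
if echo "$CRD_LIST" | grep -q 'prometheuses.monitoring.coreos.com'; then
  PROM_MODE=operator
else
  PROM_MODE=direct
fi
echo "Prometheus installation type: $PROM_MODE"
```
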
The Run:AI Cluster installation installs Kubernetes Node Feature Discovery (NFD) and NVIDIA GPU Feature Discovery (GFD). If your cluster has these dependencies already installed, you can use installation flags to prevent Run:AI from installing these dependencies.
- Shared Storage. Network address and a path to a folder in a Network File System.
    - All Kubernetes cluster nodes should be able to mount NFS folders. Usually, this requires the installation of the `nfs-common` package on all machines (`sudo apt install nfs-common` or similar).
- IP Address. An available, internal IP address that is accessible from Run:AI users' machines (this address is referenced later in the installation).
- DNS entry. Create a DNS A record such as `runai.<company-name>` or similar. The A record should point to the IP address above.
- A certificate for the endpoint. The certificate(s) must be signed by the organization's root CA.
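To confirm that an endpoint certificate actually chains to the organization's root CA, `openssl verify` can be used. The file names below are assumptions, and a throwaway CA and certificate are generated first purely so the check can run end to end:

```shell
# Generate a throwaway root CA and endpoint certificate so the verification
# step below can be demonstrated; in practice rootCA.pem and runai.crt come
# from your organization's PKI.
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out rootCA.pem \
  -subj "/CN=Example Root CA" -days 1
openssl req -newkey rsa:2048 -nodes -keyout runai.key -out runai.csr \
  -subj "/CN=runai.example.com"
openssl x509 -req -in runai.csr -CA rootCA.pem -CAkey ca.key \
  -CAcreateserial -out runai.crt -days 1
# The actual prerequisite check: should print "runai.crt: OK"
openssl verify -CAfile rootCA.pem runai.crt
```
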
The machine running the installation script (typically the Kubernetes master) must have:
- At least 50GB of free space.
- Docker installed.
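A small preflight sketch for that machine, mirroring the two requirements above (the `df` flags assume GNU coreutils, i.e. a typical Linux host):

```shell
# Preflight sketch for the installer machine: free disk and Docker presence.
FREE_GB=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
if [ "$FREE_GB" -ge 50 ]; then
  echo "disk: OK (${FREE_GB}GB free)"
else
  echo "disk: insufficient (${FREE_GB}GB free, need 50GB)"
fi
if command -v docker >/dev/null 2>&1; then
  echo "docker: installed"
else
  echo "docker: missing"
fi
```
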
- (Airgapped installation only) Private Docker Registry. Run:AI assumes the existence of a Docker registry for images, most likely installed within the organization. The installation requires the network address and port of the registry (referenced later in the installation).
- (Optional) SAML Integration.