Before proceeding with this document, please review the installation types documentation to understand the difference between air-gapped and connected installations.
(Production only) Run:ai System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker machines designated for Run:ai software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes they need:
- 4 CPUs
- 8GB of RAM
- 120GB of Disk space
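As a rough illustration of the sizing above, the following Python sketch checks whether a set of candidate worker machines includes at least two that meet the listed minimums. It assumes the figures apply to each machine individually, which the text does not state explicitly. The `Node` record and all function names are hypothetical; the actual figures would come from your own inventory.

```python
from dataclasses import dataclass

# Minimums taken from the list above (assumed to apply per machine).
MIN_CPUS = 4
MIN_RAM_GB = 8
MIN_DISK_GB = 120
MIN_SYSTEM_NODES = 2  # "two or more worker machines"


@dataclass
class Node:
    """Hypothetical inventory record for one worker machine."""
    name: str
    cpus: int
    ram_gb: float
    disk_gb: float


def eligible_system_nodes(nodes):
    """Return the nodes that satisfy every minimum."""
    return [
        n for n in nodes
        if n.cpus >= MIN_CPUS
        and n.ram_gb >= MIN_RAM_GB
        and n.disk_gb >= MIN_DISK_GB
    ]


def enough_system_nodes(nodes):
    """True when at least MIN_SYSTEM_NODES machines qualify."""
    return len(eligible_system_nodes(nodes)) >= MIN_SYSTEM_NODES
```

Calling `enough_system_nodes` on an inventory with only one qualifying machine returns `False`, which flags a deployment that falls short of the production recommendation.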
The control plane installation of Run:ai requires the configuration of Kubernetes Persistent Volumes of a total size of 110GB.
Run:ai Software Prerequisites
You should receive a file, runai-gcr-secret.yaml, from Run:ai Customer Support. The file provides access to the Run:ai container registry.
You should receive a single file, runai-<version>.tar, from Run:ai Customer Support.
Run:ai requires Kubernetes. Supported versions are 1.21 through 1.24. Kubernetes 1.25 is not yet supported.
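The supported range can be checked mechanically. The sketch below is a minimal example, assuming you already have the cluster's version string (for instance, copied from the output of `kubectl version`); the function names are illustrative and not part of any Run:ai tooling.

```python
import re

# Supported range stated above: Kubernetes 1.21 through 1.24.
MIN_SUPPORTED = (1, 21)
MAX_SUPPORTED = (1, 24)


def parse_major_minor(version):
    """Extract (major, minor) from strings such as 'v1.23.5' or '1.24+'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(version):
    """True when the version falls inside the supported range."""
    return MIN_SUPPORTED <= parse_major_minor(version) <= MAX_SUPPORTED
```

Comparing `(major, minor)` tuples rather than raw strings avoids lexical ordering mistakes and ignores the patch component, since the text names only minor versions.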
If you are using OpenShift, please refer to our OpenShift installation instructions.
Run:ai supports Kubernetes Pod Security Policy, if used.
Run:ai requires the installation of NVIDIA software. See installation details here.
The Run:ai Cluster installation installs Prometheus. However, it can also connect to an existing Prometheus installed by the organization. In the latter case, it's important to:
- Verify that both Prometheus Node Exporter and kube-state-metrics are installed. Both are part of the default Prometheus installation.
- Understand how Prometheus has been installed: directly, or using the Prometheus Operator. The distinction will become important during the Run:ai Cluster installation.
- Shared Storage. A network address and a path to a folder in a Network File System (NFS).
- All Kubernetes cluster nodes should be able to mount NFS folders. Usually, this requires the installation of the nfs-common package on all machines (sudo apt install nfs-common or similar).
- IP Address. An available, internal IP Address that is accessible from Run:ai Users' machines (referenced below as
- DNS entry. Create a DNS A record such as runai.<company-name> or similar. The A record should point to
- A certificate for the endpoint. The certificate(s) must be signed by the organization's root CA.
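Once the A record exists, it can be worth confirming that the name resolves to the IP address you reserved. The following is a minimal sketch using Python's standard `socket` module; the function name, the `resolve` parameter, and the hostname and address in the usage note are assumptions made for illustration.

```python
import socket


def a_record_points_to(hostname, expected_ip, resolve=socket.gethostbyname_ex):
    """Return True when `hostname` resolves to `expected_ip`.

    `resolve` defaults to the system resolver and can be swapped out
    (for example, in tests) with any callable returning the same
    (name, aliases, addresses) triple.
    """
    try:
        _name, _aliases, addresses = resolve(hostname)
    except OSError:
        # Covers socket.gaierror and socket.herror: the name did not resolve.
        return False
    return expected_ip in addresses
```

For example, `a_record_points_to("runai.example.internal", "10.0.0.5")` would query the system resolver; both values here are placeholders. The check should be run from a machine on the same network as the Run:ai users, because internal records may not resolve elsewhere.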
The machine running the installation script (typically the Kubernetes master) must have:
- At least 50GB of free space.
- Docker installed.
- (Airgapped installation only) Private Docker Registry. Run:ai assumes the existence of a Docker registry for images, most likely installed within the organization. The installation requires the network address and port for the registry (referenced below as
- (Optional) SAML Integration.
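The two requirements for the installation machine (free disk space and Docker) can be checked with the Python standard library. This is a hedged sketch: the text does not say which filesystem needs the 50GB, so the `path` argument is an assumption, and finding `docker` on `PATH` does not prove that the Docker daemon is running. All function names are illustrative.

```python
import shutil

MIN_FREE_GB = 50  # free-space minimum from the list above


def free_space_gb(path="."):
    """Free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 10**9


def docker_on_path():
    """True when a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def check_install_host(path=".", min_free_gb=MIN_FREE_GB):
    """Return a dict mapping each check name to a pass/fail boolean."""
    return {
        "free_space": free_space_gb(path) >= min_free_gb,
        "docker": docker_on_path(),
    }
```

Running `check_install_host()` from the directory where the installation files will be unpacked gives a quick pass/fail summary before starting the installation.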
Once you believe that the Run:ai prerequisites are met, we highly recommend installing and running the Run:ai pre-install diagnostics script. The tool:
- Tests the below requirements as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
- Looks at additional components installed and analyzes their relevancy to a successful Run:ai installation.
To use the script, download the latest version and run it.
If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:ai, locate the file runai-preinstall-diagnostics.txt in the current directory and send it to Run:ai technical support.
For more information on the script including additional command-line flags, see here.