
Prerequisites

Before proceeding with this document, please review the installation types documentation to understand the difference between air-gapped and connected installations.

Hardware Requirements

(Production only) Run:AI System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker nodes designated for the Run:AI software. The nodes do not have to be dedicated to Run:AI, but for Run:AI purposes they must provide:

  • 4 CPUs
  • 8GB of RAM
  • 120GB of Disk space

The backend installation of Run:AI requires the configuration of Kubernetes Persistent Volumes with a total size of 110GB.
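
If the cluster does not use dynamic provisioning through a storage class, the Persistent Volumes can be backed by the NFS share described in the Network section below. The following is a minimal sketch only; the volume name, server address, and path are illustrative placeholders, not Run:AI deliverables:

# Hypothetical example: an NFS-backed Persistent Volume for the Run:AI backend
# Replace <NFS_SERVER> and /srv/runai with your organization's NFS details
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: runai-backend-pv
spec:
  capacity:
    storage: 110Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: <NFS_SERVER>
    path: /srv/runai
EOF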

Run:AI Software Prerequisites

You should receive the file runai-gcr-secret.yaml from Run:AI Customer Support. The file provides access to the Run:AI container registry.
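
The file is typically applied to the cluster as part of the installation. As a quick prerequisite check, you can confirm that the file parses correctly without creating anything:

# Client-side dry run: validates the manifest without applying it
kubectl apply --dry-run=client -f runai-gcr-secret.yaml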

You should also receive a single file, runai-<version>.tar, from Run:AI Customer Support.
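
Before starting the installation, you can verify that the archive arrived intact by listing its contents (the file name below is a placeholder for the actual version you received):

# List the contents of the Run:AI package without extracting it
tar -tf runai-<version>.tar | head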

Kubernetes

Run:AI requires Kubernetes 1.19 or above. Kubernetes 1.21 is recommended (as of September 2021). Kubernetes 1.22 is not supported.
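
To confirm that the control plane and all nodes run a supported version, a quick check such as the following can be used:

# Control-plane version
kubectl version --short
# The kubelet version of every node should also be within the supported range
kubectl get nodes -o wide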

If you are using OpenShift, please refer to our OpenShift installation instructions.

Run:AI supports Kubernetes Pod Security Policy if it is used in the cluster.

NVIDIA Prerequisites

Run:AI requires the installation of NVIDIA software. See the installation details here.
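
The exact components to install are covered in the linked instructions. As a basic sanity check that the NVIDIA driver is functional on a GPU node, you can run:

# Run on a GPU worker node: lists the GPUs visible to the NVIDIA driver
nvidia-smi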

Kubernetes Dependencies

Prometheus

The Run:AI cluster installation installs Prometheus. However, it can also connect to an existing Prometheus installation provided by the organization. In the latter case, additional configuration is required during cluster installation.
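
To check whether a Prometheus installation already exists in the cluster, a simple search over the running pods is usually sufficient (namespace and release names vary between organizations):

# Look for an existing Prometheus deployment anywhere in the cluster
kubectl get pods -A | grep -i prometheus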

Feature Discovery

The Run:AI cluster installation installs Kubernetes Node Feature Discovery (NFD) and NVIDIA GPU Feature Discovery (GFD). If your cluster already has these components installed, you can use installation flags to prevent Run:AI from installing them again.
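
To check whether these components are already present before choosing the installation flags, you can search for their DaemonSets (the exact names may differ depending on how they were installed):

# Look for existing Node Feature Discovery / GPU Feature Discovery DaemonSets
kubectl get daemonsets -A | grep -iE 'node-feature-discovery|gpu-feature-discovery'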

Network

  • Shared storage. A network address and a path to a folder on a Network File System (NFS) server.
  • All Kubernetes cluster nodes should be able to mount NFS folders. This usually requires installing the nfs-common package on all machines (sudo apt install nfs-common or similar); a mount check is sketched after this list.
  • IP address. An available internal IP address that is accessible from Run:AI users' machines (referenced below as <RUNAI_IP_ADDRESS>).
  • DNS entry. Create a DNS A record such as runai.<company-name> or similar. The A record should point to <RUNAI_IP_ADDRESS>.
  • A certificate for the endpoint. The certificate(s) must be signed by the organization's root CA.
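
The DNS entry and the NFS share can be verified from any cluster node. The commands below are a sketch only; the server address, export path, and mount point are placeholders:

# The A record should resolve to <RUNAI_IP_ADDRESS>
nslookup runai.<company-name>

# Verify that an NFS folder can be mounted (assumes nfs-common is installed)
sudo mkdir -p /mnt/runai-test
sudo mount -t nfs <NFS_SERVER>:<NFS_PATH> /mnt/runai-test && sudo umount /mnt/runai-test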

Installer Machine

The machine running the installation script (typically the Kubernetes master) must have the following (a quick verification is sketched after the list):

  • At least 50GB of free space.
  • Docker installed.
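
Both requirements can be verified quickly on the installer machine:

# Free disk space on the root filesystem (at least 50GB should be available)
df -h /
# Confirms that Docker is installed and the daemon is running
docker info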

Other

  • (Air-gapped installation only) Private Docker registry. Run:AI assumes the existence of a Docker registry for images, typically installed within the organization. The installation requires the network address and port of the registry (referenced below as <REGISTRY_URL>); a reachability check is sketched after this list.
  • (Optional) SAML Integration.
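
(Air-gapped installation only) The private registry's reachability and credentials can be checked from the installer machine, for example:

# Verify that the private registry is reachable and that you can authenticate to it
docker login <REGISTRY_URL>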

Pre-install Script

Once you believe that the Run:AI prerequisites are met, we highly recommend installing and running the Run:AI pre-install diagnostics script. The tool:

  • Tests the requirements above as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
  • Looks at additional installed components and analyzes their relevance to a successful Run:AI installation.

To use the script, download its latest version and run:

chmod +x preinstall-diagnostics-<platform>
./preinstall-diagnostics-<platform> --domain <dns-entry>

If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:AI, locate the file runai-preinstall-diagnostics.txt in the current directory and send it to Run:AI technical support.

For more information on the script, including additional command-line flags, see here.

