

Below are the prerequisites of a cluster installed with Run:AI.

Software Requirements


Run:AI has been tested with the following certified Kubernetes distributions:

Target Platform | Description | Notes
Vanilla Kubernetes | Using no specific distribution but rather k8s native installation |
EKS | Amazon Elastic Kubernetes Service |
AKS | Azure Kubernetes Service |
GKE | Google Kubernetes Engine |
OCP | OpenShift Container Platform | The Run:AI operator is certified for OpenShift by Red Hat.
RKE | Rancher Kubernetes Engine | When installing Run:AI, select On Premise. You must perform the mandatory extra step here.
Ezmeral | HPE Ezmeral Container Platform | See Run:AI at Ezmeral marketplace
Tanzu | VMware Kubernetes | Tanzu supports containerd rather than docker. The NVIDIA prerequisites need to take this into account. See more below.

A full list of Kubernetes partners can be found here. In addition, Run:AI provides instructions for a simple (non-production-ready) Kubernetes installation.


  • Run:AI requires Kubernetes 1.19 or above. Kubernetes 1.21 is recommended.
  • Kubernetes recommends using systemd as the container runtime cgroup driver. Kubernetes 1.22 and above default to systemd.
  • If you are using Red Hat OpenShift, Run:AI requires OpenShift 4.6 or later.
  • Run:AI supports Kubernetes Pod Security Policy if used.
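
The version floor above can be checked before installation. The following is a minimal sketch, not a Run:AI tool: version_at_least and check_cluster_versions are hypothetical helper names, the comparison assumes a sort that supports -V (for example GNU coreutils), and the kubelet version reported by each node is used as a stand-in for the cluster's Kubernetes version.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of Run:AI.
# Assumes a `sort` that supports -V (version sort), e.g. GNU coreutils.

# version_at_least ACTUAL MINIMUM
# Succeeds (exit 0) when ACTUAL is the same as or newer than MINIMUM.
version_at_least() {
  local actual="${1#v}" minimum="${2#v}"   # tolerate a leading "v" as in "v1.21.3"
  local lowest
  lowest="$(printf '%s\n%s\n' "$minimum" "$actual" | sort -V | head -n 1)"
  [ "$lowest" = "$minimum" ]
}

# Compare every node's kubelet version with the documented minimum (1.19).
# Assumption: the kubelet version is a reasonable proxy for the cluster version.
check_cluster_versions() {
  local v
  for v in $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}'); do
    if version_at_least "$v" 1.19; then
      echo "OK: $v"
    else
      echo "Below the Run:AI minimum: $v"
    fi
  done
}
```

Sourcing the file and calling check_cluster_versions prints one line per node. The same helper can be reused for the OpenShift floor, for example version_at_least "$ocp_version" 4.6, where ocp_version is a value you supply.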


NVIDIA prerequisites are provided in detail in the NVIDIA documentation. The following provides a walkthrough of the documentation steps:

There are two alternatives for installing NVIDIA prerequisites:

  1. (Recommended) Install the NVIDIA GPU Operator on Kubernetes. This single installation contains the required NVIDIA drivers and software for all nodes within the Kubernetes cluster containing NVIDIA GPUs.
  2. Install NVIDIA software on each node separately.


  • If you are using DGX OS then NVIDIA prerequisites are already installed and you may skip to the next step.
  • The combination of NVIDIA A100 hardware and the CoreOS operating system (which is popular when using OpenShift) works only with option 1, using GPU Operator version 1.8 or higher.

If you chose option 1 (NVIDIA GPU Operator), follow the GPU Operator Getting Started guide.

  • Note that the document contains a separate section in the case where the NVIDIA CUDA Toolkit is already installed on the nodes.
  • To work with containerd (e.g. for Tanzu), change the defaultRuntime accordingly.
  • Run:AI has customized the NVIDIA device plugin for Kubernetes and NVIDIA DCGM Exporter. Run the following to disable the existing plug-ins:
kubectl -n gpu-operator-resources patch daemonset nvidia-device-plugin-daemonset \
-p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter \
-p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
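
The two patch commands above work by giving each daemonset a nodeSelector that no node satisfies, so it schedules no pods. The sketch below wraps the same patch in a helper and adds a way to undo it. The function names are hypothetical, and the undo path relies on the assumption that kubectl's default patch type removes a map key when its value is null.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers built around the patch shown above.
NAMESPACE="gpu-operator-resources"

# selector_patch VALUE
# Prints the patch body. VALUE is raw JSON for the "non-existing" key:
#   '"true"' adds the never-matching selector (disables the daemonset)
#   'null'   asks kubectl to remove the key again (assumed patch semantics)
selector_patch() {
  printf '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": %s}}}}}' "$1"
}

# set_daemonset_enabled NAME on|off
set_daemonset_enabled() {
  local name="$1" value='"true"'
  [ "$2" = "on" ] && value='null'
  kubectl -n "$NAMESPACE" patch daemonset "$name" -p "$(selector_patch "$value")"
}
```

For example, set_daemonset_enabled nvidia-device-plugin-daemonset off issues the same patch as the first command above. Running kubectl -n gpu-operator-resources get daemonset afterwards is one way to confirm that the patched daemonsets no longer schedule pods.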

If you chose option 2 (installing NVIDIA software on each node separately), install the NVIDIA dependencies described in this guide.

  • Perform the sections Install NVIDIA Drivers and Install NVIDIA Container Toolkit (nvidia-docker2).
  • Do not perform the section Install NVIDIA Device Plugin.
  • Note the differentiation between containerd and docker.
  • The instructions relate to Ubuntu and link to other operating systems.


Run:AI requires Prometheus. The Run:AI Cluster installation will, by default, install Prometheus, but it can also connect to an existing Prometheus instance installed by the organization. In the latter case, it's important to:

Distributed Training via Kubeflow MPI

Distributed training is the ability to run workloads on multiple nodes (not just multiple GPUs on the same node). Run:AI provides this capability via Kubeflow MPI. If you need this functionality, you will need to install the Kubeflow MPI Operator. Use the following installation guide. Note that Run:AI does not currently support the latest MPI version, only version v0.2.3.

As per instructions:

  • Clone tag v0.2.3 (and not master)
  • vi mpi-operator/deploy/v1alpha2/mpi-operator.yaml, search for image and change it to mpioperator/mpi-operator:v0.2.3.
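
The two steps above can also be scripted instead of edited by hand. This is a sketch under stated assumptions: the tag and manifest path come from the steps above, while the repository URL, the exact shape of the image line in the manifest, and the final kubectl apply step are assumptions to verify against the installation guide. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. MPI_TAG and MPI_MANIFEST come from the steps above;
# MPI_REPO and the layout of the "image:" line are assumptions.
MPI_TAG="v0.2.3"
MPI_REPO="https://github.com/kubeflow/mpi-operator.git"
MPI_MANIFEST="mpi-operator/deploy/v1alpha2/mpi-operator.yaml"

# pin_mpi_operator_image FILE
# Rewrites any "image:" line that references mpi-operator so that it points
# at the pinned tag. The previous content is kept as FILE.bak.
pin_mpi_operator_image() {
  sed -E -i.bak \
    "s|^([[:space:]]*(- )?image:[[:space:]]*)[^[:space:]]*mpi-operator[^[:space:]]*|\1mpioperator/mpi-operator:${MPI_TAG}|" \
    "$1"
}

# Clone the tag (not master), pin the image, then apply the manifest.
install_mpi_operator() {
  git clone --branch "$MPI_TAG" --depth 1 "$MPI_REPO" &&
    pin_mpi_operator_image "$MPI_MANIFEST" &&
    kubectl apply -f "$MPI_MANIFEST"
}
```

Review the diff between the manifest and its .bak copy before applying, since the substitution touches every image line that mentions mpi-operator.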


Kubeflow MPI requires containers to run as root, which will not work well when running on OpenShift or when PodSecurityPolicy is enabled in Kubernetes.

Hardware Requirements

(see picture below)

  • (Production only) Run:AI System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker machines designated for Run:AI software. The nodes do not have to be dedicated to Run:AI, but for Run:AI purposes we would need:

    • 4 CPUs
    • 8GB of RAM
    • 50GB of Disk space
  • Shared data volume: Run:AI uses Kubernetes to abstract away the machine on which a container is running:

    • Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.
    • The Run:AI system needs to save data on a storage device that is not dependent on a specific node.

    Typically, this is achieved via Network File System (NFS) or network-attached storage (NAS).

  • Docker Registry: With Run:AI, workloads are based on Docker images. For container images to run on any machine, these images must be downloaded from a Docker registry rather than reside on the local machine (though this is also possible). You can use a public registry such as Docker Hub or set up a local registry on-prem (preferably on a dedicated machine). Run:AI can assist with setting up the repository.

  • Kubernetes: Production Kubernetes installation requires separate nodes for the Kubernetes master. For more details see your specific Kubernetes distribution documentation.
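
The Run:AI system node figures listed above (4 CPUs, 8GB of RAM, 50GB of disk space) can be compared against a candidate machine. The following is an illustrative sketch with hypothetical helper names. It assumes a Linux node with GNU df, treats the disk figure as free space on the root filesystem, and rounds memory up because reported memory is usually slightly below the nominal size.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers comparing a node with the documented
# system-node figures. Not a Run:AI tool.
MIN_CPUS=4
MIN_RAM_GB=8
MIN_DISK_GB=50

# meets_system_node_minimums CPUS RAM_GB DISK_GB   (whole numbers)
# Succeeds only when all three values reach the documented figures.
meets_system_node_minimums() {
  [ "$1" -ge "$MIN_CPUS" ] && [ "$2" -ge "$MIN_RAM_GB" ] && [ "$3" -ge "$MIN_DISK_GB" ]
}

# Collect the local machine's figures and run the comparison.
check_this_node() {
  local cpus ram_gb disk_gb
  cpus="$(nproc)"
  # MemTotal is reported in kB; round up to whole GB.
  ram_gb="$(awk '/^MemTotal:/ {print int(($2 + 1048575) / 1048576)}' /proc/meminfo)"
  # Free space on the root filesystem in GB (assumes GNU df).
  disk_gb="$(df -BG --output=avail / | tail -n 1 | tr -dc '0-9')"
  meets_system_node_minimums "$cpus" "$ram_gb" "$disk_gb"
}
```

Run check_this_node on each machine intended as a system node. A non-zero exit status means at least one figure is below the documented value.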


User requirements

Usage of containers and images: The individual Researcher's work should be based on container images.

Network Requirements

Internal networking: Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. Consult the documentation of the specific add-on to find out which ports to open. Note, however, that unless special provisions are made, Kubernetes assumes all cluster nodes can interconnect using all ports.

Outbound network: The Run:AI user interface runs from the cloud. All container nodes must be able to connect to the Run:AI cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is proxied or limited, the following exceptions should be applied:

During Installation

Run:AI requires an installation over the Kubernetes cluster. The installation accesses the web to download images and packages from various registries and repositories. Some organizations place limitations on what you can pull from the internet. The following list shows the various solution components and their origin:

Name | Description | URLs | Ports
Run:AI Repository | The Run:AI Package Repository is hosted on Run:AI’s account on Google Cloud | |
Docker Images Repository | Various Run:AI images | |
Docker Images Repository | Various third-party images | |

Post Installation

In addition, once running, Run:AI will send metrics to two sources:

Name | Description | URLs | Ports
Grafana Metrics Server | | |
Run:AI Cloud instance | | |
Authentication Provider | | |


Pre-install Script

Once you believe that the Run:AI prerequisites are met, we highly recommend installing and running the Run:AI pre-install diagnostics script. The tool:

  • Tests the requirements above as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
  • Looks at additional components installed and analyzes their relevancy to a successful Run:AI installation.

To use the script, download the latest version of the script and run:

chmod +x preinstall-diagnostics-<platform>
./preinstall-diagnostics-<platform>

If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:AI, locate the file runai-preinstall-diagnostics.txt in the current directory and send it to Run:AI technical support.

For more information on the script including additional command-line flags, see here.

Last update: January 19, 2022