Prerequisites

Below are the prerequisites for installing Run:ai on a cluster.

Software Requirements

Kubernetes

Run:ai has been tested with the following certified Kubernetes distributions:

| Target Platform | Description | Notes |
| --- | --- | --- |
| Vanilla Kubernetes | Native Kubernetes installation, not tied to a specific distribution | |
| EKS | Amazon Elastic Kubernetes Service | |
| AKS | Azure Kubernetes Service | |
| GKE | Google Kubernetes Engine | GKE has a different software stack for NVIDIA. To install Run:ai on GKE, please contact customer support. |
| OCP | OpenShift Container Platform | The Run:ai operator is certified for OpenShift by Red Hat. |
| RKE | Rancher Kubernetes Engine | When installing Run:ai, select On Premise. You must perform the mandatory extra step here. |
| Ezmeral | HPE Ezmeral Container Platform | See Run:ai at the Ezmeral marketplace. |
| Tanzu | VMware Kubernetes | Tanzu supports containerd rather than Docker. See the NVIDIA prerequisites below as well as cluster customization for the changes required for containerd. |
| Canonical Kubernetes | a.k.a. Charmed Kubernetes | |

A full list of Kubernetes partners can be found here: https://kubernetes.io/docs/setup/. In addition, Run:ai provides instructions for a simple (non-production-ready) Kubernetes installation.

Notes

  • Run:ai requires Kubernetes. Supported versions are 1.19 through 1.23.
  • Kubernetes recommends using systemd as the container runtime cgroup driver. Kubernetes 1.22 and above defaults to systemd.
  • If you are using Red Hat OpenShift, Run:ai supports OpenShift 4.6 through 4.9.
  • Run:ai supports Kubernetes Pod Security Policy, if used.
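
A quick way to verify the Kubernetes version and the container runtime in use (a sketch, assuming kubectl is already configured against the cluster):

# Server version should be between 1.19 and 1.23
kubectl version --short
# The CONTAINER-RUNTIME column shows the runtime (docker:// or containerd://) for each node
kubectl get nodes -o wide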

NVIDIA

Important

  • If you are using DGX OS then NVIDIA prerequisites are already installed and you may skip to the next step.
  • The combination of NVIDIA A100 hardware and the CoreOS operating system (which is popular when using OpenShift) will only work with GPU Operator version 1.8 or higher.

Follow the Getting Started guide to install the NVIDIA GPU Operator version 1.9 or higher.

  • We recommend using the default namespace gpu-operator. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.
  • Note that the NVIDIA document contains a separate section for the case where the NVIDIA CUDA Toolkit is already installed on the nodes.
  • To work with containerd (e.g. for Tanzu), change the defaultRuntime accordingly (see the installation sketch after the commands below).
For Run:ai version 2.3 or earlier, also run the following commands:

# Add a nodeSelector that matches no nodes, which effectively disables these
# GPU Operator daemonsets
kubectl -n gpu-operator patch daemonset nvidia-device-plugin-daemonset \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'

(Use the namespace gpu-operator-resources if the NVIDIA GPU Operator is version 1.8 or earlier.)
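
For reference, a minimal Helm invocation for the GPU Operator might look as follows. This is a sketch only: it uses NVIDIA's public Helm repository and chart, and the operator.defaultRuntime value applies to operator versions that still expose it (newer versions auto-detect the runtime). Follow the NVIDIA Getting Started guide for the authoritative steps.

# Add NVIDIA's Helm repository and install the GPU Operator into the default gpu-operator namespace
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# The --set flag is only needed for containerd-based clusters (e.g. Tanzu)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set operator.defaultRuntime=containerd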

Prometheus

The Run:ai cluster installation will, by default, install Prometheus, but it can also connect to an existing Prometheus instance installed by the organization. In the latter case, additional configuration is required during the cluster installation.
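
To check whether a Prometheus installation already exists in the cluster, a quick sketch (assuming the common Prometheus Operator CRDs are in use):

# Prometheus Operator CRDs indicate an existing installation
kubectl get crds | grep monitoring.coreos.com
# Look for running Prometheus pods in any namespace
kubectl get pods -A | grep -i prometheus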

Distributed Training via Kubeflow MPI

Distributed training is the ability to run workloads on multiple nodes (not just multiple GPUs on the same node). Run:ai provides this capability via Kubeflow MPI. If you need this functionality, install the Kubeflow MPI Operator using the following installation guide. Note that Run:ai does not currently support the latest MPI Operator version, only version v0.2.3.

As per the instructions (a consolidated command sketch follows this list):

  • Verify that the mpijob custom resource does not currently exist in the cluster by running kubectl get crds | grep mpijobs. If it does, delete it by running kubectl delete crd mpijobs.kubeflow.org.
  • Clone tag v0.2.3 (and not master).
  • Edit mpi-operator/deploy/v1alpha2/mpi-operator.yaml:
    • Search for mpioperator/mpi-operator:latest and change it to mpioperator/mpi-operator:v0.2.3.
    • Search for mpioperator/kubectl-delivery:latest and change it to mpioperator/kubectl-delivery:v0.2.3.
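
The following is a consolidated sketch of the steps above. It assumes git, sed, and kubectl are available locally; the final apply step follows the upstream installation guide.

# Remove a pre-existing mpijob CRD, if one exists
kubectl get crds | grep mpijobs && kubectl delete crd mpijobs.kubeflow.org
# Clone tag v0.2.3 (not master)
git clone --branch v0.2.3 https://github.com/kubeflow/mpi-operator.git
# Pin the operator images to v0.2.3
sed -i 's#mpioperator/mpi-operator:latest#mpioperator/mpi-operator:v0.2.3#' mpi-operator/deploy/v1alpha2/mpi-operator.yaml
sed -i 's#mpioperator/kubectl-delivery:latest#mpioperator/kubectl-delivery:v0.2.3#' mpi-operator/deploy/v1alpha2/mpi-operator.yaml
# Deploy the MPI Operator
kubectl apply -f mpi-operator/deploy/v1alpha2/mpi-operator.yaml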

Notes

Kubeflow MPI requires containers to run as root, which will not work well when running on OpenShift or when PodSecurityPolicy is enabled in Kubernetes.

Hardware Requirements

(see the diagram below)

  • (Production only) Run:ai System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker machines designated for the Run:ai software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes they need:

    • 4 CPUs
    • 8GB of RAM
    • 50GB of disk space
  • Shared data volume: Run:ai uses Kubernetes to abstract away the machine on which a container is running:

    • Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, in order to read training data and code as well as save checkpoints, weights, and other machine-learning artifacts.
    • The Run:ai system needs to save data on a storage device that is not dependent on a specific node.

    Typically, this is achieved via Network File Storage (NFS) or Network-Attached Storage (NAS); a configuration sketch follows this list.

  • Docker Registry: With Run:ai, workloads are based on Docker images. For container images to run on any machine, these images must be downloaded from a Docker registry rather than reside on the local machine (though this is also possible). You can use a public registry such as Docker Hub or set up a local registry on-premises (preferably on a dedicated machine). Run:ai can assist with setting up the registry.

  • Kubernetes: Production Kubernetes installation requires separate nodes for the Kubernetes master. For more details see your specific Kubernetes distribution documentation.
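
To illustrate the shared data volume, below is a minimal sketch of an NFS-backed PersistentVolume and PersistentVolumeClaim. The server address, export path, and capacity are hypothetical placeholders, not values required by Run:ai.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-data
spec:
  capacity:
    storage: 100Gi            # hypothetical size
  accessModes:
    - ReadWriteMany           # mountable from any node
  nfs:
    server: 192.168.1.100     # hypothetical NFS server
    path: /exports/shared-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # bind to the pre-provisioned PV above
  volumeName: shared-data
  resources:
    requests:
      storage: 100Gi
EOF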

(Diagram: img/prerequisites.png)

User requirements

Usage of containers and images: The individual Researcher's work should be based on container images.

Network Requirements

Internal networking: Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. You should consult the documentation of the specific add-on on which ports to open. It is however important to note that unless special provisions are made, Kubernetes assumes all cluster nodes can interconnect using all ports.

Outbound network: The Run:ai user interface runs from the cloud. All container nodes must be able to connect to the Run:ai cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is proxied/limited, the following exceptions should be applied:

During Installation

Run:ai requires installation on the Kubernetes cluster. The installation accesses the web to download various images and packages. Some organizations place limitations on what can be pulled from the internet. The following table shows the various solution components and their origin:

| Name | Description | URLs | Ports |
| --- | --- | --- | --- |
| Run:ai Repository | The Run:ai package repository, hosted on Run:ai's account on Google Cloud | runai-charts.storage.googleapis.com | 443 |
| Docker Images Repository | Various Run:ai images | hub.docker.com, gcr.io/run-ai-prod | 443 |
| Docker Images Repository | Various third-party images | quay.io | 443 |
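
If outbound traffic is proxied or restricted, a quick reachability check (a sketch, run from a node or a pod with network access, assuming curl is available):

for host in runai-charts.storage.googleapis.com hub.docker.com gcr.io quay.io; do
  # Expect an HTTPS response from each host on port 443
  curl -sSI --max-time 10 https://$host > /dev/null && echo "$host reachable" || echo "$host BLOCKED"
done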

Post Installation

In addition, once running, Run:ai will send metrics to two sources:

| Name | Description | URLs | Ports |
| --- | --- | --- | --- |
| Grafana | Grafana metrics server | prometheus-us-central1.grafana.net | 443 |
| Run:ai | Run:ai cloud instance | app.run.ai | 443 |
| Auth0 | Authentication provider (older tenants only) | runai-prod.auth0.com | 443 |

Pre-install Script

Once you believe that the Run:ai prerequisites are met, we highly recommend installing and running the Run:ai pre-install diagnostics script. The tool:

  • Tests the requirements above as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
  • Looks at additional components installed and analyzes their relevance to a successful Run:ai installation.

To use the script, download the latest version and run:

# Make the downloaded binary executable, then run it (replace <platform> with the platform of the binary you downloaded)
chmod +x preinstall-diagnostics-<platform>
./preinstall-diagnostics-<platform>

If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:ai, locate the file runai-preinstall-diagnostics.txt in the current directory and send it to Run:ai technical support.

For more information on the script including additional command-line flags, see here.


Last update: April 17, 2022