
Prerequisites

Below are the prerequisites for a cluster installed with Run:ai.

Software Requirements

Kubernetes

Run:ai requires Kubernetes. The latest Run:ai version supports Kubernetes versions 1.21 through 1.24. Kubernetes 1.25 is not yet supported. For Red Hat OpenShift, Run:ai supports OpenShift 4.8 through 4.10.

Run:ai has been tested with the following Kubernetes distributions:

Target Platform | Description | Installation Notes
Vanilla Kubernetes | A native Kubernetes installation, not tied to a specific distribution |
EKS | Amazon Elastic Kubernetes Service |
AKS | Azure Kubernetes Service |
GKE | Google Kubernetes Engine | GKE has a different software stack for NVIDIA. To install Run:ai on GKE, please contact customer support.
OCP | OpenShift Container Platform | The Run:ai operator is certified for OpenShift by Red Hat.
RKE | Rancher Kubernetes Engine | When installing Run:ai, select On Premise. You must perform the mandatory extra step here. RKE2 has a defect which requires a specific installation; please contact Run:ai customer support for additional details.
Ezmeral | HPE Ezmeral Container Platform | See Run:ai at the Ezmeral marketplace.
Tanzu | VMware Tanzu Kubernetes | Tanzu supports containerd rather than Docker. See the NVIDIA prerequisites below, as well as cluster customization, for the changes required for containerd.
Canonical Kubernetes | a.k.a. Charmed Kubernetes |

Run:ai provides instructions for a simple (non-production-ready) Kubernetes Installation.

Notes

  • Kubernetes recommends using systemd as the container runtime cgroup driver. Kubernetes 1.22 and above defaults to systemd. A minimal containerd configuration sketch is shown after these notes.
  • Run:ai supports Kubernetes Pod Security Policy if used. Pod Security Policy is deprecated and will be removed from Kubernetes (and from Run:ai) with the introduction of Kubernetes 1.25.
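
If your nodes use containerd as the container runtime, the cgroup driver is set in the containerd configuration. The following is a minimal sketch only; the file path and section names follow the standard containerd layout and should be verified against your distribution:

# /etc/containerd/config.toml (excerpt)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

After changing the configuration, restart containerd (for example, sudo systemctl restart containerd).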

NVIDIA

Important

NVIDIA GPU Operator has a bug which affects metrics and scheduling. The bug affects NVIDIA GPU Operator versions 1.10 and 1.11 but does not exist in 1.9. We recommend using version 1.9 only. For more details, see the bug details.

Follow the Getting Started guide to install the NVIDIA GPU Operator version 1.9 or higher.

On Amazon EKS:

  • Do not install the NVIDIA device plug-in separately (the NVIDIA GPU Operator installs it instead). When using the eksctl tool to create an AWS EKS cluster, use the flag --install-nvidia-plugin=false to disable this installation.
  • For GPU nodes, EKS uses an AMI which already contains the NVIDIA drivers. As such, you must use the GPU Operator flags --set driver.enabled=false --set toolkit.enabled=false --set migManager.enabled=false. A sketch of such an installation follows this list.
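
A minimal sketch of such an installation via Helm, assuming the standard NVIDIA Helm repository and chart name (the chart version shown is illustrative; verify it against NVIDIA's documentation):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Drivers and the container toolkit already exist on the EKS GPU AMI, so disable them:
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v1.9.1 \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set migManager.enabled=false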

On Google GKE, create the gpu-operator namespace by running:

kubectl create ns gpu-operator

Before installing the GPU Operator, you must create the following file:

resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gcp-critical-pods
  namespace: gpu-operator
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical

Then run: kubectl apply -f resourcequota.yaml
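
As an optional sanity check, confirm that the quota exists in the gpu-operator namespace:

kubectl get resourcequota gcp-critical-pods -n gpu-operator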

Important

  • Run:ai on GKE has only been tested with GPU Operator version 1.11.1 and up.
  • The above only works for Run:ai 2.7.16 and above.

Install the NVIDIA GPU Operator as discussed here.

Notes

  • Use the default namespace gpu-operator. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.
  • NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flags --set driver.enabled=false --set toolkit.enabled=false --set migManager.enabled=false.
  • To work with containerd (e.g. for Tanzu), use the defaultRuntime flag accordingly.
  • To use Dynamic MIG, the GPU Operator must be installed with the flag mig.strategy=mixed. If the GPU Operator is already installed, edit the clusterPolicy by running kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}'
  • If you are using DGX OS then NVIDIA prerequisites are already installed and you may skip to the next step.
Run:ai 2.3 or earlier

For Run:ai 2.3 or earlier, disable the NVIDIA device plug-in and DCGM exporter daemonsets by applying a node selector that matches no node:

kubectl -n gpu-operator patch daemonset nvidia-device-plugin-daemonset \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'

(Use the namespace gpu-operator-resources if the NVIDIA GPU Operator is version 1.8 or earlier.)
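
Once the GPU Operator is running, a quick way to verify that GPUs are exposed to Kubernetes (a generic check, not Run:ai-specific) is:

# All GPU Operator pods should be Running or Completed:
kubectl get pods -n gpu-operator
# Each GPU node should advertise the nvidia.com/gpu resource:
kubectl describe nodes | grep -i nvidia.com/gpu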

Operating System

Run:ai will work on any Linux operating system that is supported by both Kubernetes and NVIDIA. That said, Run:ai performs its internal tests on Ubuntu 20.04 (and on CoreOS for OpenShift). The exception is GKE (Google Kubernetes Engine), where Run:ai currently works only with Ubuntu.

Prometheus

The Run:ai Cluster installation will, by default, install Prometheus, but it can also connect to an existing Prometheus instance installed by the organization. In the latter case, it's important to:

Inference

Version 2.5 and up.

To use the Run:ai inference module you must pre-install Knative Serving. Follow the instructions here to install. Run:ai is certified on Knative 1.4 and 1.5 with Kubernetes 1.22 or later.
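
For reference, a minimal sketch of installing Knative Serving 1.4 from its release manifests, with Kourier as the networking layer (the manifest URLs and versions are illustrative and should be checked against the Knative installation guide linked above):

# Knative Serving CRDs and core components:
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.4.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.4.0/serving-core.yaml
# A networking layer, for example Kourier:
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.4.0/kourier.yaml
kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'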

Post-install, you must configure Knative to use the Run:ai scheduler by running:

kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-schedulername":"enabled"}}'

Inference Autoscaling

Run:ai can autoscale a deployment according to various metrics:

  1. GPU Utilization (%)
  2. CPU Utilization (%)
  3. Latency (milliseconds)
  4. Throughput (requests/second)
  5. Concurrency
  6. Any custom metric

Additional installation may be needed for some of the metrics as follows:

  • Using Throughput or Concurrency does not require any additional installation.
  • Any other metric will require installing the HPA Autoscaler.
  • Using GPU Utilization, Latency, or a custom metric will also require the Prometheus adapter. The Prometheus adapter is part of the Run:ai installer and can be added by setting the prometheus-adapter.enabled flag to true. See Customizing the Run:ai installation for further information.

If you wish to use an existing Prometheus adapter installation, you will need to configure it manually with the Run:ai Prometheus rules, specified in the Run:ai chart values under the prometheus-adapter.rules field. For further information, please contact Run:ai customer support.
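
As an illustration, enabling the bundled Prometheus adapter through the Run:ai cluster Helm values might look like the following sketch (the surrounding values layout is assumed and may differ between chart versions):

# values.yaml excerpt (hypothetical layout)
prometheus-adapter:
  enabled: true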

Distributed Training via Kubeflow MPI

Distributed training is the ability to run workloads on multiple nodes (not just multiple GPUs on the same node). Run:ai provides this capability via Kubeflow MPI. If you need this functionality, you will need to install the Kubeflow MPI Operator.

Use the following installation guide. As per the instructions (a verification sketch follows this list):

  • Verify that the mpijob custom resource does not currently exist in the cluster by running kubectl get crds | grep mpijobs. If it does, delete it by running kubectl delete crd mpijobs.kubeflow.org.
  • Run kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
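
To confirm the operator is up (a quick check; the manifest above creates its own namespace, assumed here to be mpi-operator):

kubectl get crd mpijobs.kubeflow.org
kubectl get pods -n mpi-operator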

To install MPI Operator v0.2.3 instead, use the following installation guide. As per the instructions (a scripted sketch follows this list):

  • Clone tag v0.2.3 (and not master).
  • Edit mpi-operator/deploy/v1alpha2/mpi-operator.yaml (for example with vi).
  • Search for mpioperator/mpi-operator:latest and change it to mpioperator/mpi-operator:v0.2.3.
  • Search for mpioperator/kubectl-delivery:latest and change it to mpioperator/kubectl-delivery:v0.2.3.
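
The same steps as a scripted sketch, assuming the upstream kubeflow/mpi-operator repository and the file path named above:

git clone --branch v0.2.3 https://github.com/kubeflow/mpi-operator.git
cd mpi-operator
# Pin the images referenced by the manifest to the v0.2.3 tag:
sed -i 's#mpioperator/mpi-operator:latest#mpioperator/mpi-operator:v0.2.3#' deploy/v1alpha2/mpi-operator.yaml
sed -i 's#mpioperator/kubectl-delivery:latest#mpioperator/kubectl-delivery:v0.2.3#' deploy/v1alpha2/mpi-operator.yaml
# Then apply the edited manifest:
kubectl apply -f deploy/v1alpha2/mpi-operator.yaml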

Notes

Kubeflow MPI requires containers to run as root. This requires special adjustments when running on OpenShift or when PodSecurityPolicy is enabled in Kubernetes. Red Hat provides an article on how to configure MPI jobs to run as root on OpenShift.

Hardware Requirements

(see the diagram below)

  • (Production only) Run:ai System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker machines designated for the Run:ai software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes we would need:

    • 4 CPUs
    • 8GB of RAM
    • 50GB of Disk space
  • Shared data volume: Run:ai uses Kubernetes to abstract away the machine on which a container is running:

    • Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.
    • The Run:ai system needs to save data on a storage device that is not dependent on a specific node.

    Typically, this is achieved via Network File Storage (NFS) or Network-attached storage (NAS).

  • Docker Registry: With Run:ai, workloads are based on Docker images. For container images to run on any machine, these images must be downloaded from a Docker registry rather than reside on the local machine (though that is also possible). You can use a public registry such as Docker Hub or set up a local registry on-prem (preferably on a dedicated machine). Run:ai can assist with setting up the registry.

  • Kubernetes: Production Kubernetes installation requires separate nodes for the Kubernetes master. For more details see your specific Kubernetes distribution documentation.

[Figure: hardware prerequisites diagram (prerequisites.png)]

User requirements

Usage of containers and images: The individual Researcher's work should be based on container images.

Network Requirements

Internal networking: Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. You should consult the documentation of the specific add-on on which ports to open. It is however important to note that unless special provisions are made, Kubernetes assumes all cluster nodes can interconnect using all ports.

Outbound network: The Run:ai user interface runs from the cloud. All container nodes must be able to connect to the Run:ai cloud. Inbound connectivity (connecting from the cloud into the nodes) is not required. If outbound connectivity is proxied or limited, the following exceptions should be applied:

During Installation

Run:ai requires an installation over the Kubernetes cluster. The installation accesses the web to download various images and registries. Some organizations place limitations on what can be pulled from the internet. The following table shows the various solution components and their origin:

Name | Description | URLs | Ports
Run:ai Repository | The Run:ai package repository is hosted on Run:ai's account on Google Cloud | runai-charts.storage.googleapis.com | 443
Docker Images Repository | Various Run:ai images | hub.docker.com, gcr.io/run-ai-prod | 443
Docker Images Repository | Various third-party images | quay.io | 443
Cert Manager | Creates a letsencrypt-based certificate for the cluster | 8.8.8.8, 1.1.1.1, dynu.com | 53

Post Installation

In addition, once running, Run:ai requires an outbound network connection to the following targets:

Name | Description | URLs | Ports
Grafana | Grafana Metrics Server | prometheus-us-central1.grafana.net | 443
Run:ai | Run:ai Cloud instance | app.run.ai | 443
Auth0 | Authentication Provider (older tenants only) | runai-prod.auth0.com | 443
Cert Manager | Creates a letsencrypt-based certificate for the cluster | 8.8.8.8, 1.1.1.1, dynu.com | 53

Cluster URL

The Run:ai user interface requires a URL to the Kubernetes cluster. This requirement is relevant for SaaS installations only.

Following are instructions on how to get the IP and set firewall settings. Depending on your configuration, do one of the following (the command after this list shows how to look up node IPs):

  • Use the node IP of any of the Kubernetes nodes, and set up the firewall such that the IP is available to Researchers running within the organization (but not outside the organization).
  • Use the node IPs of any of the Kubernetes nodes, providing both the external and internal IP in the format external-IP,internal-IP, and set up the firewall such that the external IP is available to Researchers running within the organization (but not outside the organization).
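
To list the internal and external IPs of the cluster nodes (a generic check, not Run:ai-specific), run:

kubectl get nodes -o wide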

You will need to externalize an IP address via a load balancer. If you do not already have a load balancer, install NGINX as follows:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx

Find the cluster IP by running one of the following, depending on whether your provider exposes the load balancer as a hostname or as an IP address:

echo $(kubectl get svc nginx-ingress-ingress-nginx-controller -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo $(kubectl get svc nginx-ingress-ingress-nginx-controller -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')

Pre-install Script

Once you believe that the Run:ai prerequisites are met, we highly recommend installing and running the Run:ai pre-install diagnostics script. The tool:

  • Tests the requirements above as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
  • Looks at additional components installed and analyzes their relevancy to a successful Run:ai installation.

To use the script, download the latest version and run:

chmod +x preinstall-diagnostics-<platform>
./preinstall-diagnostics-<platform>

If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:ai, locate the file runai-preinstall-diagnostics.txt in the current directory and send it to Run:ai technical support.

For more information on the script including additional command-line flags, see here.


Last update: 2022-09-20
Created: 2020-07-19