Prerequisites
Below are the prerequisites of a cluster installed with Run:ai.
Software Requirements¶
Kubernetes¶
Run:ai has been tested with the following certified Kubernetes distributions:
Target Platform | Description | Notes |
---|---|---|
Vanilla Kubernetes | Using no specific distribution but rather a native Kubernetes installation | |
EKS | Amazon Elastic Kubernetes Service | |
AKS | Azure Kubernetes Services | |
GKE | Google Kubernetes Engine | GKE has a different software stack for NVIDIA. To install Run:ai on GKE please contact customer support. |
OCP | OpenShift Container Platform | The Run:ai operator is certified for OpenShift by Red Hat. |
RKE | Rancher Kubernetes Engine | When installing Run:ai, select On Premise. You must perform the mandatory extra step here. |
Ezmeral | HPE Ezmeral Container Platform | See Run:ai at the Ezmeral marketplace |
Tanzu | VMware Kubernetes | Tanzu supports containerd rather than docker. See the NVIDIA prerequisites below as well as cluster customization for changes required for containerd. |
Canonical Kubernetes | a.k.a. Charmed Kubernetes | |
A full list of Kubernetes partners can be found here: https://kubernetes.io/docs/setup/. In addition, Run:ai provides instructions for a simple (non-production-ready) Kubernetes installation.
Notes
- Run:ai requires Kubernetes. Supported versions are 1.19 through 1.23.
- Kubernetes recommends the use of `systemd` as the container runtime cgroup driver. Kubernetes 1.22 and above defaults to `systemd`. A quick way to check the current driver is sketched below.
- If you are using Red Hat OpenShift, Run:ai supports OpenShift 4.6 to 4.9.
- Run:ai supports Kubernetes Pod Security Policy if used.
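The commands below are a sketch only; the exact commands and file paths depend on your container runtime and OS:

```
# Docker-based nodes: report the cgroup driver currently in use
docker info 2>/dev/null | grep -i "cgroup driver"
# containerd-based nodes: the runc SystemdCgroup option in the containerd config
grep -n "SystemdCgroup" /etc/containerd/config.toml
```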
NVIDIA¶
Important
- If you are using DGX OS then NVIDIA prerequisites are already installed and you may skip to the next step.
- The combination of NVIDIA A100 hardware and the CoreOS operating system (which is popular when using OpenShift) will only work with GPU Operator version 1.8 or higher.
Follow the Getting Started guide to install the NVIDIA GPU Operator version 1.9 or higher.
- We recommend using the default namespace `gpu-operator`. Otherwise, you must specify the target namespace using the flag `runai-operator.config.nvidiaDcgmExporter.namespace` as described in customized cluster installation (see the sketch below).
- Note that the NVIDIA document contains a separate section for the case where the NVIDIA CUDA Toolkit is already installed on the nodes.
- To work with containerd (e.g. for Tanzu), change the defaultRuntime accordingly (see the sketch below).
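The following is an illustrative sketch only; chart value names may differ between GPU Operator and Run:ai versions, and `<gpu-operator-namespace>` is a placeholder:

```
# Install the NVIDIA GPU Operator with containerd as the runtime (e.g. for Tanzu).
# The 'nvidia' repo alias assumes: helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set operator.defaultRuntime=containerd

# If the GPU Operator lives in a non-default namespace, pass that namespace to the
# Run:ai customized cluster installation, for example as a value:
#   --set runai-operator.config.nvidiaDcgmExporter.namespace=<gpu-operator-namespace>
```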
Run:ai 2.3 or earlier
- Run:ai has customized the NVIDIA device plugin for Kubernetes and NVIDIA DCGM Exporter. Run the following to disable the existing plug-ins:
```
kubectl -n gpu-operator patch daemonset nvidia-device-plugin-daemonset \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
kubectl -n gpu-operator patch daemonset nvidia-dcgm-exporter \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
```
(use the namespace `gpu-operator-resources` if the NVIDIA GPU Operator is of version 1.8 or earlier)
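As an optional sanity check (a sketch; adjust the namespace per the note above), both daemonsets should report zero desired pods after the patch:

```
kubectl -n gpu-operator get daemonset nvidia-device-plugin-daemonset nvidia-dcgm-exporter
```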
Prometheus¶
The Run:ai Cluster installation will, by default, install Prometheus, but it can also connect to an existing Prometheus instance installed by the organization. In the latter case, it's important to:
- Verify that both the Prometheus Node Exporter and kube-state-metrics are installed. Both are part of the default Prometheus installation.
- Understand how Prometheus has been installed, whether directly or with the Prometheus Operator. The distinction is important during the Run:ai Cluster installation (see the checks below).
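The commands below are a sketch of how such checks might look, assuming the standard component names:

```
# Node Exporter and kube-state-metrics pods should be running somewhere in the cluster
kubectl get pods --all-namespaces | grep -E "node-exporter|kube-state-metrics"
# Presence of the Prometheus Operator CRDs suggests an Operator-based installation
kubectl get crds | grep monitoring.coreos.com
```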
Distributed Training via Kubeflow MPI¶
Distributed training is the ability to run workloads on multiple nodes (not just multiple GPUs on the same node). Run:ai provides this capability via Kubeflow MPI. If you need this functionality, you will need to install the Kubeflow MPI Operator. Use the following installation guide. Note that Run:ai does not currently support the latest MPI Operator version, but only version `v0.2.3`.
As per the instructions (a consolidated sketch of these steps follows the list):
- Verify that the `mpijob` custom resource does not currently exist in the cluster by running `kubectl get crds | grep mpijobs`. If it does, delete it by running `kubectl delete crd mpijobs.kubeflow.org`.
- Clone tag `v0.2.3` (and not master).
- Edit `mpi-operator/deploy/v1alpha2/mpi-operator.yaml`:
  - Search for `mpioperator/mpi-operator:latest` and change it to `mpioperator/mpi-operator:v0.2.3`.
  - Search for `mpioperator/kubectl-delivery:latest` and change it to `mpioperator/kubectl-delivery:v0.2.3`.
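The shell sketch below consolidates those steps. It assumes the upstream Kubeflow repository at github.com/kubeflow/mpi-operator and GNU sed; adjust paths if the manifest layout differs in the tag you clone:

```
# Remove a pre-existing MPIJob CRD, if any
kubectl get crds | grep mpijobs && kubectl delete crd mpijobs.kubeflow.org

# Clone the v0.2.3 tag (not master)
git clone --branch v0.2.3 https://github.com/kubeflow/mpi-operator.git
cd mpi-operator

# Pin the images to v0.2.3 instead of 'latest'
sed -i 's#mpioperator/mpi-operator:latest#mpioperator/mpi-operator:v0.2.3#' deploy/v1alpha2/mpi-operator.yaml
sed -i 's#mpioperator/kubectl-delivery:latest#mpioperator/kubectl-delivery:v0.2.3#' deploy/v1alpha2/mpi-operator.yaml

# Deploy the operator
kubectl apply -f deploy/v1alpha2/mpi-operator.yaml
```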
Notes
Kubeflow MPI requires containers to run as root, which will not work well when running on OpenShift or when PodSecurityPolicy is enabled in Kubernetes.
Hardware Requirements¶
- (Production only) Run:ai System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker machines designated for the Run:ai software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes we would need:
  - 4 CPUs
  - 8GB of RAM
  - 50GB of disk space
- Shared data volume: Run:ai uses Kubernetes to abstract away the machine on which a container is running:
  - Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.
  - The Run:ai system needs to save data on a storage device that is not dependent on a specific node.
  Typically, this is achieved via Network File Storage (NFS) or Network-Attached Storage (NAS). A sketch of exposing such storage to Kubernetes follows this list.
- Docker Registry: With Run:ai, workloads are based on Docker images. For container images to run on any machine, these images must be downloaded from a docker registry rather than reside on the local machine (though this is also possible). You can use a public registry such as Docker Hub or set up a local registry on-prem (preferably on a dedicated machine). Run:ai can assist with setting up the repository.
- Kubernetes: A production Kubernetes installation requires separate nodes for the Kubernetes master. For more details, see your specific Kubernetes distribution's documentation.
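The example below is illustrative only, showing one common way to make an NFS/NAS export available to workloads as a Kubernetes PersistentVolume; the server address, export path, names, and sizes are placeholders:

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: runai-shared-data        # placeholder name
spec:
  capacity:
    storage: 500Gi               # placeholder size
  accessModes:
    - ReadWriteMany              # mountable from any node simultaneously
  nfs:
    server: nfs.example.local    # placeholder NFS/NAS server
    path: /export/runai-data     # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runai-shared-data
  namespace: default             # placeholder namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""           # bind to the statically provisioned PV above
  volumeName: runai-shared-data
  resources:
    requests:
      storage: 500Gi
EOF
```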
User Requirements¶
Usage of containers and images: The individual Researcher's work should be based on container images.
Network Requirements¶
Internal networking: Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. You should consult the documentation of the specific add-on on which ports to open. It is however important to note that unless special provisions are made, Kubernetes assumes all cluster nodes can interconnect using all ports.
Outbound network: The Run:ai user interface runs from the cloud. All container nodes must be able to connect to the Run:ai cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is proxied/limited, the following exceptions should be applied:
During Installation¶
Run:ai requires an installation on the Kubernetes cluster. The installation accesses the web to download various images from several registries. Some organizations place limitations on what can be pulled from the internet. The following list shows the various solution components and their origin:
Name | Description | URLs | Ports |
---|---|---|---|
Run:ai Repository | The Run:ai Package Repository is hosted on Run:ai’s account on Google Cloud | | 443 |
Docker Images Repository | Various Run:ai images | gcr.io/run-ai-prod | 443 |
Docker Images Repository | Various third-party images | | 443 |
Post Installation¶
In addition, once running, Run:ai will send metrics to two sources:
Name | Description | URLs | Ports |
---|---|---|---|
Grafana | Grafana Metrics Server | prometheus-us-central1.grafana.net | 443 |
Run:ai | Run:ai Cloud instance | | 443 |
Auth0 | Authentication Provider (older tenants only) | | 443 |
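As a rough connectivity check from a cluster node (a sketch only, covering the endpoints whose URLs appear in the tables above; any HTTP response code means the connection on port 443 itself succeeded):

```
for host in gcr.io prometheus-us-central1.grafana.net; do
  # -o /dev/null discards the body; we only care that the TLS connection completes
  curl -sS -o /dev/null -w "$host: HTTP %{http_code}\n" "https://$host" || echo "$host: unreachable"
done
```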
Pre-install Script¶
Once you believe that the Run:ai prerequisites are met, we highly recommend installing and running the Run:ai pre-install diagnostics script. The tool:
- Tests the requirements described above as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
- Looks at additional components installed and analyzes their relevancy to a successful Run:ai installation.
To use the script, download the latest version and run:
```
chmod +x preinstall-diagnostics-<platform>
./preinstall-diagnostics-<platform>
```
If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:ai, locate the file `runai-preinstall-diagnostics.txt` in the current directory and send it to Run:ai technical support.
For more information on the script including additional command-line flags, see here.