# Prerequisites

Below are the prerequisites of a cluster installed with Run:ai.

## Prerequisites in a Nutshell

The following is a checklist of the Run:ai prerequisites:
| Prerequisite | Details |
|---|---|
| Kubernetes | Verify certified vendor and correct version. |
| NVIDIA GPU Operator | Different Kubernetes flavors have slightly different setup instructions. Verify correct version. |
| Ingress Controller | Install and configure NGINX (some Kubernetes flavors have NGINX pre-installed). Run:ai version 2.7 or earlier installs NGINX as part of the Run:ai cluster installation. |
| Prometheus | Install Prometheus. Run:ai version 2.8 or earlier installs Prometheus as part of the Run:ai cluster installation. |
| Trusted domain name | You must provide a trusted domain name (version 2.7: a cluster IP), accessible only inside the organization. |
| Cert manager | For RKE and EKS, you must install a certificate manager and configure Run:ai to use it. |
| (Optional) Distributed Training | Install Kubeflow MPI if required. |
| (Optional) Inference | Some third-party software needs to be installed to use the Run:ai inference module. |
There are also specific hardware, operating system and network access requirements. A pre-install script is available to test if the prerequisites are met before installation.
## Software Requirements

### Operating System
Run:ai will work on any Linux operating system that is supported by both Kubernetes and NVIDIA. Having said that, Run:ai performs its internal tests on Ubuntu 20.04 (and CoreOS for OpenShift). The exception is GKE (Google Kubernetes Engine) where Run:ai has only been certified on an Ubuntu image.
### Kubernetes
Run:ai requires Kubernetes. The latest Run:ai version supports Kubernetes versions 1.21 through 1.26 and OpenShift 4.8 to 4.11. For an up-to-date end-of-life statement of Kubernetes see Kubernetes Release History.
Run:ai does not support Pod Security Admission.
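One quick way to confirm the cluster version falls within the supported range (a sketch only, assuming `kubectl` is already configured against the cluster):

```bash
# Prints client and server versions; the server version should be within 1.21-1.26
# for the latest Run:ai release. On newer kubectl clients, plain `kubectl version` suffices.
kubectl version --short
```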
Run:ai has been tested with the following Kubernetes distributions:

| Target Platform | Description | Installation Notes |
|---|---|---|
| Vanilla Kubernetes | Using no specific distribution but rather k8s native installation | |
| OCP | OpenShift Container Platform | The Run:ai operator is certified for OpenShift by Red Hat. |
| EKS | Amazon Elastic Kubernetes Service | |
| AKS | Azure Kubernetes Services | |
| GKE | Google Kubernetes Engine | |
| RKE | Rancher Kubernetes Engine | When installing Run:ai, select On Premise. RKE2 has a defect which requires a specific installation flow. Please contact Run:ai customer support for additional details. |
| Bright | NVIDIA Bright Cluster Manager | In addition, NVIDIA DGX comes bundled with Run:ai. |
| Ezmeral | HPE Ezmeral Container Platform | See Run:ai at the Ezmeral marketplace. |
| Tanzu | VMware Kubernetes | Tanzu supports containerd rather than docker. See the NVIDIA prerequisites below as well as cluster customization for changes required for containerd. |
Run:ai provides instructions for a simple (non-production-ready) Kubernetes Installation.
Notes

- Kubernetes recommends using `systemd` as the container runtime cgroup driver. Kubernetes 1.22 and above defaults to `systemd`.
- Run:ai 2.8 or earlier supports Kubernetes Pod Security Policy if used. Pod Security Policy is deprecated and removed as of Kubernetes 1.25. As such, Run:ai has removed support for PSP in Run:ai 2.9.
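As an illustration only, assuming containerd is the container runtime (other runtimes are configured differently), you can check which cgroup driver is in effect:

```bash
# Look for the runc runtime options in the containerd configuration.
# For Kubernetes, the expected setting is: SystemdCgroup = true
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# If you change the value, restart containerd afterwards, e.g.:
#   sudo systemctl restart containerd
```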
### NVIDIA
Run:ai requires NVIDIA GPU Operator version 1.9 or 22.9. The interim versions (1.10 and 1.11) have a documented issue as per the note below.
Important
NVIDIA GPU Operator has a bug that affects metrics and scheduling. The bug affects NVIDIA GPU Operator versions 1.10 and 1.11 but does not exist in 1.9 and is resolved in 22.9.0. For more details see NVIDIA bug report.
Follow the Getting Started guide to install the NVIDIA GPU Operator.

For EKS:

- Do not install the NVIDIA device plug-in (the NVIDIA GPU Operator should install it instead). When using the `eksctl` tool to create an AWS EKS cluster, use the flag `--install-nvidia-plugin=false` to disable this install (see the sketch after this list).
- For GPU nodes, EKS uses an AMI which already contains the NVIDIA drivers. As such, you must use the GPU Operator flag `--set driver.enabled=false`.
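For illustration, a minimal `eksctl` sketch. The cluster name, node type, and node count are placeholders; only the last flag is the one required here:

```bash
eksctl create cluster --name my-gpu-cluster \
  --node-type p3.2xlarge --nodes 2 \
  --install-nvidia-plugin=false   # let the NVIDIA GPU Operator install the device plug-in instead
```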
For GKE:

Create the `gpu-operator` namespace by running `kubectl create ns gpu-operator`. Before installing the GPU Operator you must create the following `resourcequota.yaml` file:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gcp-critical-pods
  namespace: gpu-operator
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
```
Then run: `kubectl apply -f resourcequota.yaml`
Important
- Run:ai on GKE has only been tested with GPU Operator version 1.11.1 and up.
- The above only works for Run:ai 2.7.16 and above.
Install the NVIDIA GPU Operator as discussed here.
Notes

- Use the default namespace `gpu-operator`. Otherwise, you must specify the target namespace using the flag `runai-operator.config.nvidiaDcgmExporter.namespace` as described in customized cluster installation.
- NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag `--set driver.enabled=false`. DGX OS is one such example, as it comes bundled with NVIDIA drivers.
- To work with containerd (e.g. for Tanzu), use the `defaultRuntime` flag accordingly.
- To use Dynamic MIG, the GPU Operator must be installed with the flag `mig.strategy=mixed`. If the GPU Operator is already installed, edit the clusterPolicy by running `kubectl patch clusterPolicy cluster-policy -n gpu-operator --type=merge -p '{"spec":{"mig":{"strategy": "mixed"}}}'` (a Helm sketch follows this list).
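For reference, a hedged sketch of installing the GPU Operator with Helm in line with the notes above. The chart version and optional flags depend on your environment: drop `driver.enabled=false` when drivers are not pre-installed, and drop `mig.strategy=mixed` when Dynamic MIG is not needed:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Installs into the default gpu-operator namespace expected by Run:ai
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set mig.strategy=mixed
```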
### Ingress Controller
Version 2.8 and up.
Run:ai requires an ingress controller as a prerequisite. The Run:ai cluster installation configures one or more ingress objects on top of the controller.
There are many ways to install and configure an ingress controller and configuration is environment-dependent. A simple solution is to install & configure NGINX:
```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
  --namespace nginx-ingress --create-namespace \
  --set controller.kind=DaemonSet \
  --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # (1)
```

1. External and internal IP of one of the nodes.
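To confirm the controller rolled out as a DaemonSet and the service picked up the external IPs, one option (assuming the namespace used above) is:

```bash
# Lists the ingress controller DaemonSet and its service, including the configured external IPs
kubectl get daemonset,svc -n nginx-ingress
```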
For support of ingress controllers different than NGINX please contact Run:ai customer support.
Note
In a self-hosted installation, the typical scenario is to install the first Run:ai cluster on the same Kubernetes cluster as the control plane. In this case, there is no need to install an ingress controller as it is pre-installed by the control plane.
### Cluster URL
The Run:ai user interface requires a URL to the Kubernetes cluster. You can use a domain name (FQDN) or an IP Address (deprecated).
Note
In a self-hosted installation, the typical scenario is to install the first Run:ai cluster on the same Kubernetes cluster as the control plane. In this case, the cluster URL need not be provided as it will be the same as the control-plane URL.
#### Domain Name
Version 2.8 and up.
You must supply a domain name as well as a trusted certificate for that domain.
Use an HTTPS-based domain (e.g. https://my-cluster.com) as the cluster URL. Make sure that the DNS is configured with the cluster IP.
In addition, to configure HTTPS for your URL, you must create a TLS secret named `runai-cluster-domain-tls-secret` in the `runai` namespace. The secret should contain a trusted certificate for the domain:

```bash
kubectl create ns runai
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
  --cert /path/to/fullchain.pem \ # (1)
  --key /path/to/private.pem # (2)
```

1. The domain's cert (public key).
2. The domain's private key.
For more information on how to create a TLS secret see: https://kubernetes.io/docs/concepts/configuration/secret/#tls-secrets.
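As an optional sanity check before creating the secret (assuming `openssl` 1.1.1 or later is available and the same hypothetical certificate path as above), confirm the certificate actually covers the cluster domain:

```bash
# Shows the certificate subject and Subject Alternative Names;
# the cluster domain should appear among them
openssl x509 -in /path/to/fullchain.pem -noout -subject -ext subjectAltName
```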
#### Cluster IP
(Deprecated in version 2.8, no longer available in version 2.9)
Following are instructions on how to get the IP and set firewall settings. Depending on your environment, one of the following applies:

- Use the node IP of any of the Kubernetes nodes. Set up the firewall such that the IP is available to Researchers within the organization (but not outside the organization).
- Alternatively, use the node IPs of any of the Kubernetes nodes, both internal and external, in the format external-IP,internal-IP. Set up the firewall such that the external IP is available to Researchers within the organization (but not outside the organization).

You will need to externalize an IP address via a load balancer. If you do not have an existing load balancer already, install NGINX as follows:

```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx
```

Find the Cluster IP by running:
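A sketch only, assuming the default release name above (the controller service name follows the ingress-nginx chart's naming):

```bash
# Prints the load balancer IP assigned to the ingress controller service
kubectl get svc nginx-ingress-ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```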
### Prometheus

If not already installed on your cluster, install the full `kube-prometheus-stack` through the Prometheus community Operator.
Note
Due to a Prometheus bug described here you may need to apply Prometheus CRDs before installing Prometheus.
Then install the Prometheus stack by running:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace --set grafana.enabled=false # (1)
```

1. The Grafana component is not required for Run:ai.
The Run:ai cluster installation will, by default, install Prometheus, but it can also connect to an existing Prometheus instance installed by the organization. Such a configuration can only be performed together with Run:ai support. In the latter case, it's important to:

- Verify that both the Prometheus Node Exporter and kube-state-metrics are installed. Both are part of the default Prometheus installation (a quick check follows this list).
- Understand how Prometheus has been installed, whether directly or with the Prometheus Operator. The distinction is important during the Run:ai cluster installation.
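One way to confirm both components are present (assuming the `monitoring` namespace used in the installation above):

```bash
# Both a node-exporter pod per node and a kube-state-metrics pod should be listed
kubectl get pods -n monitoring | grep -E 'node-exporter|kube-state-metrics'
```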
### Cert Manager

Rancher Kubernetes Engine (RKE) and Amazon Elastic Kubernetes Service (EKS) require a certificate manager as described here. Example:
```bash
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```
For RKE only, you must then configure Run:ai to use the cert-manager. When creating a cluster on the Run:ai user interface:

- Download the "On Premise" Kubernetes type.
- Edit the cluster values file and change `useCertManager` to `true` (see the sketch after this list).
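For illustration, the relevant fragment of the downloaded cluster values file would then look roughly as follows; the surrounding keys are omitted here and may differ between Run:ai versions:

```yaml
# Excerpt of the Run:ai cluster values file (other keys omitted)
useCertManager: true
```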
### Distributed Training
Distributed training is the ability to run workloads on multiple nodes (not just multiple GPUs on the same node). Run:ai provides this capability via Kubeflow MPI. If you need this functionality, you will need to install the Kubeflow MPI Operator. Run:ai supports MPI version 0.3.0 (the latest version at the time of writing).
Use the following installation guide. As per the instructions, run:

- Verify that the `mpijob` custom resource does not currently exist in the cluster by running `kubectl get crds | grep mpijobs`. If it does, delete it by running `kubectl delete crd mpijobs.kubeflow.org`.
- Run `kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml` (a verification sketch follows this list).
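To verify the operator was applied, a minimal check (assuming the manifest above creates its resources in the `mpi-operator` namespace) is:

```bash
# The mpijobs CRD should exist and the operator pod should be Running
kubectl get crd mpijobs.kubeflow.org
kubectl get pods -n mpi-operator
```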
### Inference
To use the Run:ai inference module you must pre-install Knative Serving. Follow the instructions here to install. Run:ai is certified on Knative 1.4 to 1.8 with Kubernetes 1.22 or later.
Post-install, you must configure Knative to use the Run:ai scheduler and allow pod affinity, by running:
```bash
kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-schedulername":"enabled","kubernetes.podspec-affinity":"enabled"}}'
```
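To verify the patch took effect, one option is:

```bash
# Both kubernetes.podspec-schedulername and kubernetes.podspec-affinity should show "enabled"
kubectl get configmap config-features -n knative-serving -o yaml | grep podspec
```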
#### Inference Autoscaling

Run:ai allows you to autoscale a deployment according to various metrics:
- GPU Utilization (%)
- CPU Utilization (%)
- Latency (milliseconds)
- Throughput (requests/second)
- Concurrency
- Any custom metric
Additional installation may be needed for some of the metrics, as follows:

- Using Throughput or Concurrency does not require any additional installation.
- Any other metric will require installing the HPA Autoscaler.
- Using GPU Utilization, Latency or a custom metric will also require the Prometheus adapter. The Prometheus adapter is part of the Run:ai installer and can be added by setting the `prometheus-adapter.enabled` flag to `true`. See Customizing the Run:ai installation for further information.

If you wish to use an existing Prometheus adapter installation, you will need to configure it manually with the Run:ai Prometheus rules, specified in the Run:ai chart values under the `prometheus-adapter.rules` field. For further information please contact Run:ai customer support.
#### Accessing Inference from outside the Cluster

Inference workloads will typically be accessed by consumers residing outside the cluster. You will hence want to provide consumers with a URL to access the workload. The URL can be found in the Run:ai user interface under the deployment screen (alternatively, run `kubectl get ksvc -n <project-namespace>`).

However, for the URL to be accessible outside the cluster you must configure your DNS as described here.

Alternative Configuration

When the above DNS configuration is not possible, you can manually add the `Host` header to the REST request as follows:

- Get an `<external-ip>` by running `kubectl get service -n kourier-system kourier`. If you have been using istio during the Run:ai installation, run `kubectl -n istio-system get service istio-ingressgateway` instead.
- Send a request to your workload by using the external IP, and place the workload URL as a `Host` header, as in the example after this list.
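For example, a hypothetical request; the workload URL and path are placeholders, so substitute the values returned by the commands above and shown in the Run:ai user interface:

```bash
# <external-ip> is the kourier/istio service IP; the Host header carries the workload URL
curl -H "Host: my-workload.my-project.example.com" http://<external-ip>/
```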
## Hardware Requirements
- (Production only) Run:ai System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain two or more worker machines, designated for Run:ai software. The nodes do not have to be dedicated to Run:ai, but for Run:ai purposes we would need:

    - 4 CPUs
    - 8GB of RAM
    - 50GB of disk space

- Shared data volume: Run:ai uses Kubernetes to abstract away the machine on which a container is running:

    - Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.
    - The Run:ai system needs to save data on a storage device that is not dependent on a specific node.

    Typically, this is achieved via Network File Storage (NFS) or Network-attached storage (NAS).

- Docker Registry: With Run:ai, workloads are based on Docker images. For container images to run on any machine, these images must be downloaded from a docker registry rather than reside on the local machine (though this is also possible). You can use a public registry such as Docker Hub or set up a local registry on-prem (preferably on a dedicated machine). Run:ai can assist with setting up the repository.

- Kubernetes: Production Kubernetes installation requires separate nodes for the Kubernetes master. For more details see your specific Kubernetes distribution documentation.
## User requirements
Usage of containers and images: The individual Researcher's work should be based on container images.
## Network Access Requirements
Internal networking: Kubernetes networking is an add-on rather than a core part of Kubernetes. Different add-ons have different network requirements. You should consult the documentation of the specific add-on on which ports to open. It is however important to note that unless special provisions are made, Kubernetes assumes all cluster nodes can interconnect using all ports.
Outbound network: Run:ai user interface runs from the cloud. All container nodes must be able to connect to the Run:ai cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is proxied/limited, the following exceptions should be applied:
### During Installation

Run:ai requires an installation over the Kubernetes cluster. The installation accesses the web to download various images and registries. Some organizations place limitations on what you can pull from the internet. The following list shows the various solution components and their origin:
| Name | Description | URLs | Ports |
|---|---|---|---|
| Run:ai Repository | Run:ai Helm Package Repository | | 443 |
| Docker Images Repository | Run:ai images | gcr.io/run-ai-prod | 443 |
| Docker Images Repository | Third party images | | 443 |
| Cert Manager | (Run:ai version 2.7 or lower only) Creates a letsencrypt-based certificate for the cluster | 8.8.8.8, 1.1.1.1, dynu.com | 53 |
### Post Installation
In addition, once running, Run:ai requires an outbound network connection to the following targets:
| Name | Description | URLs | Ports |
|---|---|---|---|
| Grafana | Grafana Metrics Server | | 443 |
| Run:ai | Run:ai Cloud instance | | 443 |
| Cert Manager | (Run:ai version 2.7 or lower only) Creates a letsencrypt-based certificate for the cluster | 8.8.8.8, 1.1.1.1, dynu.com | 53 |
## Pre-install Script
Once you believe that the Run:ai prerequisites are met, we highly recommend installing and running the Run:ai pre-install diagnostics script. The tool:

- Tests the above requirements as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
- Looks at additional components installed and analyzes their relevance to a successful Run:ai installation.

To use the script, download the latest version of the script and run it.
If the script fails, or if the script succeeds but the Kubernetes system contains components other than Run:ai, locate the file `runai-preinstall-diagnostics.txt` in the current directory and send it to Run:ai technical support.
For more information on the script including additional command-line flags, see here.
Created: 2023-03-26