Install a Cluster

Prerequisites

Note

You must have Cluster Administrator rights to install these dependencies.

Before installing Run:ai, you must install NVIDIA software on your OpenShift cluster to enable GPUs. NVIDIA provides detailed documentation for this. Follow its instructions to install two operators, Node Feature Discovery and the NVIDIA GPU Operator, from the OpenShift web console.

When done, verify that the GPU Operator is installed by running:

oc get pods -n nvidia-gpu-operator

The GPU Operator namespace may differ between operator versions.
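As a minimal sketch of how this check could be automated, the helper below filters the pod listing for pods that are neither Running nor Completed. The function name `not_ready_pods` is hypothetical, and the sketch assumes the default column layout of `oc get pods`, in which STATUS is the third column.

```shell
# Hypothetical helper (not part of Run:ai or OpenShift).
# Reads `oc get pods --no-headers` output on stdin and prints the names of
# pods whose STATUS column is neither Running nor Completed.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage (assumption: the operator lives in nvidia-gpu-operator; adjust if not):
#   oc get pods -n nvidia-gpu-operator --no-headers | not_ready_pods
```

An empty result suggests that every pod in the namespace has started or finished successfully.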

Create OpenShift Projects

The Run:ai cluster installation uses several namespaces (called projects in OpenShift terminology). Create them by running:

oc new-project runai
oc new-project runai-reservation
oc new-project runai-scale-adjust

The last namespace (runai-scale-adjust) is only required if the cluster is a cloud cluster and is configured for auto-scaling.
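The conditional nature of the last project could be scripted roughly as follows. This is only a sketch: the function `runai_projects` and the `AUTOSCALING` variable are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: prints the project names the installation needs,
# one per line.
# $1: "true" if the cluster is a cloud cluster configured for auto-scaling
#     (defaults to "false").
runai_projects() {
  echo runai
  echo runai-reservation
  if [ "${1:-false}" = "true" ]; then
    echo runai-scale-adjust
  fi
}

# Usage (AUTOSCALING is an assumed variable you set yourself):
#   for project in $(runai_projects "$AUTOSCALING"); do
#     oc new-project "$project"
#   done
```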

Cluster Installation

Connected

Follow the cluster installation instructions explained here. When creating a new cluster, select the OpenShift target platform.

Info

To install a specific version, add --version <version> to the install command. You can find available versions by running helm search repo -l runai-cluster.

Airgapped

Follow the cluster installation instructions explained here. When creating a new cluster, select the OpenShift target platform. The helm install command should use the runai-cluster tar file:

helm install runai-cluster -n runai \
  runai-cluster-<version>.tgz -f runai-<cluster-name>.yaml

Optional Configuration

Make the following changes to the configuration file you have downloaded:

Key	Change	Description
`runai-operator.config.project-controller.createNamespaces`	`true`	Set to `false` if you do not want Run:ai to create namespaces, or if you prefer to create namespaces manually rather than use the Run:ai convention of `runai-<PROJECT-NAME>`. When set to `false`, creating a new Run:ai Project requires an additional manual step.
`runai-operator.config.project-controller.clusterWideSecret`	`true`	Set to `false` if you do not want Run:ai to create Kubernetes Secrets. When set to `false`, automatic secret propagation is not available.
`runai-operator.config.mps-server.enabled`	Default is `false`	Allows the use of NVIDIA MPS, which is useful with Inference workloads. Requires extra permissions.
`runai-operator.config.runai-container-toolkit.enabled`	`true`	Controls the use of Fractions. Requires extra cluster permissions.
`runai-operator.config.runaiBackend.password`	Default password already set	Password of the Run:ai backend user. Change it only if you have changed the password here.

Tip

Use the --dry-run flag to see what will be installed before performing the actual installation. For more details, see understanding cluster access roles.
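As a rough illustration of where the flag goes, assuming it is appended to the airgapped helm install command shown earlier, the helper below only assembles and prints the command rather than running it. The function name `dry_run_cmd` and the file names in the usage line are hypothetical placeholders.

```shell
# Hypothetical helper: prints a helm install command with --dry-run appended.
# $1: path to the runai-cluster chart archive
# $2: path to the downloaded cluster values file
dry_run_cmd() {
  chart=$1
  values=$2
  printf 'helm install runai-cluster -n runai %s -f %s --dry-run\n' \
    "$chart" "$values"
}

# Usage (placeholder file names; substitute your own):
#   dry_run_cmd runai-cluster-VERSION.tgz runai-CLUSTERNAME.yaml
```

Printing the command first makes it easy to review before pasting it into a shell with cluster access.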

(Optional) Prometheus Adapter for Inference

The Prometheus adapter is required if you are using Inference workloads and require a custom metric for autoscaling. The following additional steps are required for it to work:

Copy the prometheus-adapter-prometheus-config and serving-certs-ca-bundle ConfigMaps from the openshift-monitoring namespace to the monitoring namespace:

kubectl get cm prometheus-adapter-prometheus-config --namespace=openshift-monitoring -o yaml \
  | sed 's/namespace: openshift-monitoring/namespace: monitoring/' \
  | kubectl create -f -
kubectl get cm serving-certs-ca-bundle --namespace=openshift-monitoring -o yaml \
  | sed 's/namespace: openshift-monitoring/namespace: monitoring/' \
  | kubectl create -f -

Allow the Prometheus Adapter service account to create a SecurityContext with RunAsUser 10001:

oc adm policy add-scc-to-user anyuid system:serviceaccount:monitoring:runai-cluster-prometheus-adapter
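The ConfigMap copy in the first step repeats the same pipeline for each ConfigMap, so it could also be written as a loop. The sketch below is one possible way to do that, under the assumption that the same sed substitution is sufficient for both ConfigMaps; the function name `retarget_namespace` is hypothetical.

```shell
# Hypothetical helper: rewrites the namespace field of a manifest read on stdin.
# $1: source namespace, $2: target namespace
retarget_namespace() {
  sed "s/namespace: $1/namespace: $2/"
}

# Usage: copy both ConfigMaps with a single loop.
#   for cm in prometheus-adapter-prometheus-config serving-certs-ca-bundle; do
#     kubectl get cm "$cm" --namespace=openshift-monitoring -o yaml \
#       | retarget_namespace openshift-monitoring monitoring \
#       | kubectl create -f -
#   done
```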

Next Steps

Continue to create Run:ai Projects.