Install a Cluster
Prerequisites
Note
You must have Cluster Administrator rights to install these dependencies.
Before installing Run:ai, you must install NVIDIA software on your OpenShift cluster to enable GPUs. NVIDIA provides detailed documentation. Follow its instructions to install two operators, Node Feature Discovery and NVIDIA GPU Operator, from the OpenShift web console.
When done, verify that the GPU Operator is installed by running:
(The GPU Operator namespace may differ between operator versions.)
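A minimal sketch of such a check follows. It assumes the operator was installed into a namespace named `nvidia-gpu-operator`; that name and the helper function `verify_gpu_operator` are illustrative assumptions, so adjust them to match your installation.

```shell
# Assumed namespace; it may differ between GPU Operator versions.
GPU_OPERATOR_NAMESPACE="nvidia-gpu-operator"

# List the pods in the GPU Operator namespace so their status can be inspected.
verify_gpu_operator() {
  oc get pods --namespace "$GPU_OPERATOR_NAMESPACE"
}
```

After logging in with `oc login`, run `verify_gpu_operator` and check that the listed pods are in the Running or Completed state.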
Create OpenShift Projects
Run:ai cluster installation uses several namespaces (or projects in OpenShift terminology). Run the following:
The last namespace (`runai-scale-adjust`) is only required if the cluster is a cloud cluster and is configured for auto-scaling.
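Project creation can be sketched as a small loop over namespace names. The helper `create_runai_projects` is a hypothetical name, and the list of namespaces must come from the Run:ai instructions for your version; only `runai-scale-adjust` is named in the text above.

```shell
# Create one OpenShift project for each namespace name passed as an argument.
create_runai_projects() {
  for project in "$@"; do
    oc new-project "$project"
  done
}
```

For example, `create_runai_projects runai runai-reservation` would create two projects (both names are assumptions here), and `create_runai_projects runai-scale-adjust` adds the auto-scaling namespace on cloud clusters that need it.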
Cluster Installation
Follow the cluster installation instructions explained here. When creating a new cluster, select the OpenShift target platform.
Info
To install a specific version, add `--version <version>` to the install command. You can find available versions by running `helm search repo -l runai-cluster`.
The `helm install` command should use the `runai-cluster` tar file:
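A minimal sketch of an install from a local tar file is shown below. The release name, namespace, tar file name, and values file name are all hypothetical placeholders, as is the helper `install_from_tar`; substitute the values from your own download.

```shell
# Hypothetical placeholders; substitute the names from your own download.
RELEASE_NAME="runai-cluster"
TARGET_NAMESPACE="runai"
CHART_TAR="runai-cluster.tgz"
VALUES_FILE="values.yaml"

# Install the chart from the local tar file using the downloaded values file.
install_from_tar() {
  helm install "$RELEASE_NAME" "$CHART_TAR" \
    --namespace "$TARGET_NAMESPACE" \
    --values "$VALUES_FILE"
}
```

Calling `install_from_tar` runs a single `helm install` against the cluster selected by the current kubeconfig context.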
Optional Configuration
Make the following changes to the configuration file you have downloaded:
Key | Change | Description |
---|---|---|
runai-operator.config.project-controller.createNamespaces | true | Set to `false` if you do not want Run:ai to create namespaces, or if you prefer to create namespaces manually rather than use the Run:ai convention of `runai-<PROJECT-NAME>`. When set to `false`, an additional manual step is required when creating new Run:ai Projects. |
runai-operator.config.project-controller.clusterWideSecret | true | Set to `false` if you do not want Run:ai to create Kubernetes Secrets. When disabled, automatic secret propagation is not available. |
runai-operator.config.mps-server.enabled | Default is false | Allows the use of NVIDIA MPS. MPS is useful with Inference workloads. Requires extra permissions. |
runai-operator.config.runai-container-toolkit.enabled | true | Controls the usage of Fractions. Requires extra cluster permissions. |
runai-operator.config.runaiBackend.password | Default password already set | [email protected] password. Change it only if you have changed the password here. |
Tip
Use the `--dry-run` flag to gain an understanding of what is being installed before the actual installation. For more details see understanding cluster access roles.
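One way to use the dry-run output is to count the resource kinds it would create. The sketch below assumes the rendered manifests contain top-level `kind:` lines; the helpers `summarize_kinds` and `preview_install` and their argument order are illustrative, not part of Run:ai or Helm.

```shell
# Count the Kubernetes resource kinds in rendered manifests read from stdin.
summarize_kinds() {
  grep '^kind:' | awk '{print $2}' | sort | uniq -c | sort -rn
}

# Render the release with --dry-run (nothing is installed) and summarize it.
# Arguments: release name, chart (tar file or repo/chart), namespace, values file.
preview_install() {
  helm install "$1" "$2" --namespace "$3" --values "$4" --dry-run | summarize_kinds
}
```

A summary that lists kinds such as ClusterRole or ClusterRoleBinding shows where the chart asks for cluster-wide permissions, which is the part worth reviewing before the real installation.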
(Optional) Prometheus Adapter for Inference
The Prometheus adapter is required if you are using Inference workloads and require a custom metric for autoscaling. The following additional steps are required for it to work:
- Copy the `prometheus-adapter-prometheus-config` and `serving-certs-ca-bundle` ConfigMaps from the `openshift-monitoring` namespace to the `monitoring` namespace:

      kubectl get cm prometheus-adapter-prometheus-config --namespace=openshift-monitoring -o yaml \
        | sed 's/namespace: openshift-monitoring/namespace: monitoring/' \
        | kubectl create -f -
      kubectl get cm serving-certs-ca-bundle --namespace=openshift-monitoring -o yaml \
        | sed 's/namespace: openshift-monitoring/namespace: monitoring/' \
        | kubectl create -f -

- Allow the Prometheus Adapter `serviceaccount` to create a `SecurityContext` with `RunAsUser` 10001:
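A sketch of how such a permission might be granted on OpenShift is shown below. It assumes the `anyuid` security context constraint is acceptable for this purpose and that the adapter runs in the `monitoring` namespace; the service account name `prometheus-adapter` and the helper `allow_adapter_run_as_user` are hypothetical placeholders, so check the actual service account name in your cluster first.

```shell
# Hypothetical placeholders; adjust to the actual service account and namespace.
ADAPTER_NAMESPACE="monitoring"
ADAPTER_SERVICE_ACCOUNT="prometheus-adapter"

# Grant the anyuid SCC to the service account so that its pods may run with
# an explicit user ID such as 10001.
allow_adapter_run_as_user() {
  oc adm policy add-scc-to-user anyuid \
    -z "$ADAPTER_SERVICE_ACCOUNT" \
    -n "$ADAPTER_NAMESPACE"
}
```

The `anyuid` constraint is broader than strictly necessary for a single fixed user ID, so a narrower custom constraint may be preferable where cluster policy requires it.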
Next Steps
Continue to create Run:ai Projects.