Install a Cluster
Install NVIDIA Dependencies¶
Note
You must have Cluster Administrator rights to install these dependencies.
Before installing Run:ai, you must install NVIDIA software on your OpenShift cluster to enable GPUs. NVIDIA provides detailed documentation for this. Follow its instructions to install the two operators, Node Feature Discovery and NVIDIA GPU Operator, from the OpenShift web console.
When done, verify that the GPU Operator is installed by checking that its pods are running (the GPU Operator namespace may differ between operator versions).
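A typical verification, assuming the operator was installed into the nvidia-gpu-operator namespace (adjust the namespace to match your operator version):

```shell
# Assumption: the default namespace of recent GPU Operator versions.
# Older versions may use a different namespace, e.g. gpu-operator-resources.
oc get pods -n nvidia-gpu-operator
```

All pods should reach Running or Completed status before you continue.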
Create OpenShift Projects¶
Run:ai cluster installation uses several namespaces (projects, in OpenShift terminology). The installation creates these namespaces automatically, but if your organization requires namespaces to be created manually, you must create them before installing. The last namespace (runai-scale-adjust) is only required if the cluster is a cloud cluster and is configured for auto-scaling.
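The full namespace list is not reproduced here. As an illustration only, assuming the list includes the runai namespace used elsewhere in this guide plus runai-scale-adjust, manual creation would look like:

```shell
# Illustrative sketch: create the Run:ai namespaces (projects) manually.
# The complete namespace list should be taken from the Run:ai documentation.
oc new-project runai
oc new-project runai-scale-adjust   # only for auto-scaling cloud clusters
```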
Monitoring Pre-check¶
Run:ai uses the OpenShift monitoring stack, and therefore needs to create or change the OpenShift monitoring configuration. Check whether a cluster-monitoring-config ConfigMap already exists in the openshift-monitoring namespace.
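One way to check, using the ConfigMap name referenced in this section:

```shell
# Returns the ConfigMap if it exists; errors with NotFound otherwise.
oc get configmap cluster-monitoring-config -n openshift-monitoring
```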
If it does:

- Add the createOpenshiftMonitoringConfig flag to the cluster values file, as described under Cluster Installation below.
- Post-installation, edit the ConfigMap by running oc edit configmap cluster-monitoring-config -n openshift-monitoring and add the following:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      scrapeInterval: "10s"
      evaluationInterval: "10s"
      externalLabels:
        clusterId: <CLUSTER_ID>
        prometheus: ""
        prometheus_replica: ""
```
For <CLUSTER_ID>, use the Cluster UUID field shown in the Run:ai user interface under the Clusters area.

Cluster Installation¶
Perform the cluster installation instructions explained here. When creating a new cluster, select the OpenShift target platform.
Attention
The cluster wizard shows extra commands which are unique to OpenShift. Remember to run them all.
Optional configuration¶
Make the following changes to the configuration file you have downloaded:
| Key | Change | Description |
|---|---|---|
| `createOpenshiftMonitoringConfig` | `false` | See Monitoring Pre-check above. |
| `runai-operator.config.project-controller.createNamespaces` | `true` | Set to `false` if you are unwilling to give Run:ai the ability to create namespaces, or want to create namespaces manually rather than use the Run:ai convention of `runai-<PROJECT-NAME>`. When set to `false`, an additional manual step is required when creating new Run:ai Projects. |
| `runai-operator.config.mps-server.enabled` | Default is `false` | Allows the use of NVIDIA MPS. MPS is useful with Inference workloads. Requires extra permissions. |
| `runai-operator.config.runai-container-toolkit.enabled` | Default is `true` | Controls the use of Fractions. Requires extra permissions. |
| `runai-operator.config.runaiBackend.password` | Default password already set | The admin@run.ai password. Change only if you have changed this password. |
| `runai-operator.config.global.prometheusService.address` | The address of the default Prometheus Service | If you installed your own custom Prometheus Service, change this to its address. |
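As an illustration only, overriding two of the keys above in the downloaded values file might look like the fragment below. The nesting is inferred from the dotted key paths in the table; verify it against the structure of your actual file:

```yaml
# Hypothetical excerpt of runai-<cluster-name>.yaml overrides.
createOpenshiftMonitoringConfig: false
runai-operator:
  config:
    project-controller:
      createNamespaces: false
```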
Run:

```shell
oc label ns runai openshift.io/cluster-monitoring=true
oc -n openshift-ingress-operator patch ingresscontroller/default --patch '{"spec":{"routeAdmission":{"namespaceOwnership":"InterNamespaceAllowed"}}}' --type=merge
helm install runai-cluster -n runai \
    runai-cluster-<version>.tgz -f runai-<cluster-name>.yaml
```

Follow the instructions of the Cluster Wizard.

Info

To install a specific version, add --version <version> to the install command. You can find available versions by running helm search repo -l runai-cluster.
Tip

Use the --dry-run flag to see what will be installed before the actual installation. For more details, see understanding cluster access roles.
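For example, a dry-run of the install command from this guide (keeping the same <version> and <cluster-name> placeholders) would be:

```shell
# Renders the manifests without installing anything on the cluster.
helm install runai-cluster -n runai \
    runai-cluster-<version>.tgz -f runai-<cluster-name>.yaml --dry-run
```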
Connect Run:ai to GPU Operator¶
(Optional) Prometheus Adapter for Inference¶
The Prometheus adapter is required if you are using Inference workloads and require a custom metric for autoscaling. The following additional steps are required for it to work:

- Copy the prometheus-adapter-prometheus-config and serving-certs-ca-bundle ConfigMaps from the openshift-monitoring namespace to the monitoring namespace:

```shell
kubectl get cm prometheus-adapter-prometheus-config --namespace=openshift-monitoring -o yaml \
    | sed 's/namespace: openshift-monitoring/namespace: monitoring/' \
    | kubectl create -f -

kubectl get cm serving-certs-ca-bundle --namespace=openshift-monitoring -o yaml \
    | sed 's/namespace: openshift-monitoring/namespace: monitoring/' \
    | kubectl create -f -
```
- Allow the Prometheus Adapter serviceaccount to create a SecurityContext with RunAsUser 10001:
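The exact command is omitted above. One common way to permit a fixed non-root UID on OpenShift is to bind the service account to the nonroot SCC; the service account name below is an assumption, so check your Prometheus Adapter deployment for the actual name:

```shell
# Assumption: the Prometheus Adapter runs under the custom-metrics-apiserver
# service account in the monitoring namespace. The nonroot SCC allows pods
# to run as any non-root UID, including 10001.
oc adm policy add-scc-to-user nonroot -z custom-metrics-apiserver -n monitoring
```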
Next Steps¶
Continue to create Run:ai Projects.
Created: 2023-03-26