
(Optional) Customize Cluster Installation

The Run:ai cluster creation wizard requires downloading a Helm values file, runai-<cluster-name>.yaml. You can edit this file to customize the cluster installation.

Configuration Flags

| Key | Default | Description |
|-----|---------|-------------|
| runai-operator.config.project-controller.createNamespaces | true | Set to false if unwilling to provide Run:ai the ability to create namespaces. When set to false, creating a new Run:ai Project requires an additional manual step |
| runai-operator.config.project-controller.clusterWideSecret | true | Set to false when using PodSecurityPolicy or OpenShift |
| runai-operator.config.mps-server.enabled | false | Set to true to allow the use of NVIDIA MPS. MPS is useful for Inference workloads |
| runai-operator.config.global.runtime | docker | Defines the container runtime of the cluster (supports docker and containerd). Set to containerd when using Tanzu |
| runai-operator.config.global.nvidiaDcgmExporter.namespace | gpu-operator | The namespace where dcgm-exporter (or gpu-operator) was installed |
| runai-operator.config.global.nvidiaDcgmExporter.installedFromGpuOperator | true | Indicates whether dcgm-exporter was installed via gpu-operator |
| kube-prometheus-stack.enabled | true | Set to false when the cluster has an existing Prometheus installation that is not based on the Prometheus operator. This setting requires Run:ai customer support |
| kube-prometheus-stack.prometheusOperator.enabled | true | Set to false when the cluster has an existing Prometheus installation based on the Prometheus operator and Run:ai should use the existing one rather than install a new one |
| prometheus-adapter.enabled | false | Installs the Prometheus Adapter, used for Inference workloads that autoscale on a custom metric. Set to true if the Prometheus Adapter is not already installed in the cluster |
| prometheus-adapter.prometheus | The address of the default Prometheus Service | If you installed your own custom Prometheus Service, set this field to its url and port |
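As an illustration, an edited excerpt of the downloaded values file might look like the following sketch. The flag values chosen here are assumptions for the example; the real file contains many more keys and should be edited in place, not replaced:

```yaml
# Illustrative excerpt of runai-<cluster-name>.yaml.
# Only the keys shown here are changed; keep the rest of the file intact.
runai-operator:
  config:
    project-controller:
      createNamespaces: false   # namespaces will be created manually (see below)
    global:
      runtime: containerd       # e.g. when installing on Tanzu
```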

Prometheus

The Run:ai Cluster installation uses Prometheus. There are 3 alternative configurations:

  1. Run:ai installs Prometheus (default).
  2. Run:ai uses an existing Prometheus installation based on the Prometheus operator.
  3. Run:ai uses an existing Prometheus installation based on a regular Prometheus installation.

For option 2, set the flag kube-prometheus-stack.prometheusOperator.enabled to false. For option 3, please contact Run:ai customer support.

For options 2 and 3, if you have enabled prometheus-adapter, configure it as described in the Prometheus Adapter documentation.
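For option 2, the relevant values-file fragment could look like the sketch below. The Prometheus Service address is an illustrative assumption; substitute the url and port of your own installation:

```yaml
kube-prometheus-stack:
  prometheusOperator:
    enabled: false   # reuse the existing operator-based Prometheus
prometheus-adapter:
  enabled: true      # only if the Prometheus Adapter is not already installed
  prometheus:
    # Illustrative address of an existing Prometheus Service:
    url: http://prometheus.monitoring.svc.cluster.local
    port: 9090
```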

Understanding Custom Access Roles

To review the access roles created by the Run:ai Cluster installation, see Understanding Access Roles.

Manual Creation of Namespaces

Run:ai Projects are implemented as Kubernetes namespaces. By default, the administrator creates a new Project via the Administration user interface, which then triggers the creation of a Kubernetes namespace named runai-<PROJECT-NAME>. There are a couple of use cases in which customers will want to disable this feature:

  • Some organizations prefer to use their internal naming convention for Kubernetes namespaces, rather than Run:ai's default runai-<PROJECT-NAME> convention.
  • Some organizations will not allow Run:ai to automatically create Kubernetes namespaces.

Follow these steps to achieve this:

  1. Disable the namespace creation functionality. See the runai-operator.config.project-controller.createNamespaces flag above.
  2. Create a Project using the Run:ai User Interface.
  3. Create the namespace if needed by running: kubectl create ns <NAMESPACE>. The suggested Run:ai default is runai-<PROJECT-NAME>.
  4. Label the namespace to connect it to the Run:ai Project by running kubectl label ns <NAMESPACE> runai/queue=<PROJECT_NAME>, where <PROJECT_NAME> is the name of the project you have created in the Run:ai user interface above and <NAMESPACE> is the name you chose for your namespace.
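Steps 3 and 4 above can be run as a short command sequence. The Project name my-project and the namespace runai-my-project below are placeholders for illustration:

```shell
# Create the namespace (here using the suggested Run:ai default name):
kubectl create ns runai-my-project

# Connect the namespace to the Run:ai Project created in the user interface:
kubectl label ns runai-my-project runai/queue=my-project
```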

RKE-Specific Setup

Rancher Kubernetes Engine (RKE) requires additional steps:

Certificate Signing Request (RKE1 only)

When creating a cluster on the Run:ai user interface:

  • Download the "On Premise" Kubernetes type.
  • Edit the cluster values file and change useCertManager to true:

init-ca:
  enabled: true
  useCertManager: true

On the cluster, install cert-manager as follows:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
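Once the chart is installed, you can sanity-check that cert-manager is running before continuing; the namespace below matches the install command above:

```shell
# All cert-manager pods should reach the Running state:
kubectl get pods --namespace cert-manager
```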

NGINX (both RKE1 and RKE2)

RKE comes pre-installed with NGINX. Thus, the Run:ai prerequisite for an ingress controller is not needed.

Researcher Authentication

See the RKE and RKE2 tabs in the Researcher Authentication document for how to set up researcher authentication.


Last update: 2023-01-02
Created: 2021-03-31