(Optional) Customize Cluster Installation¶
The Run:ai cluster creation wizard requires the download of a Helm values file runai-<cluster-name>.yaml
. The file may be edited to customize the cluster installation.
Configuration Flags¶
Key | Default | Description |
---|---|---|
runai-operator.config.project-controller.createNamespaces | true | Set to false if unwilling to provide Run:ai the ability to create namespaces. When set to false, will requires an additional manual step when creating new Run:ai Projects |
runai-operator.config.project-controller.clusterWideSecret | true | Set to false when using PodSecurityPolicy or OpenShift |
runai-operator.config.mps-server.enabled | false | Set to true to allow the use of NVIDIA MPS. MPS is useful with Inference workloads |
runai-operator.config.global.runtime | docker | Defines the container runtime of the cluster (supports docker and containerd ). Set to containerd when using Tanzu |
runai-operator.config.global.nvidiaDcgmExporter.namespace | gpu-operator | The namespace where dcgm-exporter (or gpu-operator) was installed |
runai-operator.config.global.nvidiaDcgmExporter.installedFromGpuOperator | true | Indicated whether the dcgm-exporter was installed via gpu-operator or not |
kube-prometheus-stack.enabled | true | (Version 2.8 or lower) Set to false when the cluster has an existing Prometheus installation that is not based on the Prometheus operator. This setting requires Run:ai customer support |
kube-prometheus-stack.prometheusOperator.enabled | true | (Version 2.8 or lower) Set to false when the cluster has an existing Prometheus installation based on the Prometheus operator and Run:ai should use the existing one rather than install a new one |
prometheus-adapter.enabled | false | (Version 2.8 or lower) Install Prometheus Adapter. Used for Inference workloads using a custom metric for autoscaling. Set to true if Prometheus Adapter is not already installed in the cluster |
prometheus-adapter.prometheus | The address of the default Prometheus Service | (Version 2.8 or lower) If you installed your own custom Prometheus Service, set this field accordingly with url and port |
Prometheus¶
Not relevant
The Run:ai Cluster installation uses Prometheus. There are 3 alternative configurations:
- Run:ai installs Prometheus (default).
- Run:ai uses an existing Prometheus installation based on the Prometheus operator.
- Run:ai uses an existing Prometheus installation based on a regular Prometheus installation.
For option 2, disable the flag kube-prometheus-stack.prometheusOperator.enabled
. For option 3, please contact Run:ai Customer support.
For options 2 and 3, if you enabled prometheus-adapter
, please configure it as described in the Prometheus Adapter documentation
Understanding Custom Access Roles¶
To review the access roles created by the Run:ai Cluster installation, see Understanding Access Roles.
Manual Creation of Namespaces¶
Run:ai Projects are implemented as Kubernetes namespaces. By default, the administrator creates a new Project via the Administration user interface which then triggers the creation of a Kubernetes namespace named runai-<PROJECT-NAME>
. There are a couple of use cases that customers will want to disable this feature:
- Some organizations prefer to use their internal naming convention for Kubernetes namespaces, rather than Run:ai's default
runai-<PROJECT-NAME>
convention. - Some organizations will not allow Run:ai to automatically create Kubernetes namespaces.
Follow these steps to achieve this:
- Disable the namespace creation functionality. See the
runai-operator.config.project-controller.createNamespaces
flag above. - Create a Project using the Run:ai User Interface.
- Create the namespace if needed by running:
kubectl create ns <NAMESPACE>
. The suggested Run:ai default isrunai-<PROJECT-NAME>
. - Label the namespace to connect it to the Run:ai Project by running
kubectl label ns <NAMESPACE> runai/queue=<PROJECT_NAME>
, where<PROJECT_NAME>
is the name of the project you have created in the Run:ai user interface above and<NAMESPACE>
is the name you chose for your namespace.
RKE-Specific Setup¶
Rancher Kubernetes Engine (RKE) requires additional steps:
Certificate Signing Request (RKE1 only)¶
When creating a cluster on the Run:ai user interface:
- Download the "On Premise" Kubernetes type.
- Edit the cluster values file and change
useCertManager
totrue
On the cluster, install Cert manager as follows:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true
NGINX (both RKE1 and RKE2)¶
RKE comes pre-installed with NGINX. Thus, the Run:ai prerequisite for ingress controller is not needed.
Researcher Authentication¶
See the RKE and RKE2 tabs in the Researcher Authentication document, on how to set up researcher authentication.
Created: 2023-03-26