(Optional) Customize Cluster Installation

The Run:ai cluster creation wizard requires downloading a Helm values file, runai-<cluster-name>.yaml. This file may be edited to customize the cluster installation.

Configuration Flags

| Key | Default | Description |
|-----|---------|-------------|
| pspEnabled | false | Set to true when using PodSecurityPolicy |
| ingress-nginx.podSecurityPolicy.enabled | | Set to true when using PodSecurityPolicy |
| runai-operator.config.project-controller.createNamespaces | true | Set to false if unwilling to provide Run:ai the ability to create namespaces. When set to false, creating new Run:ai Projects requires an additional manual step |
| runai-operator.config.project-controller.createRoleBindings | true | Set to false when using OpenShift. When set to false, assigning users to Run:ai Projects requires an additional manual step |
| runai-operator.config.project-controller.clusterWideSecret | true | Set to false when using PodSecurityPolicy or OpenShift |
| runai-operator.config.mps-server.enabled | false | Set to true to allow the use of NVIDIA MPS. MPS is useful with inference workloads |
| runai-operator.config.runai-container-toolkit.enabled | true | Controls the usage of Fractions |
| | docker | Defines the container runtime of the cluster (supports docker and containerd). Set to containerd when using Tanzu |
| runai-operator.config.nvidiaDcgmExporter.namespace | gpu-operator | The namespace where dcgm-exporter (or gpu-operator) was installed |
| runai-operator.config.nvidiaDcgmExporter.installedFromGpuOperator | true | Indicates whether dcgm-exporter was installed via gpu-operator |
| gpu-feature-discovery.enabled | true | Set to false to not install GPU Feature Discovery (assumes a prior install outside the Run:ai scope). Flag is only relevant to Run:ai version 2.4 or lower |
| kube-prometheus-stack.enabled | true | Set to false when the cluster has an existing Prometheus installation that is not based on the Prometheus operator. This setting requires Run:ai customer support |
| kube-prometheus-stack.prometheusOperator.enabled | true | Set to false when the cluster has an existing Prometheus installation based on the Prometheus operator and Run:ai should use the existing one rather than install a new one |
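As an illustration, several of the flags above could be set in the downloaded values file as follows. This is a minimal sketch: only the keys shown are changed, and the values chosen (PodSecurityPolicy in use, manual namespace creation, MPS enabled) are examples, not recommendations.

```yaml
# Excerpt of runai-<cluster-name>.yaml (illustrative).
# Keep all other downloaded values as-is; change only the keys you need.
pspEnabled: true                # cluster uses PodSecurityPolicy
ingress-nginx:
  podSecurityPolicy:
    enabled: true               # cluster uses PodSecurityPolicy
runai-operator:
  config:
    project-controller:
      createNamespaces: false   # namespaces will be created manually
    mps-server:
      enabled: true             # allow NVIDIA MPS for inference workloads
```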

Feature Discovery

Note: This section is relevant only to Run:ai version 2.4 or lower.

By default, the Run:ai cluster installation installs two prerequisites: Kubernetes Node Feature Discovery (NFD) and NVIDIA GPU Feature Discovery (GFD).

  • If your Kubernetes cluster already has GFD installed, you will want to set gpu-feature-discovery.enabled to false.
  • NFD is a prerequisite of GFD. If GFD is not installed, but NFD is already installed, you can disable NFD installation by setting gpu-feature-discovery.nfd.deploy to false.
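In values-file terms, the two cases above translate to the following excerpt (a sketch; set only the keys that apply to your cluster):

```yaml
# Excerpt of runai-<cluster-name>.yaml

# Case 1: GFD is already installed in the cluster.
gpu-feature-discovery:
  enabled: false

# Case 2: GFD is not installed, but NFD already is.
# gpu-feature-discovery:
#   nfd:
#     deploy: false
```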


Prometheus

The Run:ai cluster installation uses Prometheus. There are three alternative configurations:

  1. (The default) Run:ai installs Prometheus.
  2. Run:ai uses an existing Prometheus installation based on the Prometheus operator.
  3. Run:ai uses an existing Prometheus installation based on a regular Prometheus installation.

For option 2, set the kube-prometheus-stack.prometheusOperator.enabled flag to false. For option 3, contact Run:ai customer support.
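For option 2, the corresponding values-file change is (sketch):

```yaml
# Excerpt of runai-<cluster-name>.yaml: reuse an existing
# Prometheus-operator-based installation (option 2 above).
kube-prometheus-stack:
  prometheusOperator:
    enabled: false
```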

Understanding Custom Access Roles

To review the access roles created by the Run:ai cluster installation, see Understanding Access Roles.

Manual Creation of Namespaces

Run:ai Projects are implemented as Kubernetes namespaces. By default, the administrator creates a new Project via the Administration user interface, which then triggers the creation of a Kubernetes namespace named runai-<PROJECT-NAME>. There are a couple of use cases in which customers will want to disable this feature:

  • Some organizations prefer to use their internal naming convention for Kubernetes namespaces, rather than Run:ai's default runai-<PROJECT-NAME> convention.
  • When PodSecurityPolicy is enabled, some organizations will not allow Run:ai to automatically create Kubernetes namespaces.

To achieve this, follow this process:

  1. Disable the namespace creation functionality. See the runai-operator.config.project-controller.createNamespaces flag above.
  2. Create a Project using the Run:ai User Interface.
  3. Create the namespace if needed by running: kubectl create ns <NAMESPACE>. The suggested Run:ai default is runai-<PROJECT-NAME>.
  4. Label the namespace to connect it to the Run:ai Project by running kubectl label ns <NAMESPACE> runai/queue=<PROJECT_NAME>

where <PROJECT_NAME> is the name of the project you created in the Run:ai user interface above, and <NAMESPACE> is the name you chose for your namespace.
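Steps 3 and 4 above can be sketched as a short script. Here "team-a" is a hypothetical project name; substitute the project you created in the user interface. The script prints each command before it would run, so it can be reviewed with the actual kubectl calls left commented out:

```shell
# Hypothetical project name created in the Run:ai UI.
PROJECT_NAME="team-a"
# The suggested Run:ai default namespace name; any internal
# naming convention also works.
NAMESPACE="runai-${PROJECT_NAME}"

# Print each command for review before applying it.
for cmd in \
  "kubectl create ns ${NAMESPACE}" \
  "kubectl label ns ${NAMESPACE} runai/queue=${PROJECT_NAME}"
do
  echo "+ ${cmd}"
  # Uncomment to apply against your cluster:
  # ${cmd}
done
```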

Last update: April 17, 2022