Install a Cluster

Customize Installation

  • Follow the cluster installation instructions explained here.
  • (Optional) Make the following changes to the configuration file you have downloaded:
Key | Default | Description
--- | --- | ---
pspEnabled | false | Set to true when using PodSecurityPolicy.
runai-operator.config.project-controller.createNamespaces | true | Set to false if you do not want Run:AI to create namespaces, or if you prefer to create namespaces manually rather than use the Run:AI convention of runai-<PROJECT-NAME>. When set to false, creating a new Run:AI Project requires an additional manual step.
runai-operator.config.project-controller.createRoleBindings | true | Automatically assign Users to Projects. Set to false if you do not want Run:AI to set RoleBindings. When set to false, adding or removing users from Projects requires an additional manual step.
runai-operator.config.project-controller.clusterWideSecret | true | Set to false if you do not want Run:AI to create Kubernetes Secrets. When set to false, automatic secret propagation is not available.
runai-operator.config.mps-server.enabled | false | Allow the use of NVIDIA MPS. MPS is useful with Inference workloads. Requires extra cluster permissions.
runai-operator.config.runai-container-toolkit.enabled | true | Controls the usage of Fractions. Requires extra cluster permissions.
runai-operator.config.runaiBackend.password | Default password already set | The admin@run.ai password. Change it only if you have changed the password here.
gpu-feature-discovery.enabled | true | Install Node Feature Discovery. Set to false if already installed in the cluster.
kube-prometheus-stack.enabled | true | Install Prometheus. Set to false if Prometheus is already installed in the cluster.
mpi-operator.enabled | true | Set to false when using PodSecurityPolicy. MPI is the distributed-training operator from KubeFlow. It currently must run with root access.
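
As an alternative to editing the downloaded file directly, overrides can be kept in a separate values file. The sketch below covers the PodSecurityPolicy case from the table (pspEnabled set to true, mpi-operator.enabled set to false). The file name runai-overrides.yaml is hypothetical, and the nesting assumes the usual Helm convention that a dotted key maps to nested YAML keys.

```shell
# Hypothetical file name; any name works.
OVERRIDES=runai-overrides.yaml

# Assumption: each dotted key in the table maps to nested YAML keys.
cat > "$OVERRIDES" <<'EOF'
pspEnabled: true
mpi-operator:
  enabled: false
EOF

# Review the result before passing it to helm.
cat "$OVERRIDES"
```

Helm accepts several -f flags and values from later files take precedence, so adding -f runai-overrides.yaml after -f runai-<cluster-name>.yaml in the install command should apply these settings.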

Feature Discovery

By default, the Run:AI Cluster installation installs two prerequisites: Kubernetes Node Feature Discovery (NFD) and NVIDIA GPU Feature Discovery (GFD).

  • If your Kubernetes cluster already has GFD installed, you will want to set gpu-feature-discovery.enabled to false.
  • NFD is a prerequisite of GFD. If GFD is not installed, but NFD is already installed, you can disable NFD installation by setting gpu-feature-discovery.nfd.deploy to false.
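
The two rules above can be sketched as a small shell helper that prints the matching Helm override. The function name feature_discovery_flag and its yes/no arguments are hypothetical, and passing the key with --set instead of editing the configuration file is an assumption about how you apply it.

```shell
# Prints the Helm --set flag implied by what is already on the cluster.
# Arguments are "yes" or "no": $1 = GFD already installed, $2 = NFD already installed.
feature_discovery_flag() {
  gfd_installed=$1
  nfd_installed=$2
  if [ "$gfd_installed" = "yes" ]; then
    # GFD is present: skip the bundled feature-discovery install.
    echo "--set gpu-feature-discovery.enabled=false"
  elif [ "$nfd_installed" = "yes" ]; then
    # Only NFD is present: still install GFD, but not its NFD dependency.
    echo "--set gpu-feature-discovery.nfd.deploy=false"
  fi
  # Neither is present: keep the defaults, so print nothing.
}

# Example: NFD is installed, GFD is not.
feature_discovery_flag no yes
```

When neither component is installed the helper prints nothing, which leaves the defaults from the table in place.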

Prometheus

The Run:AI Cluster installation installs Prometheus by default. If your Kubernetes cluster already has Prometheus installed, set the flag kube-prometheus-stack.enabled to false.

When using an existing Prometheus installation, you will need to add rules to your Prometheus configuration. The rules can be found under deploy/runai-prometheus-rules.yaml.

NVIDIA Device Plugin

Run:AI has customized the NVIDIA device plugin for Kubernetes. If you have installed the NVIDIA GPU Operator or have previously installed this plugin, run the following to disable the existing plugin:

kubectl -n gpu-operator-resources patch daemonset nvidia-device-plugin-daemonset \
   -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
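
One way to check that the patch took effect is to read the DaemonSet status. The sketch below assumes the same namespace and DaemonSet name as the command above; the function name plugin_desired_pods is hypothetical. Because the patched node selector matches no node, the DaemonSet should report that it wants to schedule zero pods.

```shell
# Assumption: same namespace and DaemonSet name as the patch command above.
NS=gpu-operator-resources
DS=nvidia-device-plugin-daemonset

# Prints the number of nodes the DaemonSet still wants to run a pod on.
plugin_desired_pods() {
  kubectl -n "$NS" get daemonset "$DS" \
    -o jsonpath='{.status.desiredNumberScheduled}'
}
```

Call plugin_desired_pods after patching; a value of 0 suggests the original plugin is no longer scheduled on any node.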

Install Cluster

Run one of the following.

To install from a downloaded chart archive:

helm install runai-cluster -n runai \
  runai-cluster-<version>.tgz -f runai-<cluster-name>.yaml --create-namespace

To install from the Run:AI Helm repository:

helm repo add runai https://run-ai-charts.storage.googleapis.com
helm repo update

helm install runai-cluster runai/runai-cluster -n runai \
    -f runai-<cluster-name>.yaml --create-namespace

Tip

Use the --dry-run flag to see what will be installed before the actual installation. For more details see: Understanding cluster access roles.
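
A minimal sketch of the tip, assuming the chart is installed from the Run:AI Helm repository. The cluster name my-cluster and the function name runai_dry_run are hypothetical placeholders.

```shell
# Hypothetical cluster name; replace with the name used in your configuration file.
CLUSTER_NAME=my-cluster

# Same install command as the repository variant, with --dry-run added
# so that Helm renders the release without installing it.
runai_dry_run() {
  helm install runai-cluster runai/runai-cluster -n runai \
    -f "runai-${CLUSTER_NAME}.yaml" --create-namespace --dry-run
}
```

Calling runai_dry_run prints the rendered manifests for review; running the same command without --dry-run performs the actual installation.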

Next Steps

Continue to create Run:AI Projects.


Last update: August 30, 2021