Upgrading a Cluster Installation

Find the Run:ai cluster version

To find the Run:ai cluster version before and after the upgrade, run:

helm list -n runai -f runai-cluster

and record the chart version, which has the form runai-cluster-<version-number>.
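
To compare the version before and after the upgrade in a script, the chart name can be reduced to its bare version number. The following is a minimal sketch: the helper name chart_version is hypothetical, and the commented usage line assumes that jq is installed and that the release is named runai-cluster.

```shell
# Strip the "runai-cluster-" prefix from a chart name to get the version.
chart_version() {
  printf '%s\n' "${1#runai-cluster-}"
}

# Usage against a live cluster (requires helm and jq):
# chart_version "$(helm list -n runai -f runai-cluster -o json | jq -r '.[0].chart')"
```

Saving the result before and after the upgrade makes it easy to confirm that the chart version actually changed.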

Upgrade Run:ai cluster


kubectl apply -f
helm repo update
helm get values runai-cluster -n runai > values.yaml
helm upgrade runai-cluster runai/runai-cluster -n runai -f values.yaml

Upgrade from version 2.2 or older to version 2.3 or higher

Delete the cluster as described here and perform cluster installation again.

Upgrade from version 2.3 or older to version 2.4 or higher


We recommend working with Run:ai support to perform this upgrade during a planned downtime.

  1. Make sure that no fractional jobs are running. These are jobs that have been assigned part of a GPU, as described here.
  2. Upgrade the cluster as described above.
  3. Change the NVIDIA software installation as described below:

Previous install of the GPU Operator

If you have previously installed the NVIDIA GPU Operator, we recommend uninstalling the Operator and installing it again. Use NVIDIA GPU Operator version 1.9 or higher.

Previous install of NVIDIA software on each node

A previously documented alternative is to install NVIDIA software on each node separately. This method is documented by NVIDIA here and is no longer recommended for Run:ai.

If you used this method, upgrading Run:ai now requires installing the NVIDIA GPU Operator version 1.9 or higher, while taking into account the components that are already installed on your system (typically the NVIDIA drivers and the NVIDIA toolkit). Use the GPU Operator flags: --set driver.enabled=false --set toolkit.enabled=false --set migManager.enabled=false.

After installing or upgrading both gpu-operator and runai-cluster, run:

kubectl rollout restart ds nvidia-device-plugin-daemonset -n gpu-operator
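
The GPU Operator flags and the restart step above can be combined into a small script. This is a sketch under stated assumptions, not the official procedure: the function names and the DRY_RUN switch are hypothetical, the NVIDIA chart repository is assumed to have been added under the alias nvidia, and the release is assumed to be named gpu-operator in the gpu-operator namespace. Pick a chart version of 1.9 or higher with --version as appropriate for your environment.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Install the GPU Operator while skipping the components that are
# already installed on the nodes (drivers, toolkit, MIG manager).
# Assumption: repo alias "nvidia", release and namespace "gpu-operator".
install_gpu_operator() {
  run helm upgrade --install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=false \
    --set migManager.enabled=false
}

# Run only after both gpu-operator and runai-cluster are installed or upgraded.
restart_device_plugin() {
  run kubectl rollout restart ds nvidia-device-plugin-daemonset -n gpu-operator
}
```

Setting DRY_RUN=1 before calling either function prints the command without touching the cluster, which is useful for reviewing the flags during a planned downtime window.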

Verify a successful upgrade

To verify that the upgrade has succeeded, run:

kubectl get pods -n runai

Verify that all pods are running or completed.
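
On a cluster with many pods, the check above can be narrowed to the pods that need attention. The following is a minimal sketch; the helper name unhealthy_pods is hypothetical, and it assumes the default kubectl table layout, in which the pod status is the third column.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the
# name and status of every pod that is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}

# Usage against a live cluster:
# kubectl get pods -n runai --no-headers | unhealthy_pods
```

Empty output means every pod in the namespace is running or completed.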

Last update: May 1, 2022