Upgrading a Cluster Installation
Find out Run:ai Cluster version
To find the Run:ai cluster version before and after the upgrade, run:
helm list -n runai -f runai-cluster
Then record the chart version, which has the form runai-cluster-<version-number>.
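To record the version before and after the upgrade without copying it by hand, the chart version can be pulled out of the `helm list` output. The following is a minimal sketch, not part of the Run:ai tooling: the helper name `chart_version` is hypothetical, and it assumes the version number starts with a digit.

```shell
# Hypothetical helper: read `helm list` output on stdin and print the
# version part of the first runai-cluster-<version-number> chart entry.
# Assumes the version starts with a digit (for example 2.4.x).
chart_version() {
  grep -o 'runai-cluster-[0-9][^[:space:]]*' | head -n 1 | sed 's/^runai-cluster-//'
}

# Usage (requires access to the cluster):
#   helm list -n runai -f runai-cluster | chart_version > version-before.txt
```

Running the same pipeline again after the upgrade and comparing the two files confirms that the chart version actually changed.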
Upgrade Run:ai cluster
Run:
kubectl apply -f https://raw.githubusercontent.com/run-ai/public/main/runai-crds.yaml
helm repo update
helm get values runai-cluster -n runai > values.yaml
helm upgrade runai-cluster runai/runai-cluster -n runai -f values.yaml
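The commands above can be wrapped so that a failure in any step stops the sequence and a copy of the current values is kept. This is a sketch under assumptions, not an official script: the function name `upgrade_runai_cluster` and the `.bak` backup file are hypothetical, and `-o yaml` assumes Helm 3.

```shell
# Hypothetical wrapper around the upgrade commands; each step runs only if
# the previous one succeeded.
upgrade_runai_cluster() {
  values_file="${1:-values.yaml}"
  kubectl apply -f https://raw.githubusercontent.com/run-ai/public/main/runai-crds.yaml &&
  helm repo update &&
  # Save the values currently in use (Helm 3: -o yaml requests YAML output).
  helm get values runai-cluster -n runai -o yaml > "$values_file" &&
  # Keep a backup copy in case the values file is edited later.
  cp "$values_file" "$values_file.bak" &&
  helm upgrade runai-cluster runai/runai-cluster -n runai -f "$values_file"
}

# Usage:
#   upgrade_runai_cluster values.yaml
```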
Upgrade from version 2.2 or older to version 2.3 or higher
Delete the cluster as described here, then perform the cluster installation again.
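The version rules in this section and the next can be expressed as a small check that reports which procedure applies to a given pair of versions. This is an illustrative sketch only: the function names and the labels it prints are hypothetical, versions are compared on their major.minor part, and it relies on `sort -V` being available.

```shell
# Reduce a version such as 2.3.7 to its major.minor part (2.3).
major_minor() { printf '%s\n' "$1" | cut -d. -f1,2; }

# True when version $1 is lower than or equal to version $2.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: print which procedure on this page applies when
# moving from version $1 to version $2.
upgrade_path() {
  current=$(major_minor "$1")
  target=$(major_minor "$2")
  if version_le "$current" 2.2 && version_le 2.3 "$target"; then
    echo reinstall
  elif version_le "$current" 2.3 && version_le 2.4 "$target"; then
    echo upgrade-with-nvidia-changes
  else
    echo standard-upgrade
  fi
}

# Usage:
#   upgrade_path <current-version> <target-version>
```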
Upgrade from version 2.3 or older to version 2.4 or higher
Note
We recommend working with Run:ai support to perform this upgrade during planned downtime.
- Make sure that no fractional Jobs are running. These are Jobs that have been assigned a fraction of a GPU, as described here.
- Upgrade the cluster as described above.
- Change the NVIDIA software installation as described below:
Previous install of the GPU Operator
If you have previously installed the NVIDIA GPU Operator, we recommend uninstalling it and installing it again. Use NVIDIA GPU Operator version 1.9 or higher.
Previous install of NVIDIA software on each node
An alternative method, previously described, is to install NVIDIA software on each node separately. NVIDIA documents this method here, but it is no longer recommended for Run:ai.
If you used this method, then to upgrade Run:ai you must now install the NVIDIA GPU Operator version 1.9 or higher, taking into account the components already installed on your system (typically the NVIDIA drivers and the NVIDIA toolkit). Use the following GPU Operator flags:
--set driver.enabled=false --set toolkit.enabled=false --set migManager.enabled=false
After the installation/upgrade of both gpu-operator and runai-cluster, run:
kubectl rollout restart ds nvidia-device-plugin-daemonset -n gpu-operator
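For nodes that already carry the NVIDIA drivers and toolkit, the GPU Operator install and the device plugin restart might look like the sketch below. Several details are assumptions rather than Run:ai instructions: the Helm repository URL and the `nvidia/gpu-operator` chart name follow NVIDIA's public install instructions, the function names are hypothetical, and no chart version is pinned, so check that the installed version is 1.9 or higher.

```shell
# Sketch only: install the GPU Operator next to existing node-level
# NVIDIA software, using the flags listed above.
install_gpu_operator_existing_drivers() {
  # Assumed repository URL and chart name (from NVIDIA's instructions).
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia &&
  helm repo update &&
  helm install --wait --generate-name -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false --set toolkit.enabled=false --set migManager.enabled=false
}

# Restart the device plugin, then wait until the rollout has finished.
restart_device_plugin() {
  kubectl rollout restart ds nvidia-device-plugin-daemonset -n gpu-operator &&
  kubectl rollout status ds nvidia-device-plugin-daemonset -n gpu-operator
}

# Usage, once runai-cluster has also been upgraded:
#   install_gpu_operator_existing_drivers && restart_device_plugin
```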
Verify successful installation
To verify that the upgrade has succeeded, run:
kubectl get pods -n runai
Verify that all pods are running or completed.
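On a cluster with many pods, the check can be narrowed to the pods that still need attention. The sketch below is a hypothetical helper, not part of Run:ai: it assumes the default `kubectl get pods` column layout, in which STATUS is the third column.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the name of every pod whose STATUS is neither Running nor Completed.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage:
#   kubectl get pods -n runai --no-headers | not_ready_pods
```

Empty output means every pod in the namespace is running or completed.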