Upgrading a Cluster Installation
Find out the Run:ai Cluster version
To find the Run:ai cluster version before and after the upgrade, list the installed Helm releases and record the chart version, which has the form runai-cluster-<version>.
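For example, assuming the cluster chart was installed into the `runai` namespace (adjust the `-n` flag if your installation differs):

```bash
# The CHART column of the output shows the installed chart version,
# e.g. runai-cluster-2.7.14
helm list -n runai
```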
Upgrade Run:ai cluster
Replace <version> with the new version number (e.g. 2.7.14) in the command below, then run it. You can find available Run:ai version numbers by running `helm search repo -l runai-cluster`.
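A sketch of the upgrade command, assuming the release is named `runai-cluster`, installed from a repo aliased `runai` into the `runai` namespace; your release name, repo alias, and namespace may differ:

```bash
# Refresh the local chart index, then upgrade the cluster release
helm repo update
helm upgrade runai-cluster runai/runai-cluster -n runai --version <version>
```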
Upgrade from version 2.3 or older to version 2.4 or higher
We recommend working with Run:ai support to perform this upgrade during planned downtime.
- Make sure you have no fractional jobs running. These are Jobs that have been assigned a fraction of a GPU.
- Upgrade the cluster as described above.
- Change NVIDIA Software installation as described below:
If you have previously installed the NVIDIA GPU Operator, we recommend uninstalling the Operator and installing it again. Use NVIDIA Operator version 1.9 or higher.
An alternative method, described in the past, was to install the NVIDIA software on each node separately. This method is documented by NVIDIA and is no longer recommended for Run:ai.
If you used this method, to upgrade Run:ai you must now install NVIDIA GPU Operator version 1.9 or higher, while taking into account the components that are already installed on your system (typically the NVIDIA drivers and the NVIDIA container toolkit). Use the GPU Operator flags `--set driver.enabled=false --set toolkit.enabled=false`.
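A minimal sketch of such an Operator installation, assuming you use the official NVIDIA Helm chart and the `gpu-operator` namespace; adjust the release name and namespace to match your environment:

```bash
# Add the official NVIDIA Helm repository and install the GPU Operator,
# skipping the driver and container toolkit that are already on the nodes
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false --set toolkit.enabled=false
```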
After the installation/upgrade of both gpu-operator and runai-cluster, run:
```bash
kubectl rollout restart ds nvidia-device-plugin-daemonset -n gpu-operator
```
Upgrade to a Specific Version
To upgrade to a specific version, add `--version <version-number>` to the `helm upgrade` command.
Verify Successful Installation
To verify that the upgrade has succeeded, run:
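A likely verification command, assuming the Run:ai cluster components run in the `runai` namespace:

```bash
# Every pod should report Running or Completed
kubectl get pods -n runai
```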
Verify that all pods are running or completed.