Run:ai Version 2.8
Node Pools is a new method for managing GPU and CPU resources by grouping the resources into distinct pools. With node pools:
- The administrator allocates Project and Department resources from these pools to be used by Workloads.
- The administrator controls which workloads can use which resources, so that resource utilization can be optimized for the customer's specific mode of operation.
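The grouping idea can be illustrated with a small, self-contained sketch. It is not Run:ai code and calls no Run:ai API: it only shows how nodes that carry a pool label could be bucketed into distinct pools. The label key reuses the runai/node-pool name that appears later in these notes purely for illustration. The node names, pool names, and the "unassigned" bucket are hypothetical.

```python
from collections import defaultdict

# Assumption: nodes are tagged with this label key. The key is borrowed from
# the pod label named in these notes and used here only as an illustration.
NODE_POOL_LABEL = "runai/node-pool"


def group_nodes_by_pool(nodes, fallback_pool="unassigned"):
    """Group node names by the value of their node-pool label.

    `nodes` maps a node name to its label dictionary. Nodes without the
    label are collected under `fallback_pool` (a hypothetical bucket name).
    """
    pools = defaultdict(list)
    for name, labels in nodes.items():
        pools[labels.get(NODE_POOL_LABEL, fallback_pool)].append(name)
    return dict(pools)


# Hypothetical inventory: two GPU nodes, one CPU node, one unlabeled node.
nodes = {
    "node-a": {NODE_POOL_LABEL: "a100-pool"},
    "node-b": {NODE_POOL_LABEL: "a100-pool"},
    "node-c": {NODE_POOL_LABEL: "cpu-pool"},
    "node-d": {},
}

pools = group_nodes_by_pool(nodes)
for pool_name, members in pools.items():
    print(pool_name, members)
```

Once nodes are bucketed this way, per-pool quotas for Projects and Departments can be reasoned about one pool at a time.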
User Interface Enhancements
- The Departments screen has been revamped with new functionality, including a new, cleaner look and feel and improved search and filtering capabilities.
- The Jobs screen has been split into two tabs for ease of use:
  - Current (the default tab): lists all the jobs that currently exist in the cluster.
  - History: lists all the jobs that have been deleted from the cluster. Deleting Jobs also deletes their Log (no change).
The Run:ai user interface requires a URL address to the Kubernetes cluster. This requirement applies to SaaS installations only.
In previous versions of Run:ai, the administrator provided an IP address, and Run:ai automatically created a DNS entry for it and a matching trusted certificate.
In version 2.8, the default is for the Run:ai administrator to provide a DNS name and a trusted certificate.
The older option still exists but is being deprecated due to its complexity.
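Before handing a DNS name and certificate to the installation, it can help to confirm that the name resolves and that the certificate is trusted. The following is a minimal sketch using only the Python standard library. It is not part of Run:ai. The function name, the default port 443, and the result keys are assumptions made for this example. "Trusted" here means trusted by the certificate store of the machine running the script.

```python
import socket
import ssl


def check_cluster_url(host, port=443, timeout=5.0):
    """Check that `host` resolves in DNS and presents a trusted certificate.

    Returns a dict with the keys "resolves", "trusted_cert" and "error".
    """
    result = {"resolves": False, "trusted_cert": False, "error": None}

    # Step 1: DNS resolution.
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result
    result["resolves"] = True

    # Step 2: TLS handshake with certificate and hostname verification,
    # both of which are enabled by the default context.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                result["trusted_cert"] = True
    except ssl.SSLCertVerificationError as exc:
        result["error"] = f"certificate not trusted: {exc}"
    except OSError as exc:
        result["error"] = f"connection failed: {exc}"
    return result
```

Calling `check_cluster_url` with the cluster's DNS name returns both flags as `True` only when the name resolves and the TLS handshake passes verification. In any other case, the "error" key describes the first failure encountered.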
The Deployment details page now contains the URL for the inference service.
Hyperparameter Optimization (HPO)
HPO Jobs are now presented as a single line in the Job List rather than a separate line per experiment.
| Description | Workaround |
|---|---|
| The Run:ai access control system prevents setting a role of researcher together with ML engineer or researcher manager at the same time. However, using the UI you can select these two roles by clicking the text near the checkbox. | |
| When installing Run:ai on OpenShift a second time, the OAuth client secret is incorrect or not updated. As a result, login is not possible. | Can be performed via manual configuration. Please contact Run:ai support. |
| In the multi-cluster overview, the allocated GPU figure in the table of each cluster is wrong. The correct number is in the overview dashboard. | |
| When deleting a cluster, there are leftover pods that are not deleted. There are no side effects on functionality. | Delete the pods manually. |
| (SaaS version only) The new control plane (version 2.8) does not allow the creation of a new deployment on a cluster whose version is lower than 2.8. | Upgrade your cluster to 2.8. |
| It is possible to change the runai/node-pool label of a running pod. This is incorrect usage of the system and may cause unexpected behavior. | |
| The idle allocated GPU metric is not displayed for MIG workloads in OpenShift. | |
| When selecting a Job, the GPU memory utilization metrics are not displayed on the right-hand side. This is a known NVIDIA DCGM bug (see: https://github.com/NVIDIA/dcgm-exporter/issues/103 ) which has been fixed in a later version but was not yet included in the latest NVIDIA GPU Operator. | Install the suggested version as described by NVIDIA. |
| Dashboard panels of GPU Allocation/project and Allocated jobs per project metrics: in rare cases, some metrics reflect the wrong number of GPUs. | |
| The Dynamic MIG feature does not work with NVIDIA A100 GPUs with 80GB of memory. | |
| When a workload is selected in the Job list, the GPU tab in the right panel shows the details of all the GPUs in the node instead of only the GPUs used by the workload. | |
| On rare occasions, when there is more than one NVIDIA MIG workload, running the nvidia-smi command in one of the workloads returns no devices. | |
| In rare cases, when using fractions and the kubelet service on the scheduled node is down (Kubernetes not running on the node), the pending workload will never run, even after the IT problem is solved. | Delete the job and re-submit the workload. |
| Requested GPUs are sometimes displayed in the Job list as 0 for distributed workloads. | None. This is a display-only issue. |
| On EKS (Amazon Elastic Kubernetes Service), when using the runai CLI, every command response starts with an error. Functionality is not affected. | None. The CLI functions as expected. |
| When Interactive Jupyter notebook workloads that contain passwords are cloned, the password is exposed in the displayed CLI command. | |
| When using the HOME environment variable in conjunction with the run-as-user option in the CLI, the HOME environment variable is overwritten with the user's home directory. | |
| It is possible to submit two jobs with the same node-port. | |
| When you apply an inference deployment via a file, the allocated GPUs are displayed as 0 in the deployments list. | |
| When workloads are deleted while the cluster synchronization is down, a non-existent Job might be shown in the user interface. The Job cannot be deleted. | |
| In some situations, when a Job is deleted, there may be leftover Kubernetes ConfigMaps in the system. | |
| In some cases, a "failed to load data" error can be seen in the graphs shown on the Job sidebar. | |
| The default Kubernetes priority class for deployments is the same as the priority class for interactive jobs. | |
| In some scenarios, Dashboards may show a "found duplicate series for the match group" error. | |
| The scheduler wrongly tries to schedule jobs on a node that has allocated GPU jobs in an "ImagePullBackOff" state. This causes an "UnexpectedAdmissionError" error. | |
| The role "Researcher Manager" is not displayed in the access control list of projects. | |
| Users trying to log in with single sign-on get a "review profile" page. | |
| A single HPO (hyperparameter optimization) workload is displayed in the Job list user interface as multiple jobs (one for every pod). | |