
Run:ai version 2.13

Version 2.13.0

Release content

This version contains features and fixes from previous versions starting with 2.9. Refer to the prior versions for specific features and fixes.

Projects

  • Improved the Projects UI for ease of use. Projects follows the UI upgrades and changes designed to make setting up components and assets easier for administrators and researchers. To configure a project, see Projects.

Dashboards

  • Added a new dashboard for Quota management, which provides an efficient means to monitor and manage resource utilization within the AI cluster. The dashboard filters the display of resource quotas based on Departments, Projects, and Node pools. For more information, see Quota management dashboard.

  • Added to the Overview dashboard the ability to filter the cluster by one or more node pools. For more information, see Node pools.

Nodes and Node pools

  • The Run:ai scheduler supports two scheduling strategies: Bin Packing (the default) and Spread. For more information, see Scheduling strategies. You can configure the scheduling strategy at the node pool level to improve support for clusters with mixed types of resources and workloads. For configuration information, see Creating new node pools. A conceptual sketch of the two strategies follows this list.

  • GPU device-level DCGM metrics are collected per GPU and presented by Run:ai in the Nodes table. Each node contains a list of its embedded GPUs with their respective DCGM metrics. See DCGM Metrics for the list of metrics provided by NVIDIA DCGM and collected by Run:ai. Contact your Run:ai customer representative to enable this feature.

  • Added per node pool over-quota priority. Over-quota priority sets the relative amount of additional unused resources that an asset can get above its current quota. For more information, see Over-quota priority.
  • Added support for associating workspaces with node pools. The association between workspaces and node pools is done in the Compute resources section. To associate a compute resource with a node pool, in the Compute resource section, press More settings. Press Add new to add more node pools to the configuration. Drag and drop the node pools to set their priority.
  • Added Node pool selection as part of the workload submission form. This allows researchers to quickly determine the list of available node pools and their priority. Priority is set by dragging and dropping the node pools into the desired order. In addition, when the node pool priority list is locked by a policy, the list isn't editable by the Researcher even if the workspace is created from a template or copied from another workspace.
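
As a conceptual illustration of the two scheduling strategies mentioned above (this is not Run:ai scheduler code; node names and GPU capacities are hypothetical), a bin-packing placement fills the most-utilized node that still fits the request, while a spread placement favors the least-utilized node:

```python
# Conceptual sketch only -- not Run:ai scheduler code.
# Hypothetical node pool: node name -> free GPUs.
nodes = {"node-a": 8, "node-b": 8, "node-c": 8}

def bin_pack(free, gpus):
    """Pick the node with the FEWEST free GPUs that still fits the request,
    packing workloads densely and keeping whole nodes available."""
    fits = [n for n, f in free.items() if f >= gpus]
    return min(fits, key=lambda n: free[n]) if fits else None

def spread(free, gpus):
    """Pick the node with the MOST free GPUs, distributing workloads evenly."""
    fits = [n for n, f in free.items() if f >= gpus]
    return max(fits, key=lambda n: free[n]) if fits else None

# Place three 2-GPU workloads with each strategy.
for strategy in (bin_pack, spread):
    free = dict(nodes)
    placements = []
    for _ in range(3):
        node = strategy(free, 2)
        free[node] -= 2
        placements.append(node)
    print(strategy.__name__, placements)
# bin_pack ['node-a', 'node-a', 'node-a']  (dense packing)
# spread   ['node-a', 'node-b', 'node-c']  (even distribution)
```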

Time limit duration

  • Improved the behavior of workload time limits (for example, the Idle time limit) so that a time limit also affects existing workloads created before the limit was configured. This optional feature helps handle situations where researchers leave sessions open even when they no longer need the resources. For more information, see Limit duration of interactive training jobs.

  • Improved workspace time limits. Workspaces that reach a time limit now transition to a Stopped state so that they can be reactivated later.

  • Added time limits for training jobs per project. Administrators (Department Admin, Editor) can limit the duration of Run:ai training jobs per project using a specified time limit value. This capability helps administrators limit the duration of, and resources consumed over time by, training jobs in specific projects. Each training job that reaches this duration is terminated.

Workload assets

  • Extended the collaboration functionality for workload assets such as Environments, Compute resources, and some Data source types. These assets can now be shared with Departments in the organization, in addition to being shared with specific projects or the entire cluster.
  • Added a search box for card galleries in any asset-based workload creation form to provide an easy way to search for assets and resources. To filter, use the asset name or one of the field values of the card.

PVC data sources

  • Added support for PVC block storage in the New data source form. In the New data source form for a new PVC data source, in the Volume mode field, select either Filesystem or Block. For more information, see Create a PVC data source.
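
Under the hood, a block-mode PVC is a standard Kubernetes PersistentVolumeClaim with volumeMode: Block. As a point of reference only, the sketch below uses the official kubernetes Python client to create such a claim; the namespace, storage class, and size are placeholder assumptions, not values dictated by Run:ai.

```python
# Reference sketch of a block-mode PersistentVolumeClaim, created with the
# official kubernetes Python client. Namespace, storage class, and size are
# placeholder assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="block-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        volume_mode="Block",            # "Filesystem" is the other option
        storage_class_name="fast-ssd",  # placeholder storage class
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="my-project-namespace", body=pvc
)
```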

Credentials

  • Added Docker registry to the Credentials menu. Users can create Docker credentials for use in specific projects for image pulling. To configure credentials, see Configuring credentials.

Policies

  • Improved policy support by adding DEFAULTS in the items section of the policy. The DEFAULTS section sets the default behavior for items declared in this section. For example, this can be used to limit the submission of workloads only to existing PVCs. For more information and an example, see Policies, Complex values.
  • Added support for making a PVC data source available to all projects. In the New data source form, when creating a new PVC data source, select All from the Project pane.

Researcher API

Integrations

  • Added support for Ray jobs. Ray is an open-source unified framework for scaling AI and Python applications. For more information, see Integrate Run:ai with Ray. A minimal Ray snippet follows this list.

  • Added integration with Weights & Biases Sweep to allow data scientists to submit hyperparameter optimization workloads directly from the Run:ai UI. To configure a sweep, see Sweep configuration. A minimal sweep sketch also follows this list.

  • Added support for XGBoost. XGBoost, which stands for Extreme Gradient Boosting, is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is the leading machine learning library for regression, classification, and ranking problems. For more information, see runai submit-dist xgboost. A minimal XGBoost example follows this list.
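
For orientation, the following is a minimal, generic Ray snippet (unrelated to the Run:ai submission flow) showing how Ray runs Python functions as parallel remote tasks:

```python
# Minimal, generic Ray example -- not specific to the Run:ai integration.
import ray

ray.init()  # connect to an existing cluster, or start a local one

@ray.remote
def square(x):
    return x * x

# Launch ten tasks in parallel and gather the results.
futures = [square.remote(i) for i in range(10)]
print(ray.get(futures))  # [0, 1, 4, 9, ..., 81]

ray.shutdown()
```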
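
The next sketch shows the general shape of a Weights & Biases sweep when driven directly from the wandb library; the project name, parameter names, and ranges are illustrative assumptions, and the Run:ai UI flow described above does not require writing this code:

```python
# Minimal W&B sweep sketch -- project name, parameters, and ranges are
# illustrative assumptions.
import wandb

sweep_config = {
    "method": "bayes",  # grid, random, or bayes
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
    },
}

def train():
    run = wandb.init()
    # ... train a model using run.config.learning_rate / run.config.batch_size ...
    wandb.log({"loss": 0.1})  # placeholder metric
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="my-project")
wandb.agent(sweep_id, function=train, count=5)  # run five trials
```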
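
Finally, a minimal single-node XGBoost example is shown below to illustrate the library itself; distributed submission on Run:ai is done with runai submit-dist xgboost, and the dataset here is synthetic:

```python
# Minimal single-node XGBoost example on a synthetic dataset. Distributed
# training on Run:ai is submitted with `runai submit-dist xgboost`.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted decision trees for a binary classification task.
model = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```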

Compatibility

Installation

  • The manual process of upgrading Kubernetes CRDs is no longer needed when upgrading to the most recent version (2.13) of Run:ai.
  • From Run:ai 2.12 and above, the control-plane installation has been simplified and no longer requires the creation of a backend values file. Instead, install directly using Helm as described in Install the Run:ai Control Plane.
  • From Run:ai 2.12 and above, the air-gapped control-plane installation generates a custom-env.yaml values file during the preparation stage. This file is used when installing the control plane.

Known issues

Internal ID Description
RUN-11005 Incorrect error messages when trying to run runai CLI commands in an OpenShift environment.
RUN-11009 Incorrect error message when a user without permissions tries to delete another user.

Fixed issues

Internal ID Description
RUN-9039 Fixed an issue in the new job screen where, after the preemptible flag was toggled off and the job was submitted, the job still showed as preemptible.
RUN-9323 Fixed an issue with a non-scalable error message that appeared when scheduling hundreds of nodes was unsuccessful.
RUN-9324 Fixed an issue where the scheduler did not take the amount of storage into consideration, so there was no explanation when a PVC was not ready.
RUN-9902 Fixed an issue in OpenShift environments where there were no metrics in the dashboard because Prometheus did not have permissions to monitor the runai namespace after an installation or upgrade to 2.9.
RUN-9920 Fixed an issue where the canEdit key in a policy was not validated properly for itemized fields when configuring an interactive policy.
RUN-10052 Fixed an issue where loading a new job from a template gave an error until changes were made on the form.
RUN-10053 Fixed an issue where the Node pool column was unsearchable in the job list.
RUN-10422 Fixed an issue where node details showed running workloads that had actually finished (successfully, failed, etc.).
RUN-10500 Fixed an issue where jobs were shown as running even though they did not exist in the cluster.
RUN-10813 Fixed an issue when adding a data source where the path was case-sensitive and did not allow uppercase characters.