Skip to content

Hotfixes for 2.18

The following is a list of the known and fixed issues for Run:ai V2.18.

Hotfixes

Internal ID Hotfix # Description
RUN-23055 2.18.60 Fixed unified Distributed and Training CLI commands
RUN-23243 2.18.59 Fixed an issue where the scope tree wasn't calculating permissions correctly
RUN-22463 2.18.59 Fixed an error in CLI bash command
RUN-22314 2.18.59 Fixed distributed framework filtering in API commands
RUN-23142 2.18.58 Fixed an issue where advanced GPU metrics per-gpu don't have gpu label
RUN-23001 2.18.58 Fixed an issue of false overcommit on out-of-memory killed in the “swap” feature.
RUN-22851 2.18.58 Fixed an issue where client may get stuck on device lock acquired during “swap” out-migration
RUN-22758 2.18.58 Fixed an issue where inference workload showed wrong status when submission failed.
RUN-22544 2.18.58 Updated Grafana version for security vulnerabilities.
RUN-23055 2.18.57 Fixed the unified Distributed and Training CLI commands
RUN-23014 2.18.56 Fixed an issue where node-scale-adjuster might not create a scaling pod if it is in cool-down and the pod was not updated after that.
RUN-22660 2.18.56 Fixed an issue where workload charts have an unclear state
RUN-22457 2.18.55 Fixed an issue where in rare edge cases the cluster-sync pod was out of memory.
RUN-21825 2.18.55 Fixed all CVEs in Run:ai's Goofys-based image used for S3 integration.
RUN-22871 2.18.55 Fixed an issue in runai-container-toolkit where in certain cases when a process is preempted, OOMKill metrics were not published correctly.
RUN-22250 2.18.55 Fixed an issue where workloads trying to use an ingress URL which is already in use were behaving inconsistentyly instead of failing immediately.
RUN-22880 2.18.55 Fixed an issue where the minAvailable field for training-operator CRDs did not consider all possible replica specs.
RUN-22073 2.18.55 Fixed an issue where runai-operator failed to parse cluster URLs ending with '/'.
RUN-22453 2.18.55 Fixed an issue where in rare edge cases the workload-overseer pod experienced a crash.
RUN-22763 2.18.55 Fixed an issue where in rare edge cases an 'attach' command from CLI-V2 caused a crash in the cluster-api service.
RUN-21948 2.18.49 Fixed an issue where in rare edge cases workload child resources could have duplicate names, causing inconsistent behavior.
RUN-22623 2.18.49 Fixed an issue in Openshift where workloads were not suspended when reaching their idle GPU time limit.
RUN-22600 2.18.49 Fixed an issue in AWS EKS clusters where the V1-CLI returned an empty table when listing all projects as an administrator.
RUN-21878 2.18.49 Added a label to disable container toolkit from running on certain nodes run.ai/container-toolkit-enabled.
RUN-22452 2.18.47 Fixed an issue where the scheduler has signature errors if TopologySpreadConstraints was partially defined.
RUN-22570 2.18.47 Updated git-sync image to version v4.3.0.
RUN-22054 2.18.46 Fixed an issue where users could not attach to jobs.
RUN-22377 2.18.46 Removed uncached client from accessrule-controller.
RUN-21697 2.18.46 Fixed an issue where client may deadlock on suspension during allocation request.
RUN-20073 2.18.45 Fixed an issue where it wasn't possible to authenticate with user credentials in the CLI.
RUN-21957 2.18.45 Fixed an issue where there was a missing username-loader container in inference workloads.
RUN-22276 2.18.39 Fixed an issue where Knative external URL was missing from the Connections modal.
RUN-22280 2.18.39 Fixed an issue when setting scale to zero - there was no pod counter in the Workload grid.
RUN-19811 2.18.39 Added an option to set k8s tolerations to run:ai daemonsets (container-toolkit, runai-device-plugin, mig-parted, node-exporter, etc..) .
RUN-22128 2.18.39 Added GID, UID, Supplemental groups to the V1 CLI.
RUN-21800 2.18.37 Fixed an issue with old workloads residing in the cluster.
RUN-21907 2.18.34 Fixed an issue where the SSO user credentials contain supplementary groups as string instead of int.
RUN-21272 2.18.31 Fixed an issue with multi-cluster credinatils creation, specifically with the same name in different clusters.
RUN-20680 2.18.29 Fixed an issue where workloads page do not present requested GPU.
RUN-21200 2.18.29 Fixed issues with upgrades and connections from v2.13.
RUN-20970 2.18.27 Fixed an issue with PUT APIs.
RUN-20927 2.18.26 Fixed an issue where node affinity was not updated correctly in projects edit.
RUN-20084 2.18.26 Fixed an issue where default department were deleted instead of a message being displayed.
RUN-21062 2.18.26 Fixed issues with the API documentation.
RUN-20434 2.18.25 Fixed an issue when creating a Project/Department with memory resources requires 'units'.
RUN-20923 2.18.25 Fixed an issue with projects/departments page loading slowly.
RUN-19872 2.18.23 Fixed an issue where the Toolkit crashes and fails to create and replace the publishing binaries.
RUN-20861 2.18.22 Fixed an issue where a pod is stuck on pending due to a missing resource reservation pod.
RUN-20842 2.18.22 Fixed an issue of illegal model name with "." in hugging face integration.
RUN-20791 2.18.22 Fix an issue where notifications froze after startup.
RUN-20865 2.18.22 Fixed an issue where default departments are not deleted when a cluster is deleted.
RUN-20698 2.18.21 Fixed an issue where 2 processes requests a device at the same time received the same GPU, causing failures.
RUN-20760 2.18.18 Fixed an issue where workload protection UI shows wrong status.
RUN-20612 2.18.15 Fixed an issue where it was impossible with the use-table-data to hide node pool columns when there is only one default node pool.
RUN-20735 2.18.15 Fixed an issue where nodePool.name is undefined
RUN-20721 2.18.12 Added error handling to nodes pages.
RUN-20578 2.18.10 Fixed an issue regarding policy enforcement.
RUN-20188 2.18.10 Fixed issue with defining SSO in OpenShift identity provider.
RUN-20673 2.18.9 Fixed an issue where a researcher uses a distributed elastic job, it is possible that in a specific flow it is scheduled on more than one node-pools.
RUN-20360 2.18.7 Fixed an issue where the workload network status was misleading.
RUN-22107 2.18.7 Fixed an issue where passwords containing $ were removed from the configuration.
RUN-20510 2.18.5 Fixed an issue with external workloads - argocd workflow failed to be updated.
RUN-20516 2.18.4 Fixed an issue when after deploying to prod, the cluster-service and authorization-service got multiple OOMKilled every ~1 hour.
RUN-20485 2.18.2 Changed policy flags to Beta.
RUN-20005 2.18.1 Fixed an issue where a sidecar container failure failed the workload.
RUN-20169 2.18.1 Fixed an issue allowing the addition of annotations and labels to workload resources.
RUN-20108 2.18.1 Fixed an issue exposing service node ports to workload status.
RUN-20160 2.18.1 Fixed an issue with version display when installing a new cluster in an airgapped environment.
RUN-19874 2.18.1 Fixed an issue when copying and editing a workload with group access to a tool and the group wasn't removed when selecting users option.
RUN-19893 2.18.1 Fixed an issue when using a float number in the scale to zero inactivity value - custom which sometimes caused the submission to fail.
RUN-20087 2.18.1 Fixed an issue where inference graphs should be displayed only for minimum cluster versions.
RUN-10733 2.18.1 Fixed an issue where we needed to minify and obfuscate our code in production.
RUN-19962 2.18.1 Fixed an issue to fix sentry domains regex and map them to relevant projects.
RUN-20104 2.18.1 Fixed an issue where frontend Infinite loop on keycloak causes an error.
RUN-19906 2.18.1 Fixed an issue where inference workload name validation fails with 2.16 cluster.
RUN-19605 2.18.1 Fixed an issue where authorized users should support multiple users (workload-controller) .
RUN-19903 2.18.1 Fixed an issue where inference chatbot creation fails with 2.16 cluster.
RUN-20409 2.18.1 Fixed an issue where clicking on create new compute during the runai model flow did nothing.
RUN-11224 2.18.1 Fixed an issue where ruani-adm collect all logs was not collecting all logs.
RUN-20478 2.18.1 Improved workloads error status in overview panel.
RUN-19850 2.18.1 Fixed an issue where an application administrator could not submit a job with CLI.
RUN-19863 2.18.1 Fixed an issue where department admin received 403 on get tenants and cannot login to UI.
RUN-19904 2.18.1 Fixed an issue when filtering by allocatedGPU in get workloads with operator returns incorrect result.
RUN-19925 2.18.1 Fixed an issue when upgrade from v2.16 to v2.18 failed on worklaods migrations.
RUN-19887 2.18.1 Fixed an issue in the UI when there is a scheduling rule of timeout, the form opened with the rules collapsed and written "none".
RUN-19941 2.18.1 Fixed an issue where completed and failed jobs were shown in view pods in nodes screen.
RUN-19940 2.18.1 Fixed an issue where setting gpu quota failed because the department quota was taken from wrong department.
RUN-19890 2.18.1 Fixed an issue where editing a project by removing its node-affinity stuck updating.
RUN-20120 2.18.1 Fixed an issue where project update fails when there is no cluster version.
RUN-20113 2.18.1 Fixed an issue in the Workloads table where a researcher does not see other workloads once they clear their filters.
RUN-19915 2.18.1 Fixed an issue when turning departments toggles on on cluster v2.11+ the gpu limit is -1 and there is ui error.
RUN-20178 2.18.1 Fixed an issue where dashboard CPU tabs appeared in new overview.
RUN-20247 2.18.1 Fixed an issue where you couldn't create a workload with namespace of a deleted project.
RUN-20138 2.18.1 Fixed an issue where the system failed to create node-type on override-backend env.
RUN-18994 2.18.1 Fixed an issue where some limitations for department administrator are not working as expected.
RUN-19830 2.18.1 Fixed an issue where resources (GPU, CPU, Memory) units were added to k8s events that are published by run:ai scheduler making our messages more readable.

Version 2.18.0 Fixes

Internal ID Description
RUN-20734 Fixed an issue where the enable/disable toggle for the feature was presenting wrong info.
RUN-19895 Fixed an issue of empty state for deleted workloads which is incorrect.
RUN-19507 Fixed an issue in V1 where get APIs are missing required field in swagger leading to omit empty.
RUN-20246 Fixed an issue in Departments v1 org unit where if unrecognizable params are sent, an error is returned.
RUN-19947 Fixed an issue where pending multi-nodepool podgroups got stuck after cluster upgrade.
RUN-20047 Fixed an issue where Workload status shows as "deleting" rather than "deleted" in side panel.
RUN-20163 Fixed an issue when a DV is shared with a department and a new project is added to this dep - no pvc/pv is created.
RUN-20484 Fixed an issue where Create Projects Requests Returned 500 - services is not a valid ResourceType.
RUN-20354 Fixed an issue when deleting a department with projects resulted in projects remaining in environment with the status NotReady.