Troubleshooting Run:ai

Installation

Upgrade fails with "Ingress already exists"

Symptom: The installation fails with error: Error: rendered manifests contain a resource that already exists. Unable to continue with install: IngressClass "nginx" in namespace "" exists

Root cause: Run:ai installs NGINX, but there is an existing NGINX on the cluster.

Resolution: In the Run:ai cluster YAML file, disable the installation of NGINX by setting:

ingress-nginx:
    enabled: false
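Before changing the setting, it may help to confirm that the conflicting IngressClass really exists on the cluster. The following is a minimal sketch: has_ingress_class is a hypothetical helper name (not part of Run:ai or kubectl) that scans the table printed by kubectl get ingressclass for a given class name.

```shell
# Hypothetical helper: succeed if the named IngressClass appears in the
# output of `kubectl get ingressclass`, read from standard input.
has_ingress_class() {
    # Skip the header line, then compare the NAME column with the requested name.
    awk -v name="$1" 'NR > 1 && $1 == name { found = 1 } END { exit !found }'
}

# Example usage (requires cluster access):
#   kubectl get ingressclass | has_ingress_class nginx && echo "nginx IngressClass exists"
```

If the helper reports that the class exists, the NGINX installation bundled with Run:ai is likely the one colliding with it.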

Dashboard Issues

No Metrics are showing on Dashboard

Symptom: No metrics are showing on dashboards at https://<company-name>.run.ai/dashboards/now

Typical root causes:

  • Firewall-related issues.
  • Internal clock is not synced.
  • Prometheus pods are not running.

Firewall issues

Add verbosity to Prometheus as described here. Verify that there are no errors. If there are connectivity-related errors, you may need to:

  • Check your firewall for outbound connections. See the required permitted URL list in Network requirements.
  • If you need to set up an internet proxy or certificate, please contact Run:ai customer support.

Machine Clocks are not synced

Run: date on cluster nodes and verify that date/time is correct. If not:

  • Set the Linux time service (NTP).
  • Restart Run:ai services. Depending on how large the time gap between servers was, you may need to reinstall the Run:ai cluster.
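Checking the date by hand on every node is tedious on larger clusters. The following is a rough sketch of how the comparison could be scripted. It assumes passwordless SSH access to the nodes, and the helper names and the 5-second threshold are illustrative choices, not Run:ai requirements.

```shell
# Print the absolute difference, in seconds, between two epoch timestamps.
clock_skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Compare this machine's clock with each node passed as an argument.
# Assumes passwordless SSH; the 5-second threshold is arbitrary.
check_node_clocks() {
    for node in "$@"; do
        remote=$(ssh "$node" date +%s) || { echo "$node: unreachable"; continue; }
        skew=$(clock_skew_seconds "$(date +%s)" "$remote")
        if [ "$skew" -gt 5 ]; then
            echo "$node: clock differs by ${skew}s"
        else
            echo "$node: ok"
        fi
    done
}

# Example usage (node names are placeholders):
#   check_node_clocks node-1 node-2 node-3
```

The measured skew includes the SSH round-trip time, so small differences are expected even on well-synchronized nodes.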

Prometheus pods are not running

Run: kubectl get pods -n monitoring -o wide

  • Verify that all pods are running.
  • The default Prometheus installation is not built for high availability. If a node is down, the Prometheus pod may not recover by itself unless manually deleted. Delete the pod to see it start on a different node and consider adding a second replica to Prometheus.
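To spot problem pods quickly, the output of the command above can be filtered. The following is a minimal sketch: not_running_pods is a hypothetical helper that assumes the default kubectl get pods column layout for a single namespace, where STATUS is the third column.

```shell
# Hypothetical helper: print the name and status of every pod that is neither
# Running nor Completed. Expects `kubectl get pods` output on standard input.
not_running_pods() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage (requires cluster access):
#   kubectl get pods -n monitoring -o wide | not_running_pods
```

A pod that stays stuck after a node failure can then be removed with kubectl delete pod <pod-name> -n monitoring so that it is rescheduled on another node.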

GPU-related metrics not showing

Symptom: GPU-related metrics such as GPU Nodes and Total GPUs are showing zero but other metrics, such as Cluster load are shown.

Root cause: An installation issue related to the NVIDIA stack.

Resolution:

Walk through the NVIDIA stack, layer by layer, to find the issue. The current NVIDIA stack looks as follows:

  1. NVIDIA Drivers (at the OS level, on every node)
  2. NVIDIA Docker (extension to Docker, on every node)
  3. Kubernetes Node feature discovery (mark node properties)
  4. NVIDIA GPU Feature discovery (mark nodes as “having GPUs”)
  5. NVIDIA Device plug-in (Exposes GPUs to Kubernetes)
  6. NVIDIA DCGM Exporter (Exposes metrics from GPUs in Kubernetes)

Run:ai requires the installation of the NVIDIA GPU Operator which installs the entire stack above. However, there are two alternative methods for using the operator:

  • Use the default operator values to install 1 through 6.
  • If NVIDIA Drivers (#1 above) are already installed on all nodes, use the operator with a flag that disables driver installation.

For more information see Cluster prerequisites.

NVIDIA GPU Operator

Run: kubectl get pods -n gpu-operator | grep nvidia and verify that all pods are running.

Node and GPU feature discovery

Kubernetes Node feature discovery identifies and annotates nodes. NVIDIA GPU Feature Discovery identifies and annotates nodes with GPU properties. See that:

  • All such pods are up.
  • The GPU feature discovery pod is available for every node with a GPU.
  • When describing nodes (kubectl describe node <node-name>), they show an active nvidia.com/gpu resource.
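The last check can be done for all nodes at once rather than by describing each node. The following sketch assumes GPUs are advertised under the standard nvidia.com/gpu resource name used by the NVIDIA device plug-in. nodes_without_gpus is a hypothetical helper name.

```shell
# Hypothetical helper: print nodes whose allocatable GPU count is missing or zero.
# Expects two columns (node name, GPU count) with a header line on standard input.
nodes_without_gpus() {
    awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Example usage (requires cluster access):
#   kubectl get nodes \
#     -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_without_gpus
```

CPU-only nodes are expected to appear in the output. A node that physically has GPUs but is listed here points to a problem in one of the discovery or device plug-in layers.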

NVIDIA Drivers

  • If NVIDIA drivers have been installed on the nodes themselves, ssh into each node and run nvidia-smi. Run sudo systemctl status docker and verify that Docker is running. Run nvidia-docker and verify that it is installed and working. Linux software upgrades may require a node restart.
  • If NVIDIA drivers are installed by the Operator, verify that the NVIDIA driver daemonset has created a pod for each node and that all of these pods are running. Review the logs of all such pods. A typical problem is a driver version that is too new for the GPU hardware. You can set the driver version via operator flags.

NVIDIA DCGM Exporter

  • View the logs of the DCGM exporter pod and verify that no errors are preventing metrics from being sent.
  • To validate that the dcgm-exporter exposes metrics, find one of the DCGM Exporter pods and run:
kubectl port-forward <dcgm-exporter-pod-name> 9400:9400

Then browse to http://localhost:9400/metrics and verify that the metrics have reached the DCGM exporter.

  • The next step after the DCGM Exporter is Prometheus. To validate that metrics from the DCGM Exporter reach Prometheus, run:
kubectl port-forward svc/runai-cluster-kube-prometh-prometheus -n monitoring 9090:9090

Then browse to localhost:9090. In the UI, type DCGM_FI_DEV_GPU_UTIL as the metric name, and verify that the metric has reached Prometheus.

If the DCGM Exporter is running correctly and exposing metrics, but this metric does not appear in Prometheus, there may be a connectivity issue between these components.
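Both browser checks above can also be run from a terminal while the port-forwards are active. The following is a rough sketch with two hypothetical helpers: one looks for a sample line in the exporter's text output, the other inspects the JSON returned by the Prometheus HTTP query API. The second helper assumes the response is compact JSON with no spaces after colons. A JSON parser would be more robust.

```shell
# Hypothetical helper: succeed if Prometheus text-format metrics on standard
# input contain at least one sample line for the named metric.
has_metric_sample() {
    # Sample lines start with the metric name followed by "{" or a space;
    # "# HELP" and "# TYPE" comment lines do not match.
    grep -q "^$1[{ ]"
}

# Hypothetical helper: succeed if a Prometheus /api/v1/query JSON response on
# standard input reports success and a non-empty result list.
query_has_result() {
    response=$(cat)
    case $response in
        *'"status":"success"'*) ;;
        *) return 1 ;;
    esac
    case $response in
        *'"result":[]'*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example usage (with the port-forwards from the steps above running):
#   curl -s http://localhost:9400/metrics | has_metric_sample DCGM_FI_DEV_GPU_UTIL
#   curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL' | query_has_result
```

If the first check succeeds and the second fails, the metric is being exported but is not reaching Prometheus.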

Allocation-related metrics not showing

Symptom: GPU Allocation-related metrics such as Allocated GPUs are showing zero but other metrics, such as Cluster load are shown.

Root cause: These metrics originate from the scheduler, so the scheduler may not be running properly.

Resolution:

  • Run: kubectl get pods -n runai | grep scheduler. Verify that the pod is running.
  • Review the scheduler logs and look for errors. If such errors exist, contact Run:ai customer support.

All metrics are showing "No Data"

Symptom: All data on all dashboards is showing the text "No Data".

Root cause: Internal issue with metrics infrastructure.

Resolution: Please contact Run:ai customer support.

Authentication Issues

After a successful login, you are redirected to the same login page

For a self-hosted installation, check Linux clock synchronization as described above. Use the Run:ai pre-install script to test this automatically.

Single-sign-on issues

For single-sign-on issues, see the troubleshooting section in the single-sign-on configuration document.

User Interface Submit Job Issues

New Job button is grayed out

Symptom: The New Job button on the top right of the Job list is grayed out.

Root Cause: This can happen due to multiple configuration issues. To identify the issue:

  • Open Chrome developer tools and refresh the screen.
  • Under Network locate a network call error. Search for the HTTP error code.

Resolution for 401 HTTP Error

  • Verify that the cluster certificate provided as part of the installation is valid and trusted (not self-signed).
  • Researcher Authentication may not have been properly configured. Try running runai login from the command-line interface. Alternatively, run: kubectl get pods -n kube-system, identify the api-server pod, and review its logs.

Resolution for 403 HTTP Error

Run: kubectl get pods -n runai, identify the agent pod, see that it's running, and review its logs.

New Job button is not showing

Symptom: The New Job button on the top right of the Job list does not show.

Root Causes: (multiple)

  • You do not have Researcher or Research Manager permissions.
  • Unified UI is turned off. Under Settings | General, verify that Unified UI is on.

Submit form is distorted

Symptom: Submit form is showing vertical lines.

Root Cause: The control plane does not know the cluster URL.

To verify, go to the Clusters list in the Run:ai user interface and check that no cluster URL appears next to your cluster.

Resolution: The cluster must be reinstalled.

Submit form does not show the list of Projects

Symptom: When connected with Single-sign-on, in the Submit form, the list of Projects is empty.

Root Cause: SSO is on, but Researcher Authentication is not properly configured for it.

Resolution: Verify API Server settings as described in Researcher Authentication configuration.

Job form is not opening on OpenShift

Symptom: When clicking on "New Job", the Job form does not load. The browser's Network tab shows HTTP error 405.

Root Cause: An installation step has been missed.

Resolution: Open the Cluster list and open the cluster installation wizard again. After selecting OpenShift, you will see a patch command at the end of the instruction set. Run it.

Networking Issues

'admission controller' connectivity issue

Symptoms:

  • Pods are failing with 'admission controller' connectivity errors.
  • The command-line runai submit fails with an 'admission controller' connectivity error.
  • Agent or cluster sync pods are crashing in self-hosted installation.

Root cause: Connectivity issues between different nodes in the cluster.

Resolution:

  • Run the preinstall script and search for networking errors.
  • Run: kubectl get pods -n kube-system -o wide. Verify that all networking pods are running.
  • Run: kubectl get nodes. Check that all nodes are ready and connected.
  • Run: kubectl get pods -o wide -A to see which pods are Pending or in Error and which nodes they belong to.
  • See if pods from different nodes have trouble communicating with each other.
  • Advanced: run kubectl exec -it <pod-name> -- /bin/sh to open a shell in a pod on one node, then ping a pod on another node.

Projects are not syncing

Symptom: Create a Project on the Run:ai user interface, then run: runai list projects. The new Project does not appear.

Root cause: The Run:ai agent is not syncing properly. This may be due to firewall issues.

Resolution

  • Run: kubectl get pods -n runai | grep agent. Verify that the agent is in the Running state. Copy the agent's full name and run: kubectl logs -n runai runai-agent-<id>.
  • Verify that there are no errors. If there are connectivity-related errors you may need to check your firewall for outbound connections. See the required permitted URL list in Network requirements.
  • If you need to set up an internet proxy or certificate, please contact Run:ai customer support.

Jobs are not syncing

Symptom: A Job on the cluster (runai list jobs) does not show in the Run:ai user interface Job list.

Root cause: The Run:ai cluster-sync pod is not syncing properly.

Resolution: Search the logs of the cluster-sync pod for errors.

Jobs fail with ContainerCannotRun status

Symptom: When running runai list jobs, your Job has a status of ContainerCannotRun.

Root Cause: The issue may be caused by an unattended upgrade of the NVIDIA driver.

To verify, run: runai describe job <job-name>, and search for an error driver/library version mismatch.

Resolution: Reboot the node on which the Job attempted to run.

Going forward, we recommend blacklisting the NVIDIA driver from unattended-upgrades.
You can do that by editing /etc/apt/apt.conf.d/50unattended-upgrades and adding nvidia-driver- to the Unattended-Upgrade::Package-Blacklist section.
It should look something like this:

Unattended-Upgrade::Package-Blacklist {
    // The following matches all packages starting with linux-
    //  "linux-";
    "nvidia-driver-";
};

Inference Issues

New Deployment button is grayed out

Symptoms:

  • The New Deployment button on the top right of the Deployment list is grayed out.
  • Cannot create a deployment via Inference API.

Root Cause: Run:ai Inference prerequisites have not been met.

Resolution: Review inference prerequisites and install accordingly.

New Deployment button is not showing

Symptom: The New Deployment button on the top right of the Deployments list does not show.

Root Cause: You do not have ML Engineer permissions.

Submitted Deployment remains in Pending state

Symptom: A submitted deployment is not running.

Root Cause: The patch statement to add the runai-scheduler has not been performed.

Some Autoscaling metrics are not working

Symptom: Deployments do not autoscale when using metrics other than requests-per-second or concurrency.

Root Cause: The horizontal pod autoscaler prerequisite has not been installed.

Deployment status is "Failed"

Symptom: Deployment status is always Failed.

Root Cause: (multiple)

  • Not enough resources in the cluster.
  • Server model command is misconfigured (e.g., sleep infinity).
  • Server port is misconfigured.

Deployment does not scale up from zero

Symptom: In the Deployment form, when "Auto-scaling" is enabled, and "Minimum Replicas" is set to zero, the deployment cannot scale up from zero.

Root Cause:

  • Clients are not sending requests.
  • Clients are not using the same port/protocol as the server model.
  • Server model command is misconfigured (e.g., sleep infinity).

Command-line interface Issues

Unable to install CLI due to certificate errors

Symptom: The curl command and the download button for downloading the CLI are not working.

Root Cause: The cluster is not accessible from the download location.

Resolution:

Use an alternate method for downloading the CLI. Run:

kubectl port-forward -n runai svc/researcher-service 4180

In another shell, run:

wget --content-disposition http://localhost:4180/cli/linux

When running the CLI you get an error: open .../.kube/config.lock: permission denied

Symptom: When running any CLI command you get a permission denied error.

Root Cause: The user running the CLI does not have read permissions to the .kube directory.

Resolution: Change permissions for the directory.
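As a minimal sketch of what changing the permissions could look like: fix_kube_permissions is a hypothetical helper that grants the current user access to the directory and to the files inside it. It grants write access as well as read access, on the assumption that creating config.lock requires writing to the directory. It also assumes the directory is owned by the user running it.

```shell
# Hypothetical helper: give the current user read/write access to a kubeconfig
# directory (default: ~/.kube) and to the regular files inside it.
fix_kube_permissions() {
    dir=${1:-"$HOME/.kube"}
    # The directory needs read, write, and execute (traverse) for its owner.
    chmod u+rwx "$dir" || return 1
    # Files such as config only need read and write.
    find "$dir" -type f -exec chmod u+rw {} +
}

# Example usage:
#   fix_kube_permissions            # operates on ~/.kube
#   fix_kube_permissions /path/to/other/kube/dir
```

If the directory is owned by a different user (for example, root after running kubectl with sudo), chmod will fail and ownership has to be changed first, for instance with sudo chown -R "$(id -u):$(id -g)" ~/.kube.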

When running 'runai logs', the logs are delayed

Symptom: Printout from the container is not immediately shown in the log.

Root Cause: By default, Python buffers stdout and stderr, so output is not flushed in real time. As a result, log lines may appear minutes after they were written.

Resolution: Set the environment variable PYTHONUNBUFFERED to any non-empty string, or pass -u to Python, e.g., python -u main.py.

CLI does not download properly on OpenShift

Symptom: When trying to download the CLI on OpenShift, the wget statement downloads a text file named darwin or linux rather than the binary runai.

Root Cause: An installation step has been missed.

Resolution: Open the Cluster list and open the cluster installation wizard again. After selecting OpenShift, you will see a patch command at the end of the instruction set. Run it.