Troubleshooting Run:ai¶
Installation¶
Upgrade fails with "Ingress already exists"
Symptom: The installation fails with error: Error: rendered manifests contain a resource that already exists. Unable to continue with install: IngressClass "nginx" in namespace "" exists
Root cause: Run:ai installs NGINX, but an NGINX installation already exists on the cluster.
Resolution: In the Run:ai cluster YAML file, disable the installation of NGINX by setting:
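A minimal sketch of the relevant values, assuming the cluster's Helm values file; the exact key may differ between chart versions:

```yaml
# Disable the bundled NGINX ingress controller (key name assumed; check your chart's values)
ingress-nginx:
  enabled: false
```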
How to get installation logs
Symptom: Installation fails and you need to troubleshoot the issue.
Resolution: Collect any relevant installation logs to troubleshoot the error.
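A minimal sketch using kubectl, assuming the default runai installation namespace:

```bash
# Collect events and pod logs from the Run:ai namespace (sketch)
kubectl get events -n runai --sort-by=.metadata.creationTimestamp > runai-events.log
for pod in $(kubectl get pods -n runai -o name); do
  kubectl logs -n runai "$pod" --all-containers > "$(basename "$pod").log" 2>&1
done
```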
Upgrade fails with "rendered manifests contain a resource that already exists" error
Symptom: The installation fails with error: Error: rendered manifests contain a resource that already exists. Unable to continue with install:...
Root cause: The Run:ai installation is trying to create a resource that already exists, which may be due to a previous installation that was not properly removed.
Resolution: Remove all Run:ai resources, for example:
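A minimal sketch, assuming the cluster was installed as a Helm release named runai-cluster in the runai namespace (both names are assumptions; adjust to your installation):

```bash
# Uninstall the Run:ai cluster Helm release and remove its namespace
helm uninstall runai-cluster -n runai
kubectl delete namespace runai
```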
Then reinstall Run:ai.
Pods are failing due to certificate issues
Symptom: Pods are failing with certificate issues.
Root cause: The certificate provided during the Control Plane's installation is not valid.
Resolution: Verify that the certificate is valid and trusted. If the certificate is valid, but is signed by a local CA, make sure you have followed the procedure for a local certificate authority.
Cluster Health¶
See Cluster Health Troubleshooting
Dashboard Issues¶
No Metrics are showing on Dashboard
Symptom: No metrics are showing on dashboards at https://<company-name>.run.ai/dashboards/now
Typical root causes:
- Firewall-related issues.
- Internal clock is not synced.
- Prometheus pods are not running.
Firewall issues
Add verbosity to Prometheus as described here. Verify that there are no errors. If there are connectivity-related errors, you may need to:
- Check your firewall for outbound connections. See the required permitted URL list in Network requirements.
- If you need to set up an internet proxy or certificate, please contact Run:ai customer support.
Machine Clocks are not synced
Run: date on cluster nodes and verify that the date/time is correct. If not:
- Set the Linux time service (NTP); see the sketch after this list.
- Restart Run:ai services. Depending on the previous time gap between servers, you may need to reinstall the Run:ai cluster.
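A minimal sketch for checking and enabling time synchronization on a systemd-based node:

```bash
# Check whether the system clock is synchronized
timedatectl status
# Enable the systemd NTP service if it is off
sudo timedatectl set-ntp true
```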
Prometheus pods are not running
Run: kubectl get pods -n monitoring -o wide
- Verify that all pods are running.
- The default Prometheus installation is not built for high availability. If a node is down, the Prometheus pod may not recover by itself unless manually deleted. Delete the pod to see it start on a different node and consider adding a second replica to Prometheus.
GPU Related metrics not showing
Symptom: GPU-related metrics such as GPU Nodes and Total GPUs show zero, while other metrics, such as Cluster load, are shown.
Root cause: An installation issue related to the NVIDIA stack.
Resolution:
Work through the NVIDIA stack and find the issue. The current NVIDIA stack looks as follows:
- NVIDIA Drivers (at the OS level, on every node)
- NVIDIA Docker (extension to Docker, on every node)
- Kubernetes Node feature discovery (mark node properties)
- NVIDIA GPU Feature discovery (mark nodes as “having GPUs”)
- NVIDIA Device plug-in (Exposes GPUs to Kubernetes)
- NVIDIA DCGM Exporter (Exposes metrics from GPUs in Kubernetes)
Run:ai requires the installation of the NVIDIA GPU Operator which installs the entire stack above. However, there are two alternative methods for using the operator:
- Use the default operator values to install 1 through 6.
- If NVIDIA Drivers (#1 above) are already installed on all nodes, use the operator with a flag that disables drivers install.
For more information see [System requirements](../runai-setup/cluster-setup/).
NVIDIA GPU Operator
Run: kubectl get pods -n gpu-operator | grep nvidia and verify that all pods are running.
Node and GPU feature discovery
Kubernetes Node Feature Discovery identifies and annotates nodes. NVIDIA GPU Feature Discovery identifies and annotates nodes with GPU properties. See that:
- All such pods are up.
- The GPU feature discovery pod is available for every node with a GPU.
- Finally, when describing nodes, they show an active nvidia.com/gpu resource (see the sketch after this list).
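A minimal sketch of that check (the node name is a placeholder):

```bash
# Look for the nvidia.com/gpu resource in the node description
kubectl describe node <node-name> | grep "nvidia.com/gpu"
```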
NVIDIA Drivers
- If NVIDIA drivers have been installed on the nodes themselves, SSH into each node and run nvidia-smi. Run sudo systemctl status docker and verify that Docker is running. Run nvidia-docker and verify that it is installed and working. Linux software upgrades may require a node restart.
- If NVIDIA drivers are installed by the Operator, verify that the NVIDIA driver daemonset has created a pod for each node and that all such pods are running. Review the logs of these pods; a typical problem is a driver version that is too advanced for the GPU hardware. You can set the driver version via operator flags (see the sketch after this list).
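A minimal sketch of those checks, assuming the Operator runs in the gpu-operator namespace (the pod name is a placeholder):

```bash
# On a GPU node, when drivers are installed on the node itself:
nvidia-smi                    # the driver loads and GPUs are visible
sudo systemctl status docker  # the container runtime is running

# When drivers are installed by the GPU Operator:
kubectl get pods -n gpu-operator -o wide | grep nvidia-driver
kubectl logs -n gpu-operator <nvidia-driver-daemonset-pod>
```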
NVIDIA DCGM Exporter
- View the logs of the DCGM exporter pod and verify that no errors are preventing the sending of metrics.
- To validate that the dcgm-exporter exposes metrics, find one of the DCGM Exporter pods and run:
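A minimal sketch, assuming the exporter runs in the gpu-operator namespace (the pod name is a placeholder):

```bash
# Forward the DCGM exporter's metrics port to localhost
kubectl port-forward -n gpu-operator <dcgm-exporter-pod> 9400:9400
```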
Then browse to http://localhost:9400/metrics and verify that the metrics have reached the DCGM exporter.
- The next step after the DCGM Exporter is Prometheus. To validate that metrics from the DCGM Exporter reach Prometheus, run:
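A minimal sketch; the prometheus-operated service name assumes a Prometheus Operator installation and may differ in your setup:

```bash
# Forward the Prometheus UI/API port to localhost
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
```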
Then browse to localhost:9090. In the UI, type DCGM_FI_DEV_GPU_UTIL as the metric name and verify that the metric has reached Prometheus.
If the DCGM Exporter is running correctly and exposing metrics, but this metric does not appear in Prometheus, there may be a connectivity issue between these components.
Allocation-related metrics not showing
Symptom: GPU allocation-related metrics such as Allocated GPUs show zero, while other metrics, such as Cluster load, are shown.
Root cause: These metrics originate from the scheduler, which may not be running correctly.
Resolution:
- Run: kubectl get pods -n runai | grep scheduler and verify that the pod is running.
- Review the scheduler logs and look for errors. If such errors exist, contact Run:ai customer support.
All metrics are showing "No Data"
Symptom: All data on all dashboards is showing the text "No Data".
Root cause: Internal issue with metrics infrastructure.
Resolution: Please contact Run:ai customer support.
Authentication Issues¶
After a successful login, you are redirected to the same login page
For a self-hosted installation, check Linux clock synchronization as described above. Use the Run:ai preinstall diagnostics tool to validate System and network requirements and test this automatically.
Single-sign-on issues
For single-sign-on issues, see the troubleshooting section in the single-sign-on configuration documents.
User Interface Submit Job Issues¶
New Job button is grayed out
Symptom: The New Job button on the top right of the Job list is grayed out.
Root Cause: This can happen due to multiple configuration issues. To identify the issue:
- Open Chrome developer tools and refresh the screen.
- Under Network, locate a network call error and check the HTTP error code.
Resolution for 401 HTTP Error
- Verify that the Cluster certificate provided as part of the installation is valid and trusted (not self-signed).
- Verify that Researcher Authentication has been properly configured. Try running runai login from the command-line interface. Alternatively, run: kubectl get pods -n kube-system, identify the api-server pod, and review its logs.
Resolution for 403 HTTP Error
Run: kubectl get pods -n runai, identify the agent pod, see that it is running, and review its logs.
New Job button is not showing
Symptom: The New Job button on the top right of the Job list does not show.
Root Causes: (multiple)
- You do not have Researcher or Research Manager permissions.
- Under Settings | General, verify that Unified UI is on.
Submit form is distorted
Symptom: Submit form is showing vertical lines.
Root Cause: The control plane does not know the cluster URL.
To verify, open the Clusters list in the Run:ai user interface and check that no cluster URL appears next to your cluster.
Resolution: The cluster must be re-installed.
Submit form does not show the list of Projects
Symptom: When connected with Single-sign-on, in the Submit form, the list of Projects is empty.
Root Cause: SSO is on and researcher authentication is not properly configured as such.
Resolution: Verify API Server settings as described in Researcher Authentication configuration.
Job form is not opening on OpenShift
Symptom: When clicking on "New Job" the Job forms does not load. Network shows 405
Root Cause: An installation step has been missed.
Resolution: Open the Cluster list and open the cluster installation wizard again. After selecting OpenShift, you will see a patch command at the end of the instruction set. Run it.
Networking Issues¶
'admission controller' connectivity issue
Symptoms:
- Pods are failing with 'admission controller' connectivity errors.
- The command-line runai submit fails with an 'admission controller' connectivity error.
- Agent or cluster-sync pods are crashing in a self-hosted installation.
Root cause: Connectivity issues between different nodes in the cluster.
Resolution:
- Run the preinstall diagnostics tool to validate System and network requirements and test connectivity issues.
- Run: kubectl get pods -n kube-system -o wide. Verify that all networking pods are running.
- Run: kubectl get nodes. Check that all nodes are ready and connected.
- Run: kubectl get pods -o wide -A to see which pods are Pending or in Error and which nodes they belong to.
- See if pods from different nodes have trouble communicating with each other.
- Advanced: run kubectl exec -it <pod-name> -- /bin/sh from a pod on one node and ping a pod on another (see the sketch after this list).
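A minimal sketch of that cross-node check (pod names and the IP are placeholders; ping must exist in the container image):

```bash
# Note pod IPs and the nodes they run on
kubectl get pods -o wide -A
# Open a shell in a pod on one node
kubectl exec -it <pod-name> -- /bin/sh
# Inside the pod, ping a pod that runs on a different node
ping <pod-ip-on-another-node>
```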
Projects are not syncing
Symptom: You create a Project on the Run:ai user interface, then run: runai list projects. The new Project does not appear.
Root cause: The Run:ai agent is not syncing properly. This may be due to firewall issues.
Resolution:
- Run: kubectl get pods -n runai | grep agent. See that the agent is in Running state. Select the agent's full name and run: kubectl logs -n runai runai-agent-<id>.
- Verify that there are no errors. If there are connectivity-related errors, you may need to check your firewall for outbound connections. See the required permitted URL list in Network requirements.
- If you need to set up an internet proxy or certificate, please contact Run:ai customer support.
Jobs are not syncing
Symptom: A Job on the cluster (runai list jobs) does not show in the Run:ai user interface Job list.
Root cause: The Run:ai cluster-sync pod is not syncing properly.
Resolution: Search the cluster-sync pod for errors.
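A minimal sketch of locating the pod and reviewing its logs (the runai namespace and pod naming are assumptions):

```bash
# Find the cluster-sync pod and review its logs for errors
kubectl get pods -n runai | grep cluster-sync
kubectl logs -n runai <cluster-sync-pod-name>
```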
Job-related Issues¶
Jobs fail with ContainerCannotRun status
Symptom: When running runai list jobs, your Job has a status of ContainerCannotRun.
Root Cause: The issue may be caused by an unattended upgrade of the NVIDIA driver.
To verify, run: runai describe job <job-name> and search for the error driver/library version mismatch.
Resolution: Reboot the node on which the Job attempted to run.
Going forward, we recommend excluding the NVIDIA driver from unattended upgrades. You can do that by editing /etc/apt/apt.conf.d/50unattended-upgrades and adding nvidia-driver- to the Unattended-Upgrade::Package-Blacklist section. It should look something like this:
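A sketch of the relevant stanza in /etc/apt/apt.conf.d/50unattended-upgrades:

```
// Packages matching these patterns are never upgraded automatically
Unattended-Upgrade::Package-Blacklist {
    "nvidia-driver-";
};
```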
Inference Issues¶
New Deployment button is grayed out
Symptoms:
- The New workload type -> Inference button is grayed out.
- Cannot create a deployment via the Inference API.
Root Cause: Run:ai Inference prerequisites have not been met.
Resolution: Review inference prerequisites and install accordingly.
Submitted workload type of inference remains in Pending state
Symptom: A submitted inference is not running.
Root Cause: The patch statement to add the runai-scheduler has not been performed.
Workload of type inference status is "Failed"
Symptom: Inference status is always Failed.
Root Cause: (multiple)
- Not enough resources in the cluster.
- Server model command is misconfigured (e.g. sleep infinity).
- Server port is misconfigured.
Workload of type inference does not scale up from zero
Symptom: In the Inference form, when "Auto-scaling" is enabled, and "Minimum Replicas" is set to zero, the inference cannot scale up from zero.
Root Cause:
- Clients are not sending requests.
- Clients are not using the same port/protocol as the server model.
- Server model command is misconfigured (e.g. sleep infinity).
Command-line interface Issues¶
Unable to install CLI due to certificate errors
Symptom: The curl command and the download button for downloading the CLI are not working.
Root Cause: The cluster is not accessible from the download location.
Resolution:
Use an alternate method for downloading the CLI. Run:
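A minimal sketch; the researcher-service name and port 4180 are assumptions and may differ in your installation:

```bash
# Forward the CLI download service to localhost
kubectl port-forward -n runai service/researcher-service 4180
```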
In another shell, run:
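The download path below is an assumption; use the path for your operating system:

```bash
# Download the Linux CLI binary through the forwarded port
wget --content-disposition http://localhost:4180/cli/linux
```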
When running the CLI you get an error: open .../.kube/config.lock: permission denied
Symptom: When running any CLI command you get a permission denied error.
Root Cause: The user running the CLI does not have read permissions to the .kube directory.
Resolution: Change the permissions of the directory, for example as in the sketch below.
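A minimal sketch, assuming the default kubeconfig location under the user's home directory:

```bash
# Give the current user ownership of, and read/write access to, the kube config directory
sudo chown -R "$(id -u):$(id -g)" ~/.kube
chmod -R u+rw ~/.kube
```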
When running 'runai logs', the logs are delayed
Symptom: Printout from the container is not immediately shown in the log.
Root Cause: By default, Python buffers stdout and stderr, which are not flushed in real time. This may cause logs to appear minutes after they were written.
Resolution: Set the environment variable PYTHONUNBUFFERED to any non-empty string, or pass -u to Python, e.g. python -u main.py.
CLI does not download properly on OpenShift
Symptom: When trying to download the CLI on OpenShift, the wget statement downloads a text file named darwin or linux rather than the runai binary.
Root Cause: An installation step has been missed.
Resolution: Open the Cluster list and open the cluster installation wizard again. After selecting OpenShift, you will see a patch command at the end of the instruction set. Run it.