# Troubleshooting
## Determining Cluster Health
The following is a set of tests to run to determine cluster health:
### 1. Verify that data is sent to the cloud
Log in to `<company-name>.run.ai/dashboards/now`.
- Verify that all metrics in the Overview dashboard are showing, specifically the list of nodes and the numeric indicators.
- Go to Projects and create a new Project. Find the new Project using the CLI command `runai list projects`.
### 2. Verify that the Run:ai services are running
Run:
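For example, to list the Run:ai pods (the `runai` namespace is the one used by the other commands in this guide):

```
kubectl get pods -n runai
```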
- Verify that all pods are in `Running` status and in a ready state (1/1 or similar).
- Identify the `runai-db-0` and `runai-agent-<id>` pods. Run `kubectl logs -n runai <pod name>` for each and verify that there are no errors.
Run:
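For example, to list the Run:ai deployments in the same namespace:

```
kubectl get deployments -n runai
```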
Check that all deployments are in a ready state (1/1)
Run:
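For example, to list the DaemonSets in the same namespace:

```
kubectl get daemonsets -n runai
```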
Some of the Run:ai DaemonSets run a pod on every node; others run only on nodes that contain GPUs. Verify that, for every DaemonSet, the desired number of pods equals both the current and the ready numbers.
Run:
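For example, each Project is typically backed by a namespace named `runai-<project-name>` (an assumption about the default configuration), so listing namespaces gives a quick check:

```
kubectl get namespaces | grep runai-
```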
Create a Project using the Run:ai user interface and verify that the Project is reflected in the above command.
### 3. Submit a Job
Submitting a Job will allow you to verify that the Run:ai scheduling service is in order.
- Make sure that the Project you have created has a quota of at least 1 GPU
- Run a `runai submit` command to submit a test Job (see the example after this list).
- Verify that the Job is in a Running state by running `runai list jobs`.
- Verify that the Job is showing in the Jobs area at `<company-name>.run.ai/jobs`.
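A minimal submission might look like the following; the image is the Run:ai quickstart sample and `team-a` is a placeholder Project name, and CLI flags can vary slightly between versions:

```
runai submit test-job -i gcr.io/run-ai-demo/quickstart -g 1 -p team-a
```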
## Symptom: Metrics are not showing on Overview Dashboard
Some or all metrics are not showing in `<company-name>.run.ai/dashboards/now`.
Typical root causes:
- NVIDIA prerequisites have not been met.
- Firewall-related issues.
- Internal clock is not synced.
- Prometheus pods are not running.
**NVIDIA prerequisites have not been met**
Run:
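For example, to locate the NVIDIA device-plugin pods (the namespace depends on how the NVIDIA stack was installed, so search all namespaces):

```
kubectl get pods -A | grep -i nvidia
```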
Select one of the NVIDIA pods and run:
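Substituting the pod name and namespace found above:

```
kubectl logs -n <namespace> <nvidia-pod-name>
```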
If the log contains an error, it means that NVIDIA-related prerequisites have not been met. Review NVIDIA prerequisites. Verify that:
- Step 1.1: CUDA Toolkit is installed
- Step 1.2: NVIDIA Docker is installed. A typical issue here is the installation of the NVIDIA Container Toolkit instead of NVIDIA Docker 2.
- Step 1.3: Verify that NVIDIA Docker is the default docker runtime
- If the system has recently been installed, verify that Docker has restarted by running the aforementioned `pkill` command.
- Check the status of Docker (see the command after this list).
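For example, on a systemd-based host:

```
sudo systemctl status docker
```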
**Firewall issues**
Add verbosity to Prometheus by editing RunaiConfig:
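For example (the resource name and namespace below are the usual defaults; verify them first with `kubectl get runaiconfig -A`):

```
kubectl edit runaiconfig runai -n runai
```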
Add a `debug` log level under the Prometheus section of the spec.
Run:
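For example, to view the Prometheus container logs (find the exact pod name with `kubectl get pods -n monitoring`):

```
kubectl logs <prometheus-pod-name> -n monitoring -c prometheus
```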
Verify that there are no errors. If there are connectivity-related errors, you may need to:
- Check your firewall for outbound connections. See the required permitted URL list in: Network requirements.
- If you need to set up an internet proxy or certificate, please contact Run:ai customer support.
**Machine Clocks are not synced**
Run `date` on cluster nodes and verify that the date/time is correct. If not:
- Set the Linux time service (NTP); see the example after this list.
- Restart Run:ai services. Depending on the previous time gap between servers, you may need to reinstall the Run:ai cluster.
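On systemd-based nodes, for example:

```
timedatectl status
sudo timedatectl set-ntp true
```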
**Prometheus pods are not running**
Run: `kubectl get pods -n monitoring -o wide`
- Verify that all pods are running.
- The default Prometheus installation is not built for high availability. If a node goes down, the Prometheus pod will not recover on its own; delete the pod so that it starts on a different node, and consider adding a second replica to Prometheus.
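For example (substitute the pod name from the previous command):

```
kubectl delete pod <prometheus-pod-name> -n monitoring
```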
## Symptom: Projects are not syncing
Create a Project on the Run:ai user interface, then run `runai list projects`. The new Project does not appear.
Typical root cause: The Run:ai agent is not syncing properly. This may be due to:
- A dependency on the internal Run:ai database. See separate symptom below
- Firewall issues
Run:

```
kubectl get pods -n runai | grep agent
```

See if the agent is in a Running state. Select the agent's full pod name and run:

```
kubectl logs -n runai runai-agent-<id>
```
Verify that there are no errors. If there are connectivity-related errors, you may need to:
- Check your firewall for outbound connections. See the required permitted URL list in: Network requirements.
- If you need to set up an internet proxy or certificate, please contact Run:ai customer support.
## Symptom: Cluster Installation failed on Rancher-based Kubernetes (RKE)
Cluster is not installed. When running `kubectl get pods -n runai`, you see that the pod `init-ca` has not started.
**Resolution**
During initialization, Run:ai creates a Certificate Signing Request (CSR), which needs to be approved by the cluster's Certificate Authority (CA). In RKE, this is not enabled by default, and the paths to your Certificate Authority's keypair must be referenced manually by adding the following parameters inside your `cluster.yml` file, under `kube-controller`:
```
services:
  kube-controller:
    extra_args:
      cluster-signing-cert-file: /etc/kubernetes/ssl/kube-ca.pem
      cluster-signing-key-file: /etc/kubernetes/ssl/kube-ca-key.pem
```
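After updating `cluster.yml` and applying the change (for example with `rke up`), you can confirm that new certificate signing requests are being approved and issued:

```
kubectl get csr
```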
For further information see here.
## Symptom: Jobs fail with ContainerCannotRun status
When running `runai list jobs`, your Workload has a status of `ContainerCannotRun`.
**Resolution**
The issue may be caused by an unattended upgrade of the NVIDIA driver.
To verify, run `runai describe job <job-name>` and search for the error `driver/library version mismatch`.
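You can also confirm the mismatch on the node itself; when the loaded kernel module and the user-space NVIDIA libraries diverge, `nvidia-smi` typically fails with:

```
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
```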
To fix: reboot the node on which the Job attempted to run.
Going forward, we recommend blacklisting the NVIDIA driver from unattended upgrades.
You can do that by editing `/etc/apt/apt.conf.d/50unattended-upgrades` and adding `nvidia-driver-` to the `Unattended-Upgrade::Package-Blacklist` section.
It should look something like this:
```
Unattended-Upgrade::Package-Blacklist {
    // The following matches all packages starting with linux-
    // "linux-";
    "nvidia-driver-";
};
```
## Diagnostic Tools
### Adding Verbosity to Database container
Run:
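The exact resource to edit depends on the cluster version; a likely target is the same RunaiConfig resource referenced in the Prometheus section above (name and namespace are assumptions):

```
kubectl edit runaiconfig runai -n runai
```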
Under `spec`, add the verbosity setting for the database.
Then view the log by running:
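For example, using the `runai-db-0` pod identified earlier in this guide:

```
kubectl logs -f -n runai runai-db-0
```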
### Internal Networking Issues
Run:ai is based on Kubernetes. Kubernetes runs its own internal subnet with a separate DNS service. If the logs show that services have trouble connecting to each other, the problem may reside there. You can find further information on how to debug Kubernetes DNS in the Kubernetes documentation. Specifically, it is useful to start a Pod with networking utilities and use it for network resolution:
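One option, borrowed from the Kubernetes DNS debugging documentation, is the `dnsutils` Pod (the manifest below is hosted by the Kubernetes project; adjust if your nodes cannot pull external images):

```
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
```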