Troubleshooting

Determining Cluster Health

The following is a set of tests to run to determine cluster health:

1. Verify that data is sent to the cloud

Log in to https://app.run.ai/dashboards/now

  • Verify that all metrics in the Overview dashboard are showing, specifically the list of nodes and the numeric indicators.
  • Go to Projects and create a new Project. Find the new Project using the CLI command:

runai list projects

2. Verify that the Run:AI services are running

Run:

kubectl get pods -n runai

Verify that all pods are in Running status.
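
To surface only problematic pods, a minimal sketch (note that pods in a Succeeded phase, such as completed jobs, are also listed by this filter):

# List pods in the runai namespace whose phase is not Running
kubectl get pods -n runai --field-selector=status.phase!=Running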

Run:

kubectl get deployments -n runai

Check that all deployments are in a ready state (1/1).
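
As a sketch, you can also compare the desired and ready replica counts directly (the readyReplicas field is omitted from the output when it is zero):

# Print each deployment with its desired vs. ready replica counts
kubectl get deployments -n runai -o jsonpath='{range .items[*]}{.metadata.name}{": desired="}{.spec.replicas}{" ready="}{.status.readyReplicas}{"\n"}{end}'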

Run:

kubectl get daemonset -n runai

A DaemonSet runs a pod on every node. Some of the Run:AI DaemonSets run on all nodes; others run only on nodes that contain GPUs. Verify that, for every DaemonSet, the desired number is equal to both current and ready.
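
One way to check this programmatically, as a sketch using the standard DaemonSet status fields:

# Print desired, current, and ready pod counts for each Run:AI DaemonSet
kubectl get daemonset -n runai -o jsonpath='{range .items[*]}{.metadata.name}{": desired="}{.status.desiredNumberScheduled}{" current="}{.status.currentNumberScheduled}{" ready="}{.status.numberReady}{"\n"}{end}'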

3. Submit a Job

Submitting a Job allows you to verify that the Run:AI scheduling service is working correctly.

  • Make sure that the Project you have created has a quota of at least 1 GPU
  • Run:
runai config project <project-name>
runai submit job1 -i gcr.io/run-ai-demo/quickstart -g 1
  • Verify that the Job is in a Running state by running:
runai list jobs
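
Once you have confirmed that the Job runs, you can remove it. As a sketch, using the CLI's delete command (the exact syntax may vary between Run:AI CLI versions):

# Clean up the test Job
runai delete job1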

Symptom: Metrics are not showing on Overview Dashboard

Some or all metrics are not showing in https://app.run.ai/dashboards/now

Typical root causes:

  • NVIDIA prerequisites have not been met.
  • Firewall-related issues.
  • Internal clock is not synced.

NVIDIA prerequisites have not been met

Run:

runai pods -n runai | grep nvidia

Select one of the NVIDIA pods and run:

kubectl logs -n runai nvidia-device-plugin-daemonset-<id>

If the log contains an error, it means that NVIDIA-related prerequisites have not been met. Review step 1 in NVIDIA prerequisites. Verify that:

  • Step 1.1: CUDA Toolkit is installed
  • Step 1.2: NVIDIA Docker is installed. A typical issue here is the installation of the NVIDIA Container Toolkit instead of NVIDIA Docker 2.
  • Step 1.3: Verify that NVIDIA Docker is the default docker runtime
  • If the system has recently been installed, verify that Docker has been restarted so that the default-runtime change takes effect.
  • Check the status of Docker by running:
sudo systemctl status docker
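
If the service is active, you can also confirm that the NVIDIA runtime is registered and set as the default (a quick sketch; the output format varies between Docker versions):

# Show the configured container runtimes and the default runtime
docker info | grep -i runtime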

Firewall issues

Add verbosity to Prometheus by editing RunaiConfig:

kubectl edit runaiconfig runai -n runai

Add a debug log level:

prometheus-operator:
  prometheus:
    prometheusSpec:
      logLevel: debug

Run:

kubectl logs prometheus-runai-prometheus-operator-prometheus-0 prometheus \
      -n runai -f --tail 100

Verify that there are no errors. If there are connectivity-related errors, you may need to check your firewall rules and verify that outbound traffic from the cluster to app.run.ai is allowed.
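
A quick connectivity sketch, assuming the cloud endpoint is app.run.ai over HTTPS, run from a cluster node:

# Verify that the node can reach the Run:AI cloud endpoint
curl -v --max-time 10 https://app.run.ai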

Machine Clocks are not synced

Run the date command on all cluster nodes and verify that the date/time is correct. If it is not:

  • Set the Linux time service (NTP).
  • Restart the Run:AI services. Depending on how large the time gap between servers was, you may need to reinstall the Run:AI cluster.
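
On systemd-based distributions, a minimal sketch for checking and enabling time synchronization:

# Check whether the system clock is synchronized
timedatectl status
# Enable NTP-based time synchronization
sudo timedatectl set-ntp true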

Symptom: Projects are not syncing

Create a Project in the Admin UI, then run runai list projects. The new Project does not appear in the output.

Typical root cause: The Run:AI agent is not syncing properly. This may be due to:

  • A dependency on the internal Run:AI database. See the separate symptom below.
  • Firewall issues

Run:

  runai pods -n runai | grep agent

Verify that the agent pod is in a Running state. Copy the agent's full pod name and run:

  kubectl logs -n runai runai-agent-<id>

Verify that there are no errors. If there are connectivity-related errors, then, as with the metrics symptom above, you may need to check your firewall rules and verify that outbound traffic to app.run.ai is allowed.
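
To narrow down the problem, a simple sketch that filters the agent log for errors (substitute the actual pod id):

# Filter the agent log for error lines
kubectl logs -n runai runai-agent-<id> | grep -i error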

Symptom: Internal Database has not started

Run:

runai pods -n runai | grep runai-db-0

The status of the Run:AI database pod is not Running.

Typical root causes:

  • More than one default storage class is installed
  • Incompatible NFS version

More than one default storage class is installed

The Run:AI cluster installation includes, by default, a storage class named local-path-provisioner, which is installed as the default storage class. In some cases, your Kubernetes cluster may already have a default storage class installed. In such cases, you should disable the local path provisioner, because having two default storage classes disables both the internal database and some of the metrics.

Run:

  kubectl get storageclass

Look for more than one storage class marked as (default).
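
Default classes are flagged in the kubectl output. As a sketch, you can also check the underlying annotation directly:

# Print each storage class and whether it is annotated as the default
kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{": default="}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}'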

Run:

  kubectl describe pod -n runai runai-db-0

Verify that a storage-class-related error indeed appears in the pod's events.

To disable local path provisioner, run:

  kubectl edit runaiconfig runai -n runai

Add the following lines under spec:

local-path-provisioner:
  enabled: false

Incompatible NFS version

The default NFS protocol version is currently 4. If your NFS server requires an older version, you may need to add a mount option, as follows. Run:

kubectl edit runaiconfig runai -n runai

Add mountOptions as follows:

nfs-client-provisioner:
  nfs: 
    mountOptions: ["nfsvers=3"]
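
After the change takes effect, you can confirm the protocol version actually negotiated on a node (a sketch; requires the NFS client utilities to be installed):

# Show mounted NFS shares and their negotiated options, including vers=
nfsstat -m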

Symptom: Cluster Installation failed on Rancher-based Kubernetes (RKE)

The cluster does not finish installing. When running kubectl get pods -n runai you see that the pod init-ca has not started.

Resolution

During initialization, Run:AI creates a Certificate Signing Request (CSR) which needs to be approved by the cluster's Certificate Authority (CA). In RKE, this is not enabled by default, and the paths to your Certificate Authority's keypair must be referenced manually by adding the following parameters inside your cluster.yml file, under kube-controller:

kube-controller:
 extra_args:
  cluster-signing-cert-file: /etc/kubernetes/ssl/kube-ca.pem
  cluster-signing-key-file: /etc/kubernetes/ssl/kube-ca-key.pem
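
After editing cluster.yml, the change is applied by re-running RKE provisioning, for example:

# Re-apply the cluster configuration (run from the directory containing cluster.yml)
rke up --config cluster.yml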

For further information, see the Rancher RKE documentation.

Diagnostic Tools

Adding Verbosity to Database container

Run:

kubectl edit runaiconfig runai -n runai

Under spec, add:

spec:
  postgresql:
    image:
      debug: true

Then view the log by running:

kubectl logs -n runai runai-db-0

Internal Networking Issues

Run:AI is based on Kubernetes. Kubernetes runs its own internal subnet with a separate DNS service. If you see in the logs that services have trouble connecting, the problem may reside there. You can find further information on how to debug Kubernetes DNS in the Kubernetes documentation. Specifically, it is useful to start a Pod with networking utilities and use it for network resolution:

kubectl run -i --tty netutils --image=dersimn/netutils -- bash
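
From within that Pod you can then test service resolution; kubernetes.default always exists, so it makes a good first check:

# Inside the netutils pod: verify that cluster DNS resolves the API server service
nslookup kubernetes.default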
