
Troubleshooting a Cluster Installation

Make sure only one default storage class is installed

The Run:AI cluster installation includes, by default, a storage class named local path provisioner, which is installed as the default storage class. If your Kubernetes cluster already has a default storage class, you should disable the local path provisioner. Having two default storage classes prevents both the internal database and some of the metrics from working.
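To check whether the cluster has more than one default storage class, you can count the entries that `kubectl get storageclass` marks with `(default)`. The following is a minimal sketch; the helper name `count_default_storage_classes` is hypothetical and not part of Run:AI or kubectl.

```shell
# Count the storage classes that kubectl marks as "(default)".
# Reads the table printed by `kubectl get storageclass` on stdin.
# NOTE: the function name is a hypothetical helper for illustration.
count_default_storage_classes() {
  # Skip the header row; kubectl prints the "(default)" marker
  # right after the storage class name, so it is the second field.
  awk 'NR > 1 && $2 == "(default)" { n++ } END { print n + 0 }'
}

# Example (requires access to the cluster):
#   kubectl get storageclass | count_default_storage_classes
```

A result greater than 1 suggests that two default storage classes are installed.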

To disable the local path provisioner, run:

  kubectl edit runaiconfig -n runai

Add the following lines under spec:

  local-path-provisioner:
    enabled: false

To diagnose this symptom, run:

  kubectl describe pod -n runai runai-db-0

If two default storage classes are the cause, the output shows a storage class error.

Pods are not created

Run:

kubectl get pods -n runai

This lists the pods in the runai namespace. All pods should have the status Running. A pod can be in a different status for various reasons. Most notably, pods use Run:AI images that are downloaded from the web. If your company employs an outbound firewall, you may need to open the URLs and ports mentioned in Network Requirements.
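In a namespace with many pods, it can be easier to filter the list down to the pods that are not Running. The following is a minimal sketch; the helper name `list_non_running_pods` is hypothetical, and it assumes the default `kubectl get pods` table layout, where STATUS is the third column.

```shell
# Print the name and status of every pod whose STATUS is not "Running".
# Reads the table printed by `kubectl get pods` on stdin.
# NOTE: the function name is a hypothetical helper for illustration.
list_non_running_pods() {
  # Default columns: NAME READY STATUS RESTARTS AGE; skip the header row.
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Example (requires access to the cluster):
#   kubectl get pods -n runai | list_non_running_pods
```

For any pod it prints, `kubectl describe pod -n runai <name>` shows the events behind the status, such as a failed image pull.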

In the Admin portal (app.run.ai), metrics such as "number of GPUs" and "GPU utilization" do not show

This typically means that there is a disconnect between the Kubernetes pods that require access to GPUs and the NVIDIA software.

This could happen if:

To verify whether GPU metrics are exported, run:

kubectl port-forward -n runai prometheus-runai-prometheus-operator-prometheus-0 9090

Then, in your browser, go to http://localhost:9090/ and verify that you see metrics for dcgm_gpu_utilization.
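The same check can be done from a terminal through the Prometheus HTTP query API, while the port-forward above is still running. The following is a minimal sketch; the helper name `has_metric_samples` is hypothetical, and it assumes Prometheus returns compact JSON, where an empty result appears as `"result":[]`.

```shell
# Succeed (exit status 0) if a Prometheus instant-query response
# contains at least one sample. Reads the JSON response on stdin.
# NOTE: the function name is a hypothetical helper; the pattern assumes
# compact JSON, where a non-empty result starts with "result":[{
has_metric_samples() {
  grep -q '"result":\[{'
}

# Example (requires the port-forward to be active):
#   curl -s 'http://localhost:9090/api/v1/query?query=dcgm_gpu_utilization' \
#     | has_metric_samples && echo "GPU metrics found" || echo "no GPU metrics"
```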


If no metrics are shown, you can get to the root cause by running:

kubectl get pods -n runai --selector=app=pod-gpu-metrics-exporter

There should be one GPU metrics exporter pod per node. For each pod, run:

kubectl logs -n runai <name> -c pod-nvidia-gpu-metrics-exporter

Where <name> is the name of one of the pod-gpu-metrics-exporter pods listed by the previous command. The logs should contain further details about the issue.
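On a cluster with several nodes, the two commands above can be combined into a loop that prints the logs of every exporter pod. The following is a minimal sketch; the function name `dump_exporter_logs` is hypothetical, and the limit of 50 log lines per pod is an arbitrary choice.

```shell
# Print the most recent log lines of every GPU metrics exporter pod.
# NOTE: the function name is a hypothetical helper for illustration.
dump_exporter_logs() {
  # `-o name` prints one "pod/<name>" entry per line.
  kubectl get pods -n runai --selector=app=pod-gpu-metrics-exporter -o name |
  while read -r pod; do
    echo "=== $pod ==="
    # Redirect stdin so the inner command cannot consume the pod list.
    kubectl logs -n runai "$pod" -c pod-nvidia-gpu-metrics-exporter --tail=50 </dev/null
  done
}

# Example (requires access to the cluster):
#   dump_exporter_logs
```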

A typical cause is skipping the prerequisite of installing the NVIDIA driver plugin. See Installing-Run-AI-on-an-on-premise-Kubernetes-Cluster.


Last update: July 28, 2020