Troubleshooting a Cluster Installation
Make sure only one default storage class is installed
The Run:AI cluster installation includes, by default, a storage class named local-path-provisioner, which is installed as the default storage class. In some cases your Kubernetes cluster may already have a default storage class installed. In such cases you should disable the local path provisioner: having two default storage classes disables both the internal database and some of the metrics.
To disable the local path provisioner, run:
kubectl edit runaiconfig -n runai
Add the following lines under spec:

local-path-provisioner:
  enabled: false
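To confirm the change was saved, you can print the resource back and check the flag; a sketch using the same runaiconfig resource edited above:

kubectl get runaiconfig -n runai -o yaml | grep -A 1 local-path-provisioner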
To diagnose the issue, run:
kubectl describe pod -n runai runai-db-0
In the output, look for a storage class error in the pod's events.
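You can also confirm that more than one storage class is marked as default by listing the cluster's storage classes; defaults are flagged with (default) next to their name:

kubectl get storageclass

If two entries show (default), disable the local path provisioner as described above.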
Pods are not created
Run:
kubectl get pods -n runai
You will get a list of pods. All pods should have the status Running. There could be various reasons why pods are in a different status. Most notably, pods use Run:AI images that are downloaded from the web. If your company employs an outbound firewall, you may need to open the URLs and ports mentioned in Network Requirements.
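To narrow down the problem, you can list only the pods that are not Running and inspect the events of a failing one; a sketch, where <pod-name> is a placeholder for a pod from that list:

kubectl get pods -n runai --field-selector=status.phase!=Running
kubectl describe pod -n runai <pod-name>

Image download failures typically appear in the events as ErrImagePull or ImagePullBackOff.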
GPU-related metrics are not shown
In the Admin portal (app.run.ai), metrics such as "number of GPUs" and "GPU utilization" do not show.
This typically means that there is a disconnect between the Kubernetes pods that require access to GPUs and the NVIDIA software.
This could happen if:
- Connecting GPU data requires a Kubernetes feature gate called KubeletPodResources. This feature gate is enabled by default as of Kubernetes 1.15, but must be enabled explicitly on older versions (see the sketch after this list). See https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/. Note that Run:AI has been tested on Kubernetes 1.15.
- NVIDIA prerequisites are not installed. See Run:AI GPU Cluster Prerequisites for NVIDIA-related prerequisites.
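As a minimal sketch of enabling the gate on versions older than 1.15, assuming the kubelet reads a KubeletConfiguration file at /var/lib/kubelet/config.yaml (the path may differ in your distribution), merge the following into that file on each node:

# excerpt of a KubeletConfiguration; merge into the existing file
featureGates:
  KubeletPodResources: true

Then restart the kubelet on each node (for example, sudo systemctl restart kubelet).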
To verify whether GPU metrics are exported, run:
kubectl port-forward -n runai prometheus-runai-prometheus-operator-prometheus-0 9090
Then, using your browser, go to http://localhost:9090. Verify that you see metrics for dcgm_gpu_utilization.
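If no browser is available on the machine running the port-forward, the same check can be done through the Prometheus HTTP API; this is a sketch using curl against the forwarded port:

curl 'http://localhost:9090/api/v1/query?query=dcgm_gpu_utilization'

An empty result array in the JSON response means the metric is not being exported.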
If no metrics are shown, you can get to the root cause by running:
kubectl get pods -n runai --selector=app=pod-gpu-metrics-exporter
There should be one GPU metrics exporter pod per node. For each pod, run:
kubectl logs -n runai <name> -c pod-nvidia-gpu-metrics-exporter
where <name> is one of the pod-gpu-metrics-exporter pod names from above. The logs should contain further details about the issue.
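To collect the logs of all exporter pods in one pass, a small loop can help; this sketch assumes bash and the container name used above:

for pod in $(kubectl get pods -n runai --selector=app=pod-gpu-metrics-exporter -o name); do
  echo "=== $pod ==="
  kubectl logs -n runai "$pod" -c pod-nvidia-gpu-metrics-exporter
done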
A typical issue is skipping the prerequisite of installing the NVIDIA device plugin. See Installing Run:AI on an On-Premise Kubernetes Cluster.
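To check whether the device plugin is functioning, you can verify that your nodes advertise GPU resources to Kubernetes; a quick sketch:

kubectl describe node | grep -i nvidia.com/gpu

If no nvidia.com/gpu lines appear under the nodes' Capacity and Allocatable sections, the device plugin (or the underlying NVIDIA driver) is not set up correctly.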