Verifying Cluster Health
The following tests verify the health of a Run:ai cluster.
Verify that data is sent to the cloud
Log in to <company-name>.run.ai/dashboards/now.
- Verify that all metrics in the overview dashboard are showing.
- Verify that all metrics are showing in the Nodes view.
- Go to Projects and create a new Project. Find the new Project using the CLI command:
runai list projects
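The lookup above can be scripted. The following is a minimal sketch, assuming that `runai list projects` prints one row per Project with the Project name in the first column (an assumption about the output layout, not something this page confirms). The function name `project_listed` and the Project name `team-a` are hypothetical.

```shell
# project_listed NAME
# Reads the output of `runai list projects` on stdin and succeeds (exit 0)
# if some row has NAME as its first column. The column layout is an assumption.
project_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}
```

For example, `runai list projects | project_listed team-a && echo "Project found"` prints the message only when a row for `team-a` is present.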
Verify that the Run:ai services are running
List the Run:ai pods and verify that all of them are in Running status and a ready state (1/1 or similar).
List the Run:ai deployments and check that all of them are in a ready state (1/1).
List the Run:ai DaemonSets. A DaemonSet runs one pod on each node it targets. Some of the Run:ai DaemonSets run on all nodes; others run only on nodes that contain GPUs. Verify that, for every DaemonSet, the desired number equals both the current and the ready numbers.
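The three checks above can be sketched as small filters over the standard `kubectl get` table output. This is a sketch under assumptions: it presumes the Run:ai services live in a namespace named `runai` (an assumption; adjust to your installation), and the function names are hypothetical helpers, not part of any Run:ai tooling. Each filter prints the names of resources that fail the check and prints nothing when everything is healthy.

```shell
# Each function reads `kubectl get ...` table output on stdin and skips
# the header row (NR > 1).

# Pods: column 2 is READY (x/y), column 3 is STATUS.
pods_not_ready() {
  awk 'NR > 1 { split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1 }'
}

# Deployments: column 2 is READY (x/y).
deployments_not_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# DaemonSets: columns 2, 3, 4 are DESIRED, CURRENT, READY.
daemonsets_not_ready() {
  awk 'NR > 1 && ($2 != $3 || $2 != $4) { print $1 }'
}
```

Under the namespace assumption, the checks would be run as `kubectl get pods -n runai | pods_not_ready`, `kubectl get deployments -n runai | deployments_not_ready`, and `kubectl get daemonsets -n runai | daemonsets_not_ready`.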
Submit a Job via the command-line interface
Submitting a Job verifies that the Run:ai scheduling service is working.
- Make sure that the Project you have created has a quota of at least 1 GPU.
- Submit a Job to that Project from the command line.
- Verify that the Job is in a Running state by listing the Jobs from the command line.
- Verify that the Job is showing in the Jobs area at <company-name>.run.ai/jobs.
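The command-line steps above might look like the following sketch. It assumes the Run:ai CLI subcommands `runai config project`, `runai submit` with `-i` (image) and `-g` (GPU count), and `runai list jobs`; the exact commands are not shown on this page, so treat them as assumptions and check them against your CLI version. The function name and the `RUNAI` override variable are hypothetical conveniences.

```shell
# Command to invoke. Override it (for example RUNAI=echo) for a dry run
# that only prints the arguments instead of contacting the cluster.
RUNAI="${RUNAI:-runai}"

# submit_test_job PROJECT JOB IMAGE
# Selects the Project, submits a Job that requests 1 GPU, then lists Jobs
# so the Running state can be checked. Stops at the first failing step.
submit_test_job() {
  "$RUNAI" config project "$1" &&
  "$RUNAI" submit "$2" -i "$3" -g 1 &&
  "$RUNAI" list jobs
}
```

A call such as `submit_test_job team-a job1 <image>` uses placeholder names; replace them with your Project, a Job name of your choice, and a container image available to the cluster.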
Submit a Job via the user interface
Log in to the Run:ai user interface and verify that you have a Researcher or Research Manager role. Go to the Jobs area. At the top right, press the button to create a Job. Once the form opens, submit a Job.