Verifying Cluster Health
The following tests verify the health of a Run:ai cluster.
Verify that data is sent to the cloud
Log in to the Run:ai user interface, then:
- Verify that all metrics in the overview dashboard are showing.
- Verify that all metrics are showing in the Nodes view.
- Go to Projects and create a new Project. Find the new Project using the CLI command:
runai list projects
Verify that the Run:ai services are running
Verify that all Run:ai pods are in a Running status and a ready state (1/1 or similar).
Check that all Run:ai deployments are in a ready state (1/1).
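The pod and deployment checks can be scripted. The following is a minimal sketch, not an official Run:ai tool. It assumes that the Run:ai services live in a namespace called `runai` (an assumption, adjust it to your installation) and that `kubectl` prints its default table output, with READY in the second column and, for pods, STATUS in the third. The helper names `not_ready_rows` and `not_running_pods` are hypothetical.

```shell
#!/bin/sh
# Assumption: Run:ai services are installed in the "runai" namespace.
NAMESPACE="${NAMESPACE:-runai}"

# Print every row whose READY column (field 2, "ready/total") is not fully ready.
# Reads default `kubectl get pods` or `kubectl get deployments` output on stdin.
not_ready_rows() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print }'
}

# Print every pod row whose STATUS column (field 3) is not "Running".
not_running_pods() {
  awk 'NR > 1 && $3 != "Running"'
}

# Usage against a live cluster (shown as comments, not executed here):
#   kubectl get pods -n "$NAMESPACE" | not_ready_rows
#   kubectl get pods -n "$NAMESPACE" | not_running_pods
#   kubectl get deployments -n "$NAMESPACE" | not_ready_rows
```

Empty output from each pipeline means the check passed. Pods that have legitimately finished (for example, a status of Completed) would also be listed, so treat any output as a prompt to look closer rather than as a definite failure.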
A DaemonSet runs a pod on every node it targets. Some of the Run:ai DaemonSets run on all nodes; others run only on nodes that contain GPUs. Verify that, for every DaemonSet, the desired number equals both the current number and the ready number.
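The DaemonSet comparison can be sketched the same way. This example assumes the default `kubectl get daemonsets` table layout, in which columns 2 to 4 are DESIRED, CURRENT and READY, and again assumes a namespace named `runai`. The function name `daemonsets_mismatched` is hypothetical.

```shell
#!/bin/sh
# Print every DaemonSet row whose DESIRED, CURRENT and READY counts
# (fields 2, 3 and 4 of the default kubectl table) are not all equal.
daemonsets_mismatched() {
  awk 'NR > 1 && !($2 == $3 && $3 == $4)'
}

# Usage against a live cluster (shown as a comment, not executed here):
#   kubectl get daemonsets -n runai | daemonsets_mismatched
```

No output means every DaemonSet has as many ready pods as desired.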
Submit a Job via the command-line interface
Submitting a Job lets you verify that the Run:ai scheduling service is working.
- Make sure that the Project you have created has a quota of at least 1 GPU
- Verify that the Job is in a Running state by running:
- Verify that the Job is showing in the Jobs area of the Run:ai user interface.
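The steps above can be sketched as a small script. Everything specific here is an assumption: the job name `health-check` and project `my-project` are hypothetical placeholders, the image is left for you to supply, and the `runai` invocations in the comments should be checked against the help output of your CLI version. The helper `job_is_running` is also hypothetical. It assumes only that the job listing prints one row per job, with the job name in the first column and the status as a separate word on the same row.

```shell
#!/bin/sh
# Hypothetical values; replace them with your own.
JOB_NAME="${JOB_NAME:-health-check}"
PROJECT="${PROJECT:-my-project}"

# Succeed if stdin contains a row whose first column is the given job name
# and one of whose later columns is exactly "Running".
job_is_running() {
  awk -v name="$1" '
    $1 == name { for (i = 2; i <= NF; i++) if ($i == "Running") found = 1 }
    END { exit !found }
  '
}

# Usage against a live cluster (shown as comments, not executed here):
#   runai config project "$PROJECT"
#   runai submit "$JOB_NAME" -i <your-image> -g 1
#   runai list jobs | job_is_running "$JOB_NAME" && echo "Job is running"
```

A Job usually takes a short while to start, so rerun the last command a few times before treating a failure as a scheduling problem.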
Submit a Job via the user interface
Log in to the Run:ai user interface and verify that you have a Research Manager role. Go to the Jobs area. At the top right, press the button to create a Job. Once the form opens, submit a Job.