Cluster Install

Below are instructions on how to install a Run:ai cluster.

Prerequisites

Before installing, please review the installation prerequisites listed in Run:ai GPU Cluster Prerequisites.

Important

We strongly recommend running the Run:ai pre-install script to verify that all prerequisites are met.

Install Run:ai

Log in to the Run:ai user interface at <company-name>.run.ai using the credentials provided by Run:ai Customer Support:

  • If no clusters are currently configured, you will see a Cluster installation wizard.
  • If a cluster has already been configured, use the menu on the top left and select Clusters. On the top left, click New Cluster.

Using the cluster wizard:

  • Choose a name for your cluster.
  • Choose the Run:ai version for the cluster.
  • Choose a target Kubernetes distribution (see table for supported distributions).
  • (SaaS and remote self-hosted cluster only) Enter a URL for the Kubernetes cluster. The URL need only be accessible within the organization's network. For more information, see here.
  • Press Continue.

On the next page:

  • (SaaS and remote self-hosted cluster only) Install a trusted certificate for the domain entered above.
  • Run the Helm command provided in the wizard.
  • In case of a failure, see the Installation troubleshooting guide.
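
Before running the Helm command from the wizard, it can help to confirm that the required command-line tools are available on the machine. The following is a minimal sketch, not part of the Run:ai tooling: the function name `missing_tools` is hypothetical, and the tool list (`helm`, `kubectl`) is an assumption based on the wizard providing a Helm command.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumed tool list: the wizard's command uses Helm; kubectl is included
    # only because it is the usual way to inspect the cluster afterwards.
    missing = missing_tools(["helm", "kubectl"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("helm and kubectl are available.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.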

Verify your cluster's health

  • Verify that the cluster status in the Run:ai Control Plane's Clusters Table is Connected.
  • Go to the Overview Dashboard and verify that the number of GPUs shown on the top right matches the GPU resources in your cluster, and that the list of machines with GPU resources appears on the bottom line.
  • In case of issues, see the Troubleshooting guide.
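
To cross-check the GPU count shown in the dashboard, you can compare it against what Kubernetes itself reports. The sketch below is an illustration only and assumes that GPUs are exposed through the `nvidia.com/gpu` resource (as the NVIDIA device plugin does) and that `kubectl` is configured for the cluster. The function names are hypothetical.

```python
import json
import subprocess
from typing import Dict


def gpus_per_node(nodes_json: str) -> Dict[str, int]:
    """Map node name -> allocatable NVIDIA GPUs, skipping nodes with none.

    `nodes_json` is the output of `kubectl get nodes -o json`.
    """
    result: Dict[str, int] = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Assumption: GPUs are advertised under the "nvidia.com/gpu" resource.
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


def fetch_nodes_json() -> str:
    """Run `kubectl get nodes -o json` against the current kubeconfig context."""
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

Calling `gpus_per_node(fetch_nodes_json())` returns a per-node mapping. The sum of its values can be compared with the GPU total in the Overview Dashboard, and its keys with the list of machines shown there.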

Researcher Authentication

If you will be using the Run:ai command-line interface or sending YAMLs directly to Kubernetes, you must now set up Researcher Access Control.

Cluster Table

After you have installed your cluster on the platform, you will see it appear in the Cluster Table. The Cluster Table provides a quick and easy way to see the status of your cluster.

In the left menu, press Clusters to view the cluster table. Use Add filter to add one or more filters based on the columns in the table. In the Contains pane, you can enter partial or complete text. Filter text is not case sensitive. To remove a filter, press X next to it.
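
The matching behavior described above (partial text, not case sensitive) can be illustrated with a short sketch. This is not the user interface's implementation. The function name `contains_filter` and the sample rows are hypothetical and serve only to show what a "contains" filter matches.

```python
from typing import Dict, Iterable, List


def contains_filter(
    rows: Iterable[Dict[str, str]], column: str, text: str
) -> List[Dict[str, str]]:
    """Keep rows whose `column` value contains `text`, ignoring case."""
    needle = text.casefold()
    return [row for row in rows if needle in row.get(column, "").casefold()]


# Hypothetical rows, shaped like two lines of the cluster table.
rows = [
    {"Cluster": "prod-gpu", "Status": "Connected"},
    {"Cluster": "Dev-GPU", "Status": "Disconnected"},
]

by_name = contains_filter(rows, "Cluster", "gpu")
by_status = contains_filter(rows, "Status", "connected")
```

Because matching is by substring, filtering the Status column on `connected` keeps both rows in this sketch: "Disconnected" also contains that text.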

The table provides the following columns:

  • Cluster: the name of the cluster.
  • Kubernetes distribution: the flavor of Kubernetes distribution.
  • Kubernetes version: the version of Kubernetes installed.
  • Status: the status of the cluster. For more information, see Cluster Status. Hover over the information icon to see a short description and links to troubleshooting.
  • Creation time: the time at which the cluster was created.
  • URL: the URL that was given to the cluster at the time of creation.
  • Run:ai cluster version: the Run:ai version installed on the cluster.
  • Run:ai cluster UUID: the unique ID of the cluster.

Cluster Status

The following table describes the different statuses that a cluster could be in.

Status                | Description
Waiting to connect    | The cluster has never been connected.
Disconnected          | There is no communication from the cluster to the Control Plane. This may be due to a network issue.
Missing prerequisites | At least one of the Mandatory Prerequisites has not been met.
Service issues        | At least one of the Services is not working properly. You can view the list of nonfunctioning services for more information.
Connected             | All services are connected and up and running.

See the Troubleshooting guide to help troubleshoot issues in the cluster.
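
If you script around the cluster status, the statuses listed in this section can be modeled as a small lookup. The sketch below is illustrative only: the enum and the `next_step` helper are hypothetical, and the suggested follow-ups are paraphrases of the status descriptions, not official guidance.

```python
from enum import Enum


class ClusterStatus(Enum):
    WAITING_TO_CONNECT = "Waiting to connect"
    DISCONNECTED = "Disconnected"
    MISSING_PREREQUISITES = "Missing prerequisites"
    SERVICE_ISSUES = "Service issues"
    CONNECTED = "Connected"


# Suggested follow-ups, paraphrased from the status descriptions (assumption).
NEXT_STEP = {
    ClusterStatus.WAITING_TO_CONNECT: "Cluster has never connected: confirm the Helm command from the wizard was run.",
    ClusterStatus.DISCONNECTED: "No communication to the Control Plane: check network connectivity.",
    ClusterStatus.MISSING_PREREQUISITES: "Review the mandatory prerequisites.",
    ClusterStatus.SERVICE_ISSUES: "View the list of nonfunctioning services.",
    ClusterStatus.CONNECTED: "No action needed.",
}


def next_step(status_text: str) -> str:
    """Return a suggested follow-up for a status string from the table.

    Raises ValueError if the text is not one of the listed statuses.
    """
    return NEXT_STEP[ClusterStatus(status_text)]
```

For example, `next_step("Service issues")` returns the pointer to the list of nonfunctioning services, while an unrecognized status raises `ValueError` rather than being silently ignored.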

Customize your installation

To customize specific aspects of the cluster installation, see Customize cluster installation.

Set Node Roles (Optional)

When installing a production cluster you may want to:

  • Set one or more Run:ai system nodes. These are nodes dedicated to Run:ai software.
  • Direct CPU-only jobs to dedicated nodes without GPUs. Machine learning frequently involves jobs that require CPU but not GPU, and running them on non-GPU nodes avoids overloading the GPU machines.
  • Limit Run:ai to specific nodes in the cluster.

To perform these tasks, see Set Node Roles.

Next Steps