Preparing for a Run:ai OpenShift installation
The following section provides IT with the information needed to prepare for a Run:ai installation.
Prerequisites
See the Prerequisites section above.
Software artifacts
Connected
You should receive a file, runai-reg-creds.yaml, from Run:ai Customer Support. The file provides access to the Run:ai Container Registry.
SSH into a node that has oc access to the cluster (oc is the OpenShift command line) and Docker installed.
Run the following to enable image download from the Run:ai Container Registry on Google Cloud:
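A minimal sketch of this step, assuming runai-reg-creds.yaml is a ready-to-apply Kubernetes manifest (follow any instructions that come with the file). The helper name is hypothetical; oc apply -f is a standard OpenShift CLI call. If the manifest targets the runai-backend namespace, that project must exist before you apply it.

```shell
# Hypothetical helper: apply the registry-credentials manifest received from
# Run:ai Customer Support. Assumes the file is a plain Kubernetes manifest.
apply_runai_reg_creds() {
  local creds_file="${1:-runai-reg-creds.yaml}"
  if [ ! -f "$creds_file" ]; then
    echo "error: $creds_file not found" >&2
    return 1
  fi
  oc apply -f "$creds_file"
}
# Usage: apply_runai_reg_creds                       (uses ./runai-reg-creds.yaml)
#        apply_runai_reg_creds /path/to/runai-reg-creds.yaml
```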
Air-gapped
You should receive a single file, runai-<version>.tar, from Run:ai Customer Support.
Run:ai assumes the existence of a Docker registry for images, most likely one installed within the organization. The installation requires the network address and port of the registry (referenced below as <REGISTRY_URL>).
SSH into a node that has oc access to the cluster (oc is the OpenShift command line) and Docker installed.
To extract the Run:ai files, replace <VERSION> in the command below and run:
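A sketch of the extraction step using standard tar. The helper name is hypothetical; it takes the version string and unpacks runai-<VERSION>.tar into the current directory.

```shell
# Hypothetical helper: unpack the air-gapped bundle runai-<VERSION>.tar
# into the current directory.
extract_runai_bundle() {
  local version="$1"
  if [ -z "$version" ]; then
    echo "usage: extract_runai_bundle VERSION" >&2
    return 1
  fi
  tar -xvf "runai-${version}.tar"
}
```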
Upload the images to a local Docker registry. Set the Docker registry address in the form NAME:PORT (do not add https):
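A quick format check can catch the common mistake of including a scheme such as https://. This is a sketch with a hypothetical helper name; the pattern accepts host names and IPv4 addresses followed by a port, and does not handle IPv6 literals.

```shell
# Hypothetical helper: succeed only if the argument looks like NAME:PORT
# with no scheme (host name or IPv4 address, then a 1-5 digit port).
validate_registry_url() {
  case "$1" in
    http://*|https://*)
      echo "error: remove the scheme, use NAME:PORT" >&2
      return 1 ;;
  esac
  if printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9.-]+:[0-9]{1,5}$'; then
    return 0
  fi
  echo "error: expected NAME:PORT, got '$1'" >&2
  return 1
}
# Usage: validate_registry_url registry.example.com:5000
```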
Run the following script (you must have at least 20GB of free disk space to run it):
(If Docker is configured to run as non-root, sudo is not required.)
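Because the script needs at least 20GB of free disk space, checking beforehand can save a failed run. This sketch uses a hypothetical helper that treats 20GB as 20 GiB (20 * 1024^3 bytes) and checks the filesystem holding the given directory. Which filesystem actually fills up (for example, Docker's data directory, often /var/lib/docker) depends on your setup, so verify that assumption.

```shell
# Hypothetical pre-check: succeed if the filesystem holding DIR (default: the
# current directory) has at least REQUIRED_GB GiB available (default: 20).
has_free_space_gb() {
  local dir="${1:-.}" required_gb="${2:-20}" avail_kb
  # -P: POSIX one-line-per-filesystem output; -k: sizes in 1024-byte blocks.
  # Column 4 of the data row is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}
# Usage: has_free_space_gb /var/lib/docker 20 || echo "not enough free space"
```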
The script should create a file named custom-env.yaml, which will be used by the control-plane installation.
Private Docker Registry (optional)
To access the organization's Docker registry, you must set the registry's credentials (imagePullSecret).
Create a secret named runai-reg-creds in the runai-backend namespace based on your existing credentials. The configuration will be copied over to the runai namespace at cluster install. For more information, see Allowing pods to reference images from other secured registries.
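One way to do this, following the approach in the OpenShift documentation referenced above, is to build the secret from an existing Docker config file. The helper name and the default path ~/.docker/config.json are assumptions; run it after the runai-backend project exists.

```shell
# Hypothetical helper: create the runai-reg-creds pull secret in runai-backend
# from an existing Docker config file (default: ~/.docker/config.json).
create_runai_pull_secret() {
  local docker_config="${1:-$HOME/.docker/config.json}"
  if [ ! -f "$docker_config" ]; then
    echo "error: $docker_config not found" >&2
    return 1
  fi
  oc create secret generic runai-reg-creds \
    --namespace runai-backend \
    --from-file=.dockerconfigjson="$docker_config" \
    --type=kubernetes.io/dockerconfigjson
}
```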
Configure your environment
Create OpenShift project
The Run:ai control plane uses a namespace (or project, in OpenShift terminology) named runai-backend. You must create it before installing:
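A minimal sketch that creates the project only if it does not already exist, so it is safe to re-run. The helper name is hypothetical; oc get project and oc new-project are standard OpenShift CLI commands.

```shell
# Hypothetical helper: create the runai-backend project unless it already exists.
ensure_runai_backend_project() {
  if oc get project runai-backend >/dev/null 2>&1; then
    return 0
  fi
  oc new-project runai-backend
}
```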
Local Certificate Authority (air-gapped only)
In air-gapped environments, you must prepare the public key of your local certificate authority as described here. It must be installed in Kubernetes for the installation to succeed.
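Before installing the certificate, it can help to confirm that the file really is a PEM-encoded X.509 certificate. A sketch using openssl; the helper name and file path are hypothetical, and it only checks that the file parses, not that it belongs to the right CA.

```shell
# Hypothetical pre-check: print the subject and expiry date of a PEM
# certificate; fails if the file is missing or does not parse as X.509.
check_ca_cert() {
  if [ ! -f "$1" ]; then
    echo "error: no such file: $1" >&2
    return 1
  fi
  openssl x509 -in "$1" -noout -subject -enddate
}
# Usage: check_ca_cert ./local-ca.pem
```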
Mark Run:ai system workers (optional)
You can optionally set the Run:ai control plane to run on specific nodes. Kubernetes will attempt to schedule Run:ai pods on these nodes. If these nodes lack resources, the Run:ai pods will be scheduled on another, non-labeled node.
To set system worker nodes run:
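A minimal sketch, assuming the node label key node-role.kubernetes.io/runai-system=true (an assumption; confirm the exact key against the Run:ai documentation for your version). The helper name is hypothetical; oc label with --overwrite is a standard OpenShift CLI call.

```shell
# Hypothetical helper: label one or more nodes as Run:ai system nodes.
# The label key is an assumption; verify it for your Run:ai version.
label_runai_system_nodes() {
  if [ "$#" -eq 0 ]; then
    echo "usage: label_runai_system_nodes NODE..." >&2
    return 1
  fi
  local node
  for node in "$@"; do
    oc label node "$node" node-role.kubernetes.io/runai-system=true --overwrite || return 1
  done
}
# Usage: label_runai_system_nodes worker-1 worker-2
```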
Warning
Do not select the Kubernetes master as a runai-system node. This may cause Kubernetes to stop working (specifically if the Kubernetes API Server is configured on port 443 instead of the default 6443).
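A sketch of a pre-check that refuses nodes carrying the master or control-plane role label. It assumes these standard OpenShift/Kubernetes role labels mark the control plane in your cluster; the helper name is hypothetical.

```shell
# Hypothetical pre-check: exit 0 if NODE is safe to label (not control plane),
# 1 if it carries a master/control-plane role label, 2 if the lookup fails.
is_safe_runai_system_node() {
  local row
  row=$(oc get node "$1" --show-labels --no-headers) || return 2
  case "$row" in
    *node-role.kubernetes.io/master*|*node-role.kubernetes.io/control-plane*)
      echo "refusing: $1 is a control-plane node" >&2
      return 1 ;;
  esac
  return 0
}
```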
Additional permissions
As part of the installation, you will be required to install the Control Plane and Cluster Helm charts. The Helm charts require Kubernetes administrator permissions. You can review the exact permissions they request by using the --dry-run flag on both Helm charts.
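To make the --dry-run output easier to review, the rendered manifests can be filtered for RBAC objects. A sketch with a hypothetical helper that reads manifests on stdin and counts top-level Role, RoleBinding, ClusterRole, and ClusterRoleBinding objects; the release, chart, and namespace names in the usage line are placeholders.

```shell
# Hypothetical helper: count top-level RBAC kinds in rendered manifests on stdin.
# Anchoring on ^kind: skips indented references such as roleRef.kind.
summarize_rbac_kinds() {
  grep -E '^kind: (ClusterRole|ClusterRoleBinding|Role|RoleBinding)[[:space:]]*$' \
    | sed -E 's/^kind: ([A-Za-z]+).*/\1/' \
    | sort | uniq -c | awk '{print $2, $1}'
}
# Usage: helm install <release> <chart> --namespace <namespace> --dry-run | summarize_rbac_kinds
```

The counts only show which RBAC objects a chart creates; the granted permissions are in the rules of each object, which the full --dry-run output lists.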
Validate prerequisites
Once you believe that the Run:ai prerequisites and preparations are met, we highly recommend installing and running the Run:ai pre-install diagnostics script. The tool:
- Tests the below requirements as well as additional failure points related to Kubernetes, NVIDIA, storage, and networking.
- Looks at additional components installed and analyzes their relevancy to a successful Run:ai installation.
To use the script, download its latest version and run:
If the script fails, or if it succeeds but the Kubernetes system contains components other than Run:ai, locate the file runai-preinstall-diagnostics.txt in the current directory and send it to Run:ai Technical Support.
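A sketch of a thin wrapper around the diagnostics run. The script's file name and flags are not given here, so the wrapper takes its path as an argument; the helper itself is hypothetical. It preserves the script's exit status and prints the report path if runai-preinstall-diagnostics.txt was written to the current directory.

```shell
# Hypothetical wrapper: run the downloaded diagnostics tool (path first, any
# extra flags after it), keep its exit status, and point at the report file.
run_runai_diagnostics() {
  local tool="$1" rc=0
  if [ -z "$tool" ]; then
    echo "usage: run_runai_diagnostics PATH [FLAGS...]" >&2
    return 2
  fi
  shift
  "$tool" "$@" || rc=$?
  if [ -f runai-preinstall-diagnostics.txt ]; then
    echo "report: $PWD/runai-preinstall-diagnostics.txt"
  fi
  return "$rc"
}
```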
For more information on the script, including additional command-line flags, see here.
Next steps
Continue with installing the Run:ai Control Plane.