Prerequisites

Below are the prerequisites for a cluster on which Run:AI is installed.

Kubernetes Software

Run:AI requires Kubernetes 1.16 or above. Kubernetes 1.18 is recommended (as of October 2020).

If you are using Red Hat OpenShift, the minimum version is OpenShift 4.3, which runs on Kubernetes 1.16.
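The version floor above can be checked programmatically. The following is a minimal sketch, assuming the version string has been obtained separately (for example, from the cluster's reported server version); the function names `parse_major_minor` and `meets_minimum` are hypothetical and not part of Run:AI.

```python
import re
from typing import Tuple

# Minimum Kubernetes version stated in this document.
MIN_KUBERNETES = (1, 16)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.18.3' or '1.16'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str) -> bool:
    """Return True if the version is at or above the documented minimum."""
    return parse_major_minor(version) >= MIN_KUBERNETES
```

Comparing `(major, minor)` tuples rather than raw strings avoids ordering mistakes such as treating "1.9" as newer than "1.16".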

NVIDIA Driver

Run:AI requires NVIDIA driver version 384.81 or later to be installed on all GPU nodes due to this dependency.
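One way to verify the driver requirement on a GPU node is sketched below. It assumes `nvidia-smi` is on the node's PATH and uses its `--query-gpu=driver_version` query; the helper names are hypothetical, and the comparison assumes purely numeric, dot-separated version strings.

```python
import subprocess
from typing import List, Tuple

# Minimum NVIDIA driver version stated in this document.
MIN_DRIVER = (384, 81)


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '450.80.02' into (450, 80, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_ok(version: str) -> bool:
    """Return True if the driver is at or above the documented minimum."""
    return parse_driver_version(version) >= MIN_DRIVER


def installed_driver_versions() -> List[str]:
    """Ask nvidia-smi for the driver version, one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a GPU node, `all(driver_ok(v) for v in installed_driver_versions())` would then indicate whether every GPU reports a sufficiently recent driver.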

Hardware Requirements

  • (Production only) Dedicated Run:AI System Nodes: To reduce downtime and save CPU cycles on expensive GPU machines, we recommend that production deployments contain at least one dedicated worker machine, designated for Run:AI software:

    • 4 CPUs
    • 8GB of RAM
    • 50GB of Disk space
  • Shared data volume: Run:AI uses Kubernetes to abstract away the machine on which a container is running:

    • Researcher containers: The Researcher's containers need to be able to access data from any machine in a uniform way, so as to access training data and code as well as save checkpoints, weights, and other machine-learning related artifacts.
    • The Run:AI system needs to save data on a storage device that is not dependent on a specific node.

    Typically, this is achieved via Network File System (NFS) or network-attached storage (NAS). NFS is usually the preferred method for Researchers, who may require multi-read/write capabilities.

  • Docker Registry: With Run:AI, Workloads are based on Docker images. For container images to run on any machine, they must be downloaded from a Docker registry rather than reside on the local machine (though the latter is also possible). You can use a public registry such as Docker Hub or set up a local registry on-premises (preferably on a dedicated machine). Run:AI can assist with setting up the registry.

  • Kubernetes: Though out of scope for this document, a production Kubernetes installation requires separate nodes for the Kubernetes master.
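The sizing listed above for a dedicated Run:AI system node (4 CPUs, 8GB of RAM, 50GB of disk) can be expressed as a simple check. This is a minimal sketch that treats those figures as minimums, which is an assumption; the `NodeResources` type and `shortfalls` function are hypothetical, and the caller is expected to supply the node's actual figures.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeResources:
    cpus: int
    ram_gb: float
    disk_gb: float


# Figures from the hardware list above, treated here as minimums (assumption).
SYSTEM_NODE_MINIMUM = NodeResources(cpus=4, ram_gb=8, disk_gb=50)


def shortfalls(node: NodeResources,
               minimum: NodeResources = SYSTEM_NODE_MINIMUM) -> List[str]:
    """Return a human-readable list of resources that fall below the minimum."""
    problems = []
    if node.cpus < minimum.cpus:
        problems.append(f"CPUs: {node.cpus} < {minimum.cpus}")
    if node.ram_gb < minimum.ram_gb:
        problems.append(f"RAM: {node.ram_gb}GB < {minimum.ram_gb}GB")
    if node.disk_gb < minimum.disk_gb:
        problems.append(f"Disk: {node.disk_gb}GB < {minimum.disk_gb}GB")
    return problems
```

An empty list means the candidate machine satisfies all three figures; otherwise each entry names the resource that is short.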

Network Requirements

The Run:AI user interface runs from the cloud. All container nodes must be able to connect to the Run:AI cloud. Inbound connectivity (connecting from the cloud into nodes) is not required. If outbound connectivity is proxied or limited, the following exceptions should be applied:

During Installation

Run:AI is installed on top of the Kubernetes cluster. The installation accesses the web to download images and packages from several repositories. Some organizations place limitations on what you can pull from the internet. The following list shows the various solution components and their origin:

| Name | Description | URLs | Ports |
|---|---|---|---|
| Run:AI Repository | The Run:AI Package Repository is hosted on Run:AI’s account on Google Cloud | runai-charts.storage.googleapis.com | 443 |
| Docker Images Repository | Various Run:AI images | hub.docker.com, gcr.io/run-ai-prod | 443 |
| Docker Images Repository | Various third party Images | quay.io | 443 |

Post Installation

In addition, once running, Run:AI sends metrics to two destinations:

| Name | Description | URLs | Ports |
|---|---|---|---|
| Grafana | Grafana Metrics Server | prometheus-us-central1.grafana.net | 443 |
| Run:AI | Run:AI Cloud instance | app.run.ai | 443 |
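Before installing, it can help to confirm that a node can actually reach the hosts listed in this section on port 443. The sketch below attempts a direct TCP connection to each host; it does not route through an HTTP proxy, so in a proxied environment it only shows whether direct access is open. The function names are hypothetical, and `gcr.io` stands in for the `gcr.io/run-ai-prod` path, since a TCP check targets the host only.

```python
import socket
from typing import Callable, Iterable, List

# Hosts taken from the installation and post-installation lists above.
INSTALL_HOSTS = [
    "runai-charts.storage.googleapis.com",
    "hub.docker.com",
    "gcr.io",
    "quay.io",
]
RUNTIME_HOSTS = [
    "prometheus-us-central1.grafana.net",
    "app.run.ai",
]


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a direct TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def unreachable(hosts: Iterable[str],
                probe: Callable[[str], bool] = can_connect) -> List[str]:
    """Return the hosts for which the probe fails, preserving input order."""
    return [host for host in hosts if not probe(host)]
```

Running `unreachable(INSTALL_HOSTS + RUNTIME_HOSTS)` on a container node returns the hosts that still need a firewall or proxy exception. The `probe` parameter exists so the logic can be exercised without network access.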

User requirements

Usage of containers and images: The individual Researcher's work is based on container images. Containers allow IT to create standard software environments based on a mix and match of various cutting-edge software.


Last update: January 3, 2021