
Prerequisites

Below are the prerequisites for a cluster on which Run:AI is installed.

Kubernetes Software

Run:AI requires Kubernetes 1.15 or above. Kubernetes 1.17 is recommended (as of June 2020).

If you are using Red Hat OpenShift, the minimal supported version is OpenShift 4.3, which runs on Kubernetes 1.16.
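To verify which version your cluster is running, you can use the following minimal check (the exact output format varies by client version):

```shell
# Verify the Kubernetes server version (1.15 or above required; 1.17 recommended)
kubectl version --short

# On Red Hat OpenShift, also verify the cluster version (4.3 or above required)
oc version
```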

NVIDIA Driver

Run:AI requires all GPU nodes to have NVIDIA driver version 384.81 or later installed, a dependency of the underlying NVIDIA Kubernetes software stack.
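To check the installed driver version on a GPU node, for example:

```shell
# Print the installed NVIDIA driver version on this node (must be 384.81 or later)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```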

Hardware Requirements

  • Kubernetes: Dedicated CPU-only worker node: To save on expensive GPU-based hardware, we recommend (though it is not mandatory) a dedicated, CPU-only worker machine. Run:AI requires the following resources (see the verification sketch after this list):

    • 4 CPUs
    • 4GB of RAM
    • At least 20GB of Disk space
  • Shared data volume: Run:AI, via Kubernetes, abstracts away the machine on which a container runs. For containers to run anywhere, they must be able to access data from any machine in a uniform way: training data and code, as well as saved checkpoints, weights, and other machine-learning artifacts. This typically requires a NAS (network-attached storage), which allows any node to reach storage outside the local machine, or an object store serving a similar purpose (an example NFS-backed volume is sketched after this list).

  • Docker registry: With Run:AI, workloads are based on Docker images. For container images to run on any machine, the images must be downloadable from a Docker registry rather than reside only on a local machine (though that is also possible). You can use a public registry such as Docker Hub or set up a local registry on-premises. Run:AI can assist with setting up the registry.
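The following sketch shows one way to verify and set up the first two items. The node name, NFS server address, and export path are placeholders for your own environment; an NFS-backed PersistentVolume is only one possible way to provide a shared data volume:

```shell
# Check that a candidate CPU-only worker node has enough allocatable resources
kubectl describe node <node-name> | grep -A 6 Allocatable

# Hypothetical example: expose an NFS share to all nodes as a shared data volume.
# The server address and export path below are placeholders.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany    # any node may mount the volume read-write
  nfs:
    server: nfs.example.com
    path: /exports/training-data
EOF
```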

Network Requirements

The Run:AI user interface runs from the cloud. All nodes running containers must be able to connect to the Run:AI cloud. Inbound connectivity (connecting from the cloud into the nodes) is not required. If outbound connectivity is proxied or limited, the following exceptions should be applied:

During Installation

Run:AI is installed over the Kubernetes cluster. The installation accesses the web to download various images from container registries. Some organizations place limitations on what can be pulled from the internet. The following table shows the various solution components and their origin:

| Name | Description | URLs | Ports |
|------|-------------|------|-------|
| Run:AI Repository | The Run:AI package repository, hosted on Run:AI's account on Google Cloud | runai-charts.storage.googleapis.com | 443 |
| Docker Images Repository | Various Run:AI images | hub.docker.com, gcr.io/run-ai-prod | 443 |
| Docker Images Repository | Various third-party images | quay.io | 443 |

Post Installation

In addition, once running, Run:AI sends metrics to two destinations:

| Name | Description | URLs | Ports |
|------|-------------|------|-------|
| Grafana | Grafana metrics server | prometheus-us-central1.grafana.net | 443 |
| Run:AI | Run:AI cloud instance | app.run.ai | 443 |
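A quick way to confirm the required outbound connectivity from a cluster node is to probe each endpoint over HTTPS, as in the minimal sketch below (some endpoints may legitimately return redirect or error status codes while still being reachable):

```shell
# Probe outbound HTTPS connectivity to all endpoints listed above
for host in runai-charts.storage.googleapis.com hub.docker.com gcr.io quay.io \
            prometheus-us-central1.grafana.net app.run.ai; do
  curl -sS -o /dev/null --connect-timeout 5 -w "%{http_code} $host\n" "https://$host" \
    || echo "FAILED $host"
done
```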

User requirements

Usage of containers and images: The individual researcher's work is based on container images. Containers allow IT to create standard software environments by mixing and matching various cutting-edge software components.

Fractional GPU Requirements

The Run:AI platform provides a unique technology that allows the sharing of a single GPU between multiple containers. Each container receives an isolated subset of the GPU memory. For more details see Walk-through: Using GPU Fractions.
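For example, assuming the Run:AI CLI is installed, a workload can request a fraction of a GPU at submission time. The job name and image below are placeholders; see the walk-through above for the authoritative syntax:

```shell
# Submit a job that receives an isolated half of a single GPU's memory
runai submit frac-job -i tensorflow/tensorflow:latest-gpu --gpu 0.5
```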

This technology has more stringent software requirements than the rest of the Run:AI system. Specifically, virtualization has been tested on:

  • NVIDIA device driver 410.104 or later
  • CUDA 9.0 or later
  • TensorFlow, Keras or PyTorch

We continue to test the technology with additional software.


Last update: August 19, 2020