Integrate Run:AI with Kubeflow

Kubeflow is a platform for data scientists who want to build and experiment with ML pipelines. Kubeflow is also for ML engineers and operational teams who want to deploy ML systems to various environments for development, testing, and production-level serving.

This document describes the process of using Kubeflow in conjunction with Run:AI. Kubeflow submits jobs that are scheduled via Run:AI.

Kubeflow is a set of technologies. This document discusses Kubeflow Notebooks and Kubeflow Pipelines.

Install Kubeflow

Use the default installation to install Kubeflow.

Install Run:AI Cluster

When installing Run:AI, customize the cluster installation as follows:

  • Set mpi to false, as it conflicts with Kubeflow.
  • Set createNamespaces to false, as Kubeflow uses its own namespace convention.
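
As a sketch, these settings could be placed in a values file passed to the Run:AI cluster installation. The flat key layout below mirrors the setting names above; the exact nesting may differ between Run:AI chart versions.

```yaml
# Illustrative excerpt of a Run:AI cluster values file.
# Key names follow the settings listed above; check your
# chart version for the exact paths.
mpi: false                # avoid conflict with Kubeflow's MPI operator
createNamespaces: false   # Kubeflow manages its own namespaces
```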

Create Run:AI Projects

Kubeflow uses the namespace convention kubeflow-<username>. Follow the four steps here to set up Run:AI projects and link them with Kubeflow namespaces.

Verify that the association has worked by running:

kubectl get rolebindings -n <KUBEFLOW-NAMESPACE>

Confirm that role bindings with names starting with runai- were created.

Kubeflow, Users and Kubernetes Namespaces

Kubeflow has a multi-user architecture. A user has a Kubeflow profile which maps to a Kubernetes Namespace. This is similar to the Run:AI concept where a Run:AI Project is mapped to a Kubernetes namespace.
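For illustration, a Kubeflow Profile resource creates the per-user namespace. The profile name and owner below are placeholders; the profile name becomes the namespace name, following the kubeflow-<username> convention.

```yaml
# Illustrative Kubeflow Profile; creating it provisions the
# namespace kubeflow-jane for the given user.
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: kubeflow-jane        # placeholder; becomes the namespace name
spec:
  owner:
    kind: User
    name: jane@example.com   # placeholder user identity
```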

Kubeflow Notebooks

When starting a Kubeflow Notebook, you select a Kubeflow configuration. A Kubeflow configuration allows you to inject additional settings into the notebook, such as environment variables. To use Kubeflow with Run:AI you will use configurations to inject:

  • The name of the Run:AI project
  • Allocation of a fraction of a GPU, if required

Whole GPUs

To use Run:AI with whole GPUs (no fractions), apply the following configuration:

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: runai-non-fractional
  namespace: <KUBEFLOW-NAMESPACE>
spec:
  desc: "Use Run:AI scheduler (whole GPUs)"
  env:
    - name: RUNAI_PROJECT 
      value: "<PROJECT>"
  selector:
    matchLabels:
      runai-non-fractional: "true"  # key must be identical to metadata.name

Where <KUBEFLOW-NAMESPACE> is the name of the namespace associated with the Kubeflow user and <PROJECT> is the name of the Run:AI project.

Within the Kubeflow Notebook creation form, select the new configuration as well as the number of GPUs required.

Fractions

The Kubeflow Notebook creation form only allows the selection of 1, 2, 4, or 8 GPUs. It is not possible to select a portion of a GPU (e.g. 0.5). As such, within the form, select None in the GPU box together with the following configuration:

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: runai-half-gpu
  namespace: <KUBEFLOW-NAMESPACE>
spec:
  desc: "Allocate 0.5 GPUs via Run:AI scheduler"
  env:
    - name: RUNAI_PROJECT 
      value: "<PROJECT>"
    - name: RUNAI_GPU_FRACTION
      value: "0.5"
  selector:
    matchLabels:
      runai-half-gpu: "true"  # key must be identical to metadata.name

Similar configurations can be created for GPU fractions other than 0.5.

Kubeflow Pipelines

Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

As with Kubeflow Notebooks, the goal of this section is to run pipeline jobs within the context of Run:AI.

To create a Kubeflow pipeline, you:

  • Write code using the Kubeflow Pipeline SDK.
  • Package it into a single compressed file.
  • Upload the file into Kubeflow and set it up.

The example code provided here shows how to augment pipeline code to use Run:AI.

Whole GPUs

Add the following to the pipeline code:

_training = training_op()
...
_training.add_pod_label('runai', 'true')
_training.add_pod_label('project', '<PROJECT>')

Where <PROJECT> is the Run:AI project name. See example code here.

Compile the code by running:

dsl-compile --py kubeflow-runai-one-gpu.py --output kubeflow-runai-one-gpu.tar.gz

(dsl-compile is part of the Kubeflow Pipelines Python SDK.)

Fractions

To allocate half a GPU, add the following to the pipeline code:

_training = training_op()
...
_training.add_pod_label('runai', 'true')
_training.add_pod_label('project', '<PROJECT>')
_training.add_pod_annotation('gpu-fraction', '0.5')

Where <PROJECT> is the Run:AI project name. See example code here.

Compile the code as described above.


Last update: August 6, 2021