
Walk-through: Elasticity, Dynamically Stretch or Compress a Workload's GPU Allocation

Introduction

Elasticity allows unattended training workloads to shrink or expand based on the cluster's current GPU availability.

  • Shrinking a training job allows your workload to run on a smaller number of GPUs than the researcher's code was originally written for.
  • Expanding a training job allows your workload to run on more GPUs than the researcher's code was originally written for.
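
For the feature to pay off, the training code itself has to tolerate a GPU count that differs from what it requested. The following is a minimal, generic PyTorch sketch of this idea (it is not the Run:AI Elasticity library, and the model is a placeholder): the code sizes its parallelism from whatever GPUs are actually visible at start-up.

    import torch
    import torch.nn as nn

    def build_model():
        # Placeholder model for illustration only.
        model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))
        gpu_count = torch.cuda.device_count()   # GPUs actually visible to the job
        if gpu_count > 1:
            # Expansion case: spread each batch across every allocated GPU.
            model = nn.DataParallel(model)
        device = torch.device("cuda" if gpu_count > 0 else "cpu")
        return model.to(device), gpu_count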

Prerequisites

To complete this walk-through you must have Run:AI installed on your cluster and the Run:AI command-line interface (runai) installed on your machine.

Step by Step Walk-through

Setup

  • A GPU cluster with a single node of 2 GPUs.

    • If the cluster contains more than one node, use Node affinity to simulate a single node or use more filler jobs as described below.
    • If the cluster nodes contain more than 2 GPUs, you can create an interactive job on a different project to consume the remaining GPUs.
  • Open the Run:AI user interface at https://app.run.ai

  • Login
  • Go to "Projects"
  • Add a project named "team-a"
  • Allocate 2 GPUs to the project

Expansion

  • At the command-line run:

    runai project set team-a
    runai submit elastic1 -i gcr.io/run-ai-demo/quickstart -g 1 --elastic
    
  • This would start an unattended training job for team-a

  • The job is based on a sample docker image gcr.io/run-ai-demo/quickstart. We named the job elastic1 and requested 1 GPU for it
  • The flag --elastic enables the Elasticity feature
  • Follow up on the job's progress by running:

    runai list
    

    The result:

    (screenshot: elasticity1.png)

    Discussion

    • The job has requested 1 GPU but has been allocated 2, since 2 are currently available.
    • The code needs to be ready to use more GPUs than it requested; otherwise, the extra GPUs will not be utilized. The Run:AI Elasticity library helps with expanding the job effectively (the sketch in the Introduction above illustrates the general idea).
  • Add a filler job:

    runai submit filler1 -i ubuntu --command sleep --args infinity -g 1 --interactive
    runai list
    

    The result:

    (screenshot: elasticity4.png)

    Discussion

    An interactive job (filler1) needs to be scheduled. The elastic job is therefore reduced to the single GPU it originally requested (see the checkpointing sketch at the end of this section for one way training code can tolerate such resizing).

  • Finally, delete the jobs:

    runai delete elastic1 filler1
    
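As the expansion example shows, an elastic job's allocation can change while it runs. One generic way for training code to tolerate being resized (an assumption for illustration only; this does not describe the Run:AI resizing mechanism or the Elasticity library) is to checkpoint periodically, so that progress survives a restart with a different GPU count. The path and checkpoint layout below are arbitrary choices.

    import os
    import torch

    CKPT_PATH = "/workspace/checkpoint.pt"    # hypothetical path on a persistent volume

    def unwrap(model):
        # nn.DataParallel wraps the real model in .module; save the unwrapped
        # weights so the checkpoint loads regardless of the GPU count on resume.
        return model.module if hasattr(model, "module") else model

    def save_checkpoint(model, optimizer, epoch):
        torch.save({"model": unwrap(model).state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, CKPT_PATH)

    def load_checkpoint(model, optimizer):
        if not os.path.exists(CKPT_PATH):
            return 0                           # no checkpoint yet: start from epoch 0
        state = torch.load(CKPT_PATH, map_location="cpu")
        unwrap(model).load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1              # resume from the following epoch
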

Shrinking

  • At the command-line run:

    runai submit filler2 -i ubuntu --command sleep --args infinity -g 1 --interactive
    runai submit elastic2 -i gcr.io/run-ai-demo/quickstart -g 2 --elastic
    
  • This would start a filler job on 1 GPU and attempt to start another unattended training job requesting 2 GPUs

  • Follow up on the job's progress by running:

    runai list
    

    The result:

    (screenshot: elasticity2.png)

    Discussion

    Since only a single GPU remains unallocated, the job would normally not start. The --elastic flag, however, tells the system to allocate the single available GPU instead (a sketch of how training code can compensate for the smaller allocation appears at the end of this section).

  • Delete the filler job and list the jobs again:

    runai delete filler2
    runai list
    

    The result:

    (screenshot: elasticity3.png)

    Discussion

    With the filler job gone, the elastic job has more room to expand, which it does.

  • Finally, delete the job:

    runai delete elastic2
    
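
When a shrunk job runs on fewer GPUs than it asked for, one common way for the training code to compensate is gradient accumulation, which keeps the effective global batch size constant. This is again a generic sketch, not part of Run:AI; the batch-size numbers are arbitrary illustrative values.

    import torch

    TARGET_GLOBAL_BATCH = 256   # illustrative target effective batch size
    PER_GPU_BATCH = 64          # illustrative per-GPU micro-batch size

    def accumulation_steps():
        gpus = max(torch.cuda.device_count(), 1)
        # Fewer GPUs (a shrunk job) means more micro-batches are accumulated
        # before each optimizer step, so each step still sees the same data volume.
        return max(TARGET_GLOBAL_BATCH // (PER_GPU_BATCH * gpus), 1)

    def train_epoch(model, optimizer, loss_fn, batches):
        steps = accumulation_steps()
        optimizer.zero_grad()
        for i, (x, y) in enumerate(batches):
            loss = loss_fn(model(x), y) / steps   # scale so accumulated gradients average out
            loss.backward()
            if (i + 1) % steps == 0:
                optimizer.step()
                optimizer.zero_grad()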
