
Walk-through: Over-Quota and Bin Packing

Goals

The goal of this walk-through is to explain the concepts of over-quota and bin-packing (consolidation) and how they help in maximizing cluster utilization:

  • Show the simplicity of resource provisioning, and how resources are abstracted from users.
  • Show how the system eliminates compute bottlenecks by allowing teams/users to go over their resource quota if there are free GPUs in the cluster.

Setup and configuration:

  • 4 GPUs on 2 machines with 2 GPUs each
  • 2 Projects: team-a and team-b with 2 allocated GPUs each
  • Run:AI canonical image gcr.io/run-ai-demo/quickstart

Part 1: Over-quota

Run the following commands:

runai submit a2 -i gcr.io/run-ai-demo/quickstart -g 2 -p team-a
runai submit a1 -i gcr.io/run-ai-demo/quickstart -g 1 -p team-a
runai submit b1 -i gcr.io/run-ai-demo/quickstart -g 1 -p team-b

System status after run: overquota1

Discussion

  • team-a has 3 GPUs allocated, which is 1 GPU over its quota of 2.
  • The system allows this over-quota as long as there are available resources.
  • The system is at full capacity with all GPUs utilized.
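
The arithmetic behind these three points can be checked with a small Python sketch. It is a toy model of the state after Part 1, not Run:AI scheduler code; the names QUOTA, JOBS, usage_by_project and over_quota are made up for illustration, and the numbers come from the setup and commands above.

```python
# Toy model of the cluster state after Part 1 (illustrative only).
QUOTA = {"team-a": 2, "team-b": 2}   # GPUs allocated to each project
CLUSTER_GPUS = 4                     # 2 machines with 2 GPUs each

# (job name, project, requested GPUs), as submitted in Part 1
JOBS = [("a2", "team-a", 2), ("a1", "team-a", 1), ("b1", "team-b", 1)]


def usage_by_project(jobs):
    """Sum the GPUs each project currently holds."""
    used = {}
    for _name, project, gpus in jobs:
        used[project] = used.get(project, 0) + gpus
    return used


def over_quota(jobs, quota):
    """Return how many GPUs each project holds beyond its quota (0 if none)."""
    used = usage_by_project(jobs)
    return {p: max(0, used.get(p, 0) - q) for p, q in quota.items()}


used = usage_by_project(JOBS)
free = CLUSTER_GPUS - sum(used.values())
print(used)                     # {'team-a': 3, 'team-b': 1}
print(over_quota(JOBS, QUOTA))  # {'team-a': 1, 'team-b': 0}
print(free)                     # 0
```

team-a is over quota by exactly the GPU that team-b is not using, which is why the next job from team-b forces a preemption in Part 2.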

Part 2: Basic Fairness via Preemption

Run the following command:

runai submit b2 -i gcr.io/run-ai-demo/quickstart -g 1 -p team-b

System status after run: overquota2

Discussion

  • team-a can no longer remain over quota. One of its jobs must therefore be preempted (moved out) to allow team-b to grow to its quota.
  • The Run:AI scheduler chooses to preempt job a1.
  • It is important that unattended jobs save checkpoints. This ensures that whenever job a1 resumes, it does so from where it left off.
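
The checkpointing advice in the last point can be sketched as follows. This is a minimal, framework-free Python illustration and not part of the quickstart image: the function names, the JSON file format and the stop_after parameter (which simulates a preemption) are all assumptions. In a real job the checkpoint path would need to sit on storage that outlives the job's container.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_checkpoint(path, step):
    """Write to a temp file and rename, so an interrupted write leaves the old checkpoint intact."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path, total_steps, stop_after=None):
    """Continue from the last checkpoint; stop_after simulates a preemption."""
    step = load_checkpoint(path)
    while step < total_steps:
        step += 1  # stand-in for one unit of real training work
        save_checkpoint(path, step)
        if stop_after is not None and step >= stop_after:
            break
    return step


ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first = train(ckpt, total_steps=10, stop_after=4)  # "preempted" after step 4
second = train(ckpt, total_steps=10)               # resumes at step 5
print(first, second)                               # 4 10
```

The second call does not repeat steps 1 to 4, which is the behavior a preempted job such as a1 needs when the scheduler restarts it.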

Part 3: Bin Packing

Run the following command:

runai delete a2 -p team-a

Deleting a2 frees 2 GPUs, so the preempted job a1 now starts running again.

Run: runai list -A

You have two jobs running on the first node and one job running alone on the second node.

Choose one of the two jobs on the full node and delete it:

runai delete <job-name> -p <project>

The status now is: overquota3

Now, run a 2 GPU job:

runai submit a2 -i gcr.io/run-ai-demo/quickstart -g 2 -p team-a

The status now is: overquota4

Discussion

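Before the new job was submitted, each node had one free GPU, so no single node could host a 2-GPU job. Note that job a1 has been preempted and then restarted on the second node, in order to clear space for the new a2 job. This is bin-packing, or consolidation.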
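
The consolidation step can be illustrated with a toy Python model. It is a sketch under stated assumptions, not the Run:AI scheduler's actual algorithm: the node names, the starting placement (a1 alone on node-1, b1 alone on node-2) and the functions free_gpus, find_node and submit are hypothetical, and the model moves at most one job.

```python
GPUS_PER_NODE = 2


def free_gpus(node):
    """GPUs still unused on a node, given its {job: gpus} mapping."""
    return GPUS_PER_NODE - sum(node.values())


def find_node(nodes, gpus, exclude=None):
    """Best fit: the node with the fewest free GPUs that still fits the request."""
    candidates = [n for n in nodes
                  if n != exclude and free_gpus(nodes[n]) >= gpus]
    return min(candidates, key=lambda n: free_gpus(nodes[n]), default=None)


def submit(nodes, job, gpus):
    """Place a job; if no node fits, move one running job to make room.

    Returns the list of (job, source, target) moves that were needed.
    """
    moves = []
    target = find_node(nodes, gpus)
    if target is None:
        for src in nodes:
            for victim, size in list(nodes[src].items()):
                dst = find_node(nodes, size, exclude=src)
                if dst is not None and free_gpus(nodes[src]) + size >= gpus:
                    del nodes[src][victim]      # preempt the running job ...
                    nodes[dst][victim] = size   # ... and restart it elsewhere
                    moves.append((victim, src, dst))
                    target = src
                    break
            if target is not None:
                break
    if target is None:
        raise RuntimeError("not enough free GPUs for " + job)
    nodes[target][job] = gpus
    return moves


# Assumed state before the final submit: one 1-GPU job on each node.
nodes = {"node-1": {"a1": 1}, "node-2": {"b1": 1}}
moves = submit(nodes, "a2", 2)
print(moves)  # [('a1', 'node-1', 'node-2')]
print(nodes)  # {'node-1': {'a2': 2}, 'node-2': {'b1': 1, 'a1': 1}}
```

Two GPUs are free in total but split across nodes, so the model packs a1 next to b1 and gives the emptied node to a2, mirroring the outcome described above.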


Last update: August 4, 2020