Quickstart: Over-Quota and Bin Packing

Goals

The goal of this Quickstart is to explain the concepts of over-quota and bin-packing (consolidation) and how they help in maximizing cluster utilization:

Show the simplicity of resource provisioning, and how resources are abstracted from users.
Show how the system eliminates compute bottlenecks by allowing teams/users to go over their resource quota if there are free GPUs in the cluster.

Setup and configuration:

To complete this Quickstart, you need the following. Ask your Platform Administrator for any item you do not have:

A cluster with 4 GPUs on 2 machines, with 2 GPUs each.
Researcher access to two Projects named "team-a" and "team-b".
A quota of exactly 2 GPUs assigned to each Project.
The URL of the Run:ai Console, for example https://acme.run.ai.
The Run:ai CLI installed on your machine. There are two CLI variants:
- The older V1 CLI. See the V1 CLI installation instructions.
- The newer V2 CLI, supported on clusters of version 2.18 and up. See the V2 CLI installation instructions.

Login

Run the following command and enter your credentials:

runai login

Part 1: Over-quota

Open a terminal and run the following commands:

CLI V2:

runai training submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
runai training submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
runai training submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b

CLI V1 (Deprecated):

runai submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
runai submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
runai submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b

System status after run: Jobs a2 (2 GPUs) and a1 (1 GPU) are running in team-a, and Job b1 (1 GPU) is running in team-b.

Discussion

team-a has 3 GPUs allocated, which is 1 GPU over its quota.
The system allows this over-quota as long as there are free GPUs in the cluster.
The system is at full capacity with all GPUs utilized.
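The accounting behind this result can be sketched in a few lines of Python. This is only an illustrative model, not the Run:ai scheduler: the `Project` class and `try_submit` function are hypothetical names, and the sketch admits a job whenever the cluster has enough free GPUs, without modeling preemption or node placement.

```python
from dataclasses import dataclass

TOTAL_GPUS = 4  # cluster size from the setup section


@dataclass
class Project:
    """Hypothetical model of a Project; not a Run:ai API object."""
    name: str
    quota: int          # GPUs assigned to the project
    allocated: int = 0  # GPUs currently in use

    @property
    def over_quota(self) -> int:
        """GPUs held beyond the quota (0 when within quota)."""
        return max(0, self.allocated - self.quota)


def try_submit(project: Project, gpus: int, projects: list) -> bool:
    """Admit a job if the cluster has enough free GPUs, even past the quota."""
    free = TOTAL_GPUS - sum(p.allocated for p in projects)
    if gpus > free:
        return False  # simplified: this sketch does not model preemption
    project.allocated += gpus
    return True


team_a = Project("team-a", quota=2)
team_b = Project("team-b", quota=2)
projects = [team_a, team_b]

# Same order as the three submit commands in this part.
results = [
    try_submit(team_a, 2, projects),  # a2
    try_submit(team_a, 1, projects),  # a1, takes team-a over its quota
    try_submit(team_b, 1, projects),  # b1
]
free_gpus = TOTAL_GPUS - sum(p.allocated for p in projects)
print(results, team_a.over_quota, free_gpus)  # [True, True, True] 1 0
```

All three jobs are admitted, team-a ends up 1 GPU over its quota, and no GPU is left idle.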

Part 2: Basic Fairness via Preemption

Run the following command:

CLI V2:

runai training submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b

CLI V1 (Deprecated):

runai submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b

System status after run: Jobs a2, b1, and b2 are running, and Job a1 has been preempted and is waiting to be rescheduled.

Discussion

team-a can no longer remain over quota, because team-b is now claiming its full quota of 2 GPUs. One of team-a's Jobs must therefore be preempted (moved out) to allow team-b to grow.
The Run:ai scheduler chooses to preempt Job a1.
It is important that unattended Jobs save checkpoints. This ensures that whenever Job a1 resumes, it does so from where it left off.
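A minimal sketch of what saving checkpoints can look like inside a training script is shown below. Everything in it is an assumption made for illustration: the file name `checkpoint.json`, the helper names, and the `stop_after` parameter, which simulates a preemption. A real Job would save model and optimizer state to storage that survives rescheduling.

```python
import json
import os
import tempfile
from typing import Optional


def load_step(path: str) -> int:
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path: str, step: int) -> None:
    """Write to a temp file, then rename, so an interrupted write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def train(path: str, total_steps: int, stop_after: Optional[int] = None) -> int:
    """Resume from the last checkpoint; stop_after simulates a preemption."""
    step = load_step(path)
    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated preemption
        step += 1  # placeholder for one unit of real training work
        save_step(path, step)
        done_this_run += 1
    return step


ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")  # hypothetical location
first = train(ckpt, total_steps=10, stop_after=4)  # preempted after 4 steps
second = train(ckpt, total_steps=10)               # resumes from step 4
print(first, second)  # 4 10
```

The second run does not repeat the first four steps, which is the behavior a preempted Job needs when the scheduler restarts it.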

Part 3: Bin Packing

Run the following command:

CLI V2:

runai training delete a2 -p team-a

CLI V1 (Deprecated):

runai delete job a2 -p team-a

Deleting a2 frees 2 GPUs, so Job a1 now starts running again.

Run:

CLI V2:

runai training list -A

CLI V1 (Deprecated):

runai list jobs -A

You have two Jobs running on the first node and one Job running alone on the second node.

Choose one of the two Jobs from the full node and delete it:

CLI V2:

runai training delete <job-name> -p <project>

CLI V1 (Deprecated):

runai delete job <job-name> -p <project>

The status now is:

Now, run a 2 GPU Job:

CLI V2:

runai training submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a

CLI V1 (Deprecated):

runai submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a

The status now is:

Discussion

Note that Job a1 has been preempted and then restarted on the second node to clear space for the new a2 Job, which needs both GPUs of a single node. This is bin packing, also called consolidation.
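The consolidation step can be illustrated with a small placement sketch. It is a simplified model under stated assumptions, not the Run:ai scheduler's algorithm: the function `place_with_consolidation`, the node names, and the starting placement (a1 on one node, b1 on the other) are hypothetical, and the sketch only considers emptying one whole node at a time.

```python
from typing import Dict, List, Optional, Tuple

Nodes = Dict[str, Dict[str, int]]  # node name -> {job name: GPUs used}
Move = Tuple[str, str, str]        # (job, source node, destination node)


def place_with_consolidation(nodes: Nodes, job: str, gpus: int,
                             capacity: int = 2) -> Optional[List[Move]]:
    """Place a job, moving resident jobs to other nodes if that frees enough room.

    Returns the list of moves performed, or None if the job cannot be placed.
    """
    def free(n: str) -> int:
        return capacity - sum(nodes[n].values())

    # 1. Direct fit: some node already has enough free GPUs.
    for n in nodes:
        if free(n) >= gpus:
            nodes[n][job] = gpus
            return []

    # 2. Consolidation: try to empty one node by packing its jobs elsewhere.
    if gpus > capacity:
        return None
    for target in nodes:
        spare = {n: free(n) for n in nodes if n != target}
        moves: List[Move] = []
        for resident, g in nodes[target].items():
            dst = next((n for n in spare if spare[n] >= g), None)
            if dst is None:
                break  # this resident job has nowhere to go
            spare[dst] -= g
            moves.append((resident, target, dst))
        else:
            # Every resident job fits elsewhere: restart them there, then place the new job.
            for resident, src, dst in moves:
                nodes[dst][resident] = nodes[src].pop(resident)
            nodes[target][job] = gpus
            return moves
    return None


# Assumed state before submitting a2: one 1-GPU job on each 2-GPU node.
nodes = {"node-1": {"a1": 1}, "node-2": {"b1": 1}}
moves = place_with_consolidation(nodes, "a2", 2)
print(moves)  # [('a1', 'node-1', 'node-2')]
print(nodes)  # {'node-1': {'a2': 2}, 'node-2': {'b1': 1, 'a1': 1}}
```

Neither node has 2 free GPUs at the start, yet the cluster has 2 free GPUs in total. Moving a1 next to the other 1-GPU job removes the fragmentation and lets a2 run.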
