Quickstart: Launch Distributed Training Workloads
Distributed Training is the ability to split the training of a model among multiple processors. Each processor is called a worker node. Worker nodes work in parallel to speed up model training. Distributed Training should not be confused with multi-GPU training. Multi-GPU training is the allocation of more than one GPU to a workload that runs in a single container.
Getting Distributed Training to work is more complex than multi-GPU training because it requires synchronizing data and timing between the different workers. However, it is often necessary when multi-GPU training is no longer sufficient, typically because you need more GPUs than a single node provides. Several Deep Learning frameworks support Distributed Training. Horovod is a good example.
Run:ai provides the ability to run, manage, and view Distributed Training workloads. The following is a Quickstart document for such a scenario.
To complete this Quickstart you must have:
- Run:ai software installed on your Kubernetes cluster. See: Installing Run:ai on a Kubernetes Cluster
- The Kubeflow MPI Operator, installed during the Run:ai installation as specified in the installation instructions
- Run:ai CLI installed on your machine. See: Installing the Run:ai Command-Line Interface
Step by Step Walkthrough
- Log in to the Projects area of the Run:ai user interface.
- Add a Project named "team-a".
- Allocate 2 GPUs to the Project.
Run a Distributed Training Workload
- At the command-line run:
- We named the Job dist
- The Job is assigned to team-a
- There will be two worker pods (--workers=2), each allocated a single GPU (-g 1)
- The Job is based on a sample docker image
- The image contains a startup script that runs a deep learning Horovod-based workload.
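A submission matching the description above might look like the following sketch. The `--workers` and `-g` options come from this Quickstart; the `-p` (Project) and `-i` (image) flags are assumed from the Run:ai CLI, the image name is a hypothetical placeholder rather than the sample image this Quickstart uses, and `submit_dist` is only an illustrative helper.

```shell
# Hypothetical placeholder: replace with the sample image for this Quickstart.
IMAGE="${IMAGE:-registry.example.com/quickstart-distributed:latest}"

# Illustrative helper: submits an MPI Job named "dist" to Project team-a
# with two worker pods and one GPU per worker.
submit_dist() {
  runai submit-dist mpi dist \
    --workers=2 \
    -g 1 \
    -p team-a \
    -i "$IMAGE"
}
```

Calling `submit_dist` from a shell where the Run:ai CLI is installed and logged in would submit the Job.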
Follow up on the Job's status by running:
runai list jobs
The Run:ai scheduler ensures that all pods can run together. You can see the list of workers as well as the main "launcher" pod by running:
runai describe job dist
You will see two worker pods, their status, and the node on which each one runs.
To see the merged logs of all pods, run:
runai logs dist
Finally, you can delete the distributed training workload by running:
runai delete job dist
Run an Interactive Distributed Workload
It is also possible to run a distributed training Job as "interactive". This is useful if you want to test your distributed training Job before committing to a long, unattended training session. To run such a session use:
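As a sketch only: an interactive submission could reuse the same options with the Job name `dist-int` (the name used by the `runai bash` command in this section). The `--interactive` flag and the `sleep infinity` command that keeps the launcher idle are assumptions about the Run:ai CLI, and the image name is a hypothetical placeholder; check `runai submit-dist mpi --help` for the options your version supports.

```shell
# Hypothetical placeholder: replace with the sample image for this Quickstart.
IMAGE="${IMAGE:-registry.example.com/quickstart-distributed:latest}"

# Illustrative helper: submits an interactive MPI Job named "dist-int".
# The launcher only sleeps, so the workload can be started by hand later.
submit_dist_interactive() {
  runai submit-dist mpi dist-int \
    --workers=2 \
    -g 1 \
    -p team-a \
    -i "$IMAGE" \
    --interactive \
    -- sh -c "sleep infinity"
}
```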
When the workers are running, run:
runai bash dist-int
This provides shell access to the launcher process, from which you can run your distributed workload. For Horovod versions earlier than 0.17.0, run:
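A minimal sketch of such a launch, assuming `horovodrun` is available in the launcher and that Run:ai has set `RUNAI_MPI_NUM_WORKERS` there. The training command `python train.py` is a hypothetical stand-in for whatever script your image contains, and `launch_horovod` is an illustrative helper.

```shell
# Hypothetical training entry point: replace with the script in your image.
TRAIN_CMD="${TRAIN_CMD:-python train.py}"

# Illustrative helper: starts one Horovod process per worker.
launch_horovod() {
  # TRAIN_CMD is left unquoted on purpose so it splits into command and arguments.
  horovodrun -np "$RUNAI_MPI_NUM_WORKERS" $TRAIN_CMD
}
```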
For Horovod version 0.17.0 or later, add the -hostfile flag as follows:
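A sketch of the launch with a hostfile. The path `/etc/mpi/hostfile` is an assumption about where the MPI Operator writes the hostfile in the launcher pod, so verify it in your environment; `python train.py` is again a hypothetical training command, and `launch_horovod_hostfile` is an illustrative helper.

```shell
# Hypothetical training entry point: replace with the script in your image.
TRAIN_CMD="${TRAIN_CMD:-python train.py}"
# Assumed hostfile location: confirm the actual path inside your launcher pod.
HOSTFILE="${HOSTFILE:-/etc/mpi/hostfile}"

# Illustrative helper: starts one Horovod process per worker using the hostfile.
launch_horovod_hostfile() {
  # TRAIN_CMD is left unquoted on purpose so it splits into command and arguments.
  horovodrun -np "$RUNAI_MPI_NUM_WORKERS" -hostfile "$HOSTFILE" $TRAIN_CMD
}
```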
The environment variable RUNAI_MPI_NUM_WORKERS is passed by Run:ai and contains the number of workers provided to the runai submit-dist mpi command (in the above example the value is 2).
- The source code of the image used in this Quickstart document is available on GitHub
- For a full list of the submit-dist mpi options, see runai submit-dist mpi