Researcher Library: Dynamically Stretch or Compress Workload's GPU Quota¶
The Run:AI Researcher Library is a Python library that you can add to your deep learning Python code. The library contains an elasticity module which allows training workloads to shrink or expand based on the cluster's availability.
Shrinking a Workload¶
Shrinking a training job allows your workload to run on fewer GPUs than the researcher code was originally written for. This is useful for maximizing the utilization of the cluster as a whole. It also lets a researcher's job run when fewer GPUs are free, albeit more slowly than intended.
Shrinking a training job uses an algorithm called Gradient Accumulation. For more information about the algorithm see: https://towardsdatascience.com/what-is-gradient-accumulation-in-deep-learning-ec034122cfa
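To make the idea concrete, here is a minimal, framework-free sketch of gradient accumulation. It is an illustration only, not the Run:AI implementation: the toy model (a line `y = w*x + b` with a mean squared error loss) and all function names are assumptions chosen for the example. It shows that summing gradients over several small micro-batches and averaging once gives the same result as one large batch.

```python
def summed_gradient(w, b, xs, ys):
    """Gradient of the squared error for y = w*x + b, summed over the samples."""
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
    return grad_w, grad_b


def full_batch_gradient(w, b, xs, ys):
    """Mean gradient computed in a single pass over the whole batch."""
    grad_w, grad_b = summed_gradient(w, b, xs, ys)
    return grad_w / len(xs), grad_b / len(xs)


def accumulated_gradient(w, b, xs, ys, micro_batch_size):
    """Mean gradient computed by accumulating over small micro-batches.

    Only one micro-batch needs to be processed at a time, which is what
    lets a large batch run on less GPU memory or on fewer GPUs.
    """
    acc_w = 0.0
    acc_b = 0.0
    for start in range(0, len(xs), micro_batch_size):
        end = start + micro_batch_size
        grad_w, grad_b = summed_gradient(w, b, xs[start:end], ys[start:end])
        acc_w += grad_w
        acc_b += grad_b
    # A single weight update would be applied here, after all micro-batches.
    return acc_w / len(xs), acc_b / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]

full = full_batch_gradient(0.5, 0.0, xs, ys)
accumulated = accumulated_gradient(0.5, 0.0, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

Because the weights are updated only once per full batch, the training result is mathematically the same; the cost is that the micro-batches are processed one after another, so each step takes longer.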
Expanding a Workload¶
Expanding a training job allows your workload to run on more GPUs than the researcher code was originally written for. This is useful for maximizing the utilization of the cluster as a whole, and it lets a researcher's job run faster when idle GPUs exist in the cluster. The extra GPUs are automatically reclaimed if they are needed by other, higher-priority jobs.
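The trade-off between GPU count and speed can be sketched with a little arithmetic. The helper below is hypothetical and is not part of the Run:AI library; the names `plan_batches`, `global_batch_size`, `max_gpu_batch_size` and `num_gpus` are assumptions made for illustration. It keeps the overall batch size fixed and works out how large each GPU's batch could be, and how many accumulation steps would be needed, as the number of GPUs changes.

```python
import math


def plan_batches(global_batch_size, max_gpu_batch_size, num_gpus):
    """Hypothetical helper: split a fixed global batch across num_gpus GPUs.

    Returns (per_gpu_batch_size, accumulation_steps) such that
    per_gpu_batch_size * accumulation_steps * num_gpus == global_batch_size
    and per_gpu_batch_size <= max_gpu_batch_size.
    """
    if min(global_batch_size, max_gpu_batch_size, num_gpus) < 1:
        raise ValueError("all arguments must be positive")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must divide evenly across GPUs")

    samples_per_gpu = global_batch_size // num_gpus
    # Fewest steps that keep each micro-batch within GPU memory.
    steps = math.ceil(samples_per_gpu / max_gpu_batch_size)
    # Increase the step count until the samples split evenly.
    while samples_per_gpu % steps != 0:
        steps += 1
    return samples_per_gpu // steps, steps


# Same global batch of 256 samples, at most 64 samples fit on one GPU.
for gpus in (1, 2, 4, 8):
    print(gpus, plan_batches(256, 64, gpus))
```

With fewer GPUs the job needs more accumulation steps per update (shrinking); with more GPUs it needs fewer, down to a single step (expanding).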
Python Deep-Learning Code¶
In your command line run:
pip install runai
In your Python code add:
Initialize Elasticity by calling:
Create a Keras model:
model = runai.elastic.keras.models.Model(model)
model.fit(
    x_train,
    y_train,
    batch_size=runai.elastic.batch_size,
    epochs=100,
    validation_data=(x_test, y_test),
    shuffle=False,
    verbose=runai.elastic.master,
    callbacks=[StepTimeReporter()] if runai.elastic.master else [],
)
Running a Training Workload¶
Run the training workload by using the "elastic" flag:
- When launching the job with the runai submit command use --elastic
- When launching a job via YAML code, use the label "elastic" with the value "true"
- Elasticity currently works with Keras-based deep learning code only
- Any training job with Run:AI is subject to pause/resume episodes. Elasticity may make these episodes more frequent, so it is even more important to make your code resilient. Save checkpoints in your code and allow it to resume from the latest checkpoint rather than start from the beginning.
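The checkpoint advice above can be sketched as follows. This is a minimal, framework-agnostic illustration using only the Python standard library; the file naming scheme, the JSON state and the `train` function are assumptions made for the example, and the "training" is a stand-in counter rather than a real model.

```python
import json
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.json$")


def latest_checkpoint(directory):
    """Return (epoch, path) of the newest checkpoint, or (0, None) if none exist."""
    best_epoch, best_path = 0, None
    for name in os.listdir(directory):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(directory, name)
    return best_epoch, best_path


def save_checkpoint(directory, epoch, state):
    """Write to a temporary file first so an interrupted write leaves no partial checkpoint."""
    path = os.path.join(directory, f"checkpoint-{epoch}.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)
    return path


def train(directory, total_epochs, stop_after=None):
    """Resume from the latest checkpoint, if any, and train up to total_epochs."""
    start_epoch, path = latest_checkpoint(directory)
    state = {"epochs_done": 0}
    if path is not None:
        with open(path) as f:
            state = json.load(f)
    for epoch in range(start_epoch + 1, total_epochs + 1):
        state["epochs_done"] += 1  # stand-in for one epoch of real training
        save_checkpoint(directory, epoch, state)
        if stop_after is not None and epoch == stop_after:
            return epoch  # simulate the job being paused
    return total_epochs


with tempfile.TemporaryDirectory() as workdir:
    first_run = train(workdir, total_epochs=5, stop_after=2)  # paused after epoch 2
    resumed_from, _ = latest_checkpoint(workdir)
    second_run = train(workdir, total_epochs=5)  # continues from epoch 3
    final_epoch, final_path = latest_checkpoint(workdir)
    with open(final_path) as f:
        final_state = json.load(f)

print(first_run, resumed_from, second_run, final_state)
```

In Keras code, a similar pattern is usually built from the `ModelCheckpoint` callback, which saves the model during training, and the `initial_epoch` argument of `model.fit`, which lets a resumed run continue the epoch count instead of restarting it.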
For additional documentation as well as Python examples, see our GitHub repository.