Job Statuses
Introduction
The runai submit command and its sibling runai submit-dist mpi submit Run:ai Jobs for execution.
A Job has a status. Once a Job is submitted, it goes through several statuses before reaching an End State. Most of these statuses originate in the underlying Kubernetes infrastructure, but some are Run:ai-specific.
The purpose of this document is to explain these statuses as well as the lifecycle of a Job.
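For example, a simple training Job might be submitted with a command similar to the one below. The Job name, image, GPU count, and project are placeholders, and the exact flags may vary between CLI versions:
runai submit my-training-job -i my-registry/my-image -g 1 -p my-project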
Successful Flow
A regular training Job that has no errors and executes without preemption goes through the following statuses:
- Pending - the Job is waiting to be scheduled.
- ContainerCreating - the Job has been scheduled, the Job docker image is now downloading.
- Running - the Job is now executing.
- Succeeded - the Job has finished with exit code 0 (success).
The Job can be preempted, in which case it can go through other statuses:
- Terminating - the Job is now being preempted.
- Pending - the Job is waiting in queue again to receive resources.
An interactive Job, by definition, needs to be closed by the Researcher and will thus never reach the Succeeded status. Rather, it is moved by the Researcher to the Deleted status.
For a further explanation of the additional statuses, see the table below.
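The current status of each Job can be checked at any time from the CLI. For example, the following command lists all Jobs together with their status (the exact column layout may vary between CLI versions):
runai list jobs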
Error flow
A regular training Job may encounter an error inside the running process (a non-zero exit code), in which case the following happens:
The Job enters the Error status and then immediately tries to reschedule itself for another run attempt. The rescheduled run may land on another node in the system. After a specified number of retries, the Job enters the final Fail status.
An interactive Job enters the Error status and then moves immediately to CrashLoopBackOff while trying to reschedule itself. The reschedule attempts have no back-off limit and continue indefinitely.
Jobs may be submitted with an image that cannot be downloaded. There are special statuses for such Jobs; see the table below.
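To illustrate, submitting a Job with an image that cannot be pulled (the image name below is deliberately fictitious, and flags may vary between CLI versions) leaves the Job cycling between the ErrImagePull and ImagePullBackOff statuses until it is deleted:
runai submit bad-image-job -i example.com/no-such-image:latest -g 1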
Status Table
Below is a list of statuses. For each status, the table shows:
- Name
- End State - whether this status is the final status in the lifecycle of the Job
- Resource Allocation - whether the system allocates resources to the Job while it is in this status
- Description
- Color - the status color as shown in the Run:ai User Interface Job list
| Status | End State | Resource Allocation | Description | Color |
|--------|-----------|---------------------|-------------|-------|
| Running | | Yes | Job is running successfully. | |
| Terminating | | Yes | The pod is currently being evicted (e.g. due to an over-quota allocation; the reason is recorded once eviction finishes). A new pod will be created shortly. | |
| ContainerCreating | | Yes | The image is being pulled from the registry. | |
| Pending | | - | Job is waiting to be scheduled. Possible reasons: not enough resources; waiting in queue (over quota, etc.). | |
| Succeeded | Yes | - | An Unattended (training) Job has run and finished successfully. | |
| Deleted | Yes | - | Job has been deleted. | |
| TimedOut | Yes | - | An interactive Job has reached the timeout defined for the project. | |
| Preempted | Yes | - | An interactive preemptible Job has been evicted. | |
| ContainerCannotRun | Yes | - | The container has failed to start running. This is typically a problem within the docker image itself. | |
| Error | | Yes, for interactive only | The Job has returned a non-zero exit code. It is now waiting for another run attempt (retry). | |
| Fail | Yes | - | Job has failed after a number of retries (according to the "--backoffLimit" field) and will not try again. | |
| CrashLoopBackOff | | Yes | Interactive only: during back-off after Error, before a retry attempt to run the pod on the same node. | |
| ErrImagePull, ImagePullBackOff | | Yes | Failed to retrieve the docker image. | |
| Unknown | Yes | - | The Run:ai Scheduler was not running when the Job finished. | |
How to get more information
The system stores various events during the Job's lifecycle. These events can be helpful in diagnosing issues around Job scheduling. To view these events, run:
runai describe job <job-name>
Sometimes, useful information can be found in the logs emitted by the process running inside the container. For example, Jobs that have exited with a non-zero exit code may write the exit reason to this log. To see Job logs, run:
runai logs <job-name>
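Since most statuses originate from the underlying Kubernetes pods, the same information can also be cross-checked with kubectl. The sketch below assumes the Job's pods run in a namespace named runai-<project-name>; the namespace naming convention may differ in your installation:
kubectl get pods -n runai-<project-name>
kubectl describe pod <pod-name> -n runai-<project-name>
kubectl logs <pod-name> -n runai-<project-name>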
Distributed Training (mpi) Jobs
A distributed (mpi) Job that has no errors is slightly more complicated and has additional statuses associated with it.
- Distributed Jobs start with an "init container" which sets the stage for a distributed run.
- When the init container finishes, the main "launcher" container is created. The launcher is responsible for coordinating between the different workers.
- Workers run and do the actual work.
A successful flow of distributed training looks like the regular flow above, with the following additional statuses:
| Status | End State | Resource Allocation | Description | Color |
|--------|-----------|---------------------|-------------|-------|
| Init:<number A>/<number B> | | Yes | The pod has B Init Containers, and A have completed so far. | |
| PodInitializing | | Yes | The pod has finished executing the Init Containers. The system is creating the main 'launcher' container. | |
| Init:Error | | | An Init Container has failed to execute. | |
| Init:CrashLoopBackOff | | | An Init Container has repeatedly failed to execute. | |
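For reference, a distributed training Job might be submitted with a command along the lines of the one below. The Job name, number of workers, and image are placeholders, and the exact flags and syntax may vary between CLI versions:
runai submit-dist mpi dist-job --workers 2 -g 1 -i my-registry/my-mpi-image
After submission, the launcher typically moves through Init:0/1 and PodInitializing before reaching Running, as described in the table above.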