Submit a Distributed Training (MPI) Run:AI Job for execution.
To use distributed training you need to have installed the Kubeflow MPI Operator as specified here
runai submit-mpi [--always-pull-image] [--attach] [--backoff-limit int] [--command] [--cpu double] [--cpu-limit double] [--create-home-dir] [--environment stringArray | -e stringArray] [--git-sync string] [--gpu double | -g double] [--gpu-memory string] [--host-ipc] [--host-network] [--image string | -i string] [--interactive] [--job-name-prefix string] [--large-shm] [--local-image] [--memory string] [--memory-limit string] [--name string] [--node-type string] [--prevent-privilege-escalation] [--processes int] [--pvc [StorageClassName]:Size:ContainerMountPath:[ro]] [--run-as-user] [--stdin] [--template string] [--tty | -t] [--volume stringArray | -v stringArray] [--nfs-server string] [--working-dir] [--loglevel string] [--project string | -p string] [--help | -h] -- [COMMAND] [ARGS...] [options]
- Options with a value type of stringArray mean that you can add multiple values. You can either separate values with a comma or add the flag twice.
start an unattended mpi training Job of name dist1, based on Project team-a using a quickstart-distributed image:
runai submit-mpi --name dist1 --processes=2 -g 1 \ -i gcr.io/run-ai-demo/quickstart-distributed
(see: distributed training Quickstart).
Aliases and Shortcuts¶
The name of the Job.
Mark this Job as Interactive. Interactive Jobs are not terminated automatically by the system.
Provide the name of a template. A template can provide default and mandatory values.
The prefix to use to automatically generate a Job name with an incremental index. When a Job name is omitted Run:AI will generate a Job name. The optional
--job-name-prefix flagcreates Job names with the provided prefix.
--always-pull-image stringArray (deprecated)
Deprecated. Please use
image-pull-policy=alwaysinstead. When starting a container, always pull the image from the registry, even if the image is cached on the running node. This is useful when you are re-saving updates to the image using the same tag but may incur a penalty of performance degradation on Job start.
Default is false. If set to true, wait for the Pod to start running. When the pod starts running, attach to the Pod. The flag is equivalent to the command runai attach.
The --attach flag also sets
If set, overrides the image's entry point with the command supplied after '--'
--command script.py --args 10000
-e stringArray | --environment stringArray
Define environment variables to be set in the container. To set multiple values add the flag multiple times (
-e BATCH_SIZE=50 -e LEARNING_RATE=0.2).
Clone a git repository into the container running the Job. The parameter should follow the syntax:
Note that source=REPOSITORY is the only mandatory field
--image string | -i string
Image to use when creating the container for this Job
Pulling policy of the image When starting a container. Options are:
always(default): force image pulling to check whether local image already exists. If the image already exists locally and has the same digest, then the image will not be downloaded.
ifNotPresent: the image is pulled only if it is not already present locally.
never: the image is assumed to exist locally. No attempt is made to pull the image.
For more information see Kubernetes documentation.
Deprecated. Please use
image-pull-policy=neverinstead. Use a local image for this Job. A local image is an image that exists on all local servers of the Kubernetes Cluster.
Keep stdin open for the container(s) in the pod, even if nothing is attached.
Allocate a pseudo-TTY
Starts the container with the specified directory as the current directory.
CPU units to allocate for the Job (0.5, 1, .etc). The Job will receive at least this amount of CPU. Note that the Job will not be scheduled unless the system can guarantee this amount of CPUs to the Job.
Limitations on the number of CPUs consumed by the Job (0.5, 1, .etc). The system guarantees that this Job will not be able to consume more than this amount of CPUs.
--gpu double | -g double
Number of GPUs to allocate for the Job. The default is no allocated GPUs. the GPU value can be an integer or a fraction between 0 and 1.
GPU memory to allocate for this Job (1G, 20M, .etc). The Job will receive this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of GPU memory to the Job.
Mount a large /dev/shm device. An shm is a shared file system mounted on RAM.
CPU memory to allocate for this Job (1G, 20M, .etc). The Job will receive at least this amount of memory. Note that the Job will not be scheduled unless the system can guarantee this amount of memory to the Job.
CPU memory to allocate for this Job (1G, 20M, .etc). The system guarantees that this Job will not be able to consume more than this amount of memory. The Job will receive an error when trying to allocate more memory than this limit.
Mount a persistent volume claim into a container.
The 2 syntax types of this command are mutually exclusive. You can either use the first or second form, but not a mixture of both.
Storage_Class_Name is a storage class name that can be obtained by running
kubectl get storageclasses.storage.k8s.io. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class.
Size is the volume size you want to allocate. See Kubernetes documentation for how to specify volume sizes
Container_Mount_Path. A path internal to the container where the storage will be mounted
Pvc_Name. The name of a pre-existing Persistent Volume Claim to mount into the container
--pvc :3Gi:/tmp/john:ro- Allocate
3GBfrom the default Storage class. Mount it to
--pvc my-storage:3Gi:/tmp/john:ro- Allocate
my-storagestorage class. Mount it to /tmp/john as read-only
--pvc :3Gi:/tmp/john- Allocate
3GBfrom the default storage class. Mount it to
--pvc my-pvc:/tmp/john- Use a Persistent Volume Claim named
my-pvc. Mount it to
--pvc my-pvc-2:/tmp/john:ro- Use a Persistent Volume Claim named
my-pvc-2. Mount it to
Volumes to mount into the container.
-v /raid/public/john/data:/root/data:roMount /root/data to local path /raid/public/john/data for read-only access.
-v /public/data:/root/data::nfs.example.comMount /root/data to NFS path /public/data on NFS server nfs.example.com for read-write access.
Use this flag to specify a default NFS host for --volume flag. Alternatively, you can specify NFS host for each volume individually (see --volume for details).
Use the host's ipc namespace. Controls whether the pod containers can share the host IPC namespace. IPC (POSIX/SysV IPC) namespace provides separation of named shared memory segments, semaphores, and message queues. Shared memory segments are used to accelerate inter-process communication at memory speed, rather than through pipes or the network stack.
For further information see docker run reference documentation.
Use the host's network stack inside the container. For further information see docker run referencedocumentation.
The number of times the Job will be retried before failing. The default is 6. This flag will only work with training workloads (when the
--interactiveflag is not specified).
Number of distributed training processes. The default is 1.
Create a temporary home directory for the user in the container. Data saved in this directory will not be saved when the container exits. The flag is set by default to true when the --run-as-user flag is used, and false if not. For more information see non root containers.
Prevent the Job’s container and all launched processes from gaining additional privileges after the Job starts. Default is
false. For more information see non root containers.
Run in the context of the current user running the Run:AI command rather than the root user. While the default container user is root (same as in Docker), this command allows you to submit a Job running under your Linux user. This would manifest itself in access to operating system resources, in the owner of new folders created under shared directories, etc. Alternatively, if your cluster is connected to Run:AI via SAML, you can map the container to use the Linux UID/GID which is stored in the organization's directory. For more information see non root containers.
Allows defining specific Nodes (machines) or a group of Nodes on which the workload will run. To use this feature your Administrator will need to label nodes as explained here: Limit a Workload to a Specific Node Group. This flag can be used in conjunction with Project-based affinity. In this case, the flag is used to refine the list of allowable node groups set in the Project. For more information see: Working with Projects.
Set the logging level. One of: debug | info | warn | error (default "info")
--project | -p (string)
Specify the Project to which the command applies. Run:AI Projects are used by the scheduler to calculate resource eligibility. By default, commands apply to the default Project. To change the default Project use
runai config project <project-name>.
--help | -h
Show help text.
The command will attempt to submit an mpi Job. You can follow up on the Job by running
runai list jobs or
runai describe job <job-name>.
- See Quickstart document Running Distributed Training.