What are Policies?¶
Policies allow administrators to impose restrictions and set default values for Researcher Workloads. For example:
- Restrict researchers from requesting more than 2 GPUs, or less than 1GB of memory for an interactive workload.
- Set the default memory of each training job to 1GB, or mount a default volume to be used by any submitted Workload.
Policies are stored as Kubernetes custom resources.
Policies are specific to a Workload type, so there are several kinds of Policies:
|Workload Type|Kubernetes Workload Name|Kubernetes Policy Name|
|---|---|---|
|Interactive| | |
|Training| | |
|Distributed Training| | |
|Inference| | |
A Policy can be created per Run:ai Project (Kubernetes namespace). Additionally, a Policy resource can be created in the `runai` namespace. This special Policy takes effect when there is no project-specific Policy for the relevant workload kind.
When researchers create a new interactive workload or workspace, they see a list of available node pools and their priority. Priority is set by dragging and dropping the node pools into the desired order. When the node pool priority list is locked by an administrator policy, the Researcher cannot edit it, even if the workspace is created from a template or copied from another workspace.
Creating a Policy¶
Creating your First Policy¶
To create a sample `InteractivePolicy`, prepare a file (e.g. `policy.yaml`) containing the following YAML:
- Set the Project namespace here.
- GPU values are quoted as they can contain non-integer values.
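A minimal sketch of such a file, consistent with the two notes above, could look like this. The policy name, the project namespace `runai-team-a`, the field name `gpu`, and the concrete limits are assumptions for illustration, and the `rules`/`value` layout mirrors the ports example later on this page. Check the exact schema on your cluster with kubectl explain.

```yaml
# Hypothetical sketch: names, namespace, field name, and limits are assumptions.
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy1   # assumed policy name
  namespace: runai-team-a     # set the Project namespace here
spec:
  gpu:                        # assumed field name for GPU allocation
    rules:
      min: "1"                # quoted, since GPU values can be non-integer
      max: "4"                # upper limit a researcher may request
    value: "1"                # default when the researcher specifies nothing
```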
The policy sets a default value and a limit on the values available for GPU allocation. To apply this policy, run `kubectl apply -f policy.yaml`.
When you do not want the user to be able to change a value, you can force the corresponding user interface control to become read-only by using the `canEdit` key. For example:
- Set the Project namespace here.
- The field is required.
- The field will be shown as read-only in the user interface.
- The field value is true.
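A sketch matching the four notes above might look as follows. The field name `runAsUser`, the policy name, and the namespace are assumptions chosen for illustration; only `required`, `canEdit`, and `value` are taken from the surrounding text.

```yaml
# Hypothetical sketch: the field name and namespace are illustrative assumptions.
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy1
  namespace: runai-team-a   # set the Project namespace here
spec:
  runAsUser:                # assumed boolean field
    rules:
      required: true        # the field is required
      canEdit: false        # shown as read-only in the user interface
    value: true             # the field value is true
```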
The example above illustrated rules for parameters of "primitive" types, such as GPU allocation, CPU memory, working directory, etc. These parameters contain a single value.
Other workload parameters, such as ports or volumes, are "complex", in the sense that they may contain multiple values: a workload may contain multiple ports and multiple volumes.
The following is an example of a policy for `ports`, which is a complex parameter. A port entry typically contains two values: the `external` port, which is mapped to an internal `container` port. A single Workload can have multiple port tuples defined:
```yaml
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy
  namespace: runai
spec:
  ports:
    rules:
      canAdd: true
    itemRules:
      container:
        min: 30000
        max: 32767
      external:
        max: 32767
    items:
      admin-port-a:
        rules:
          canRemove: false
          canEdit: false
        value:
          container: 30100
          external: 8080
      admin-port-b:
        value:
          container: 30101
          external: 8081
```
A policy for a complex field is composed of three parts:
- Rules: Rules apply to the `ports` parameter as a whole. In this example, the administrator sets `canAdd` to `true`, indicating that a researcher submitting an interactive job can add ports beyond those listed by the policy (`true` is the default for `canAdd`, so it could have been omitted from the policy above). When `canAdd` is set to `false`, the researcher cannot add any ports beyond those already specified by the policy.
- itemRules: itemRules impose restrictions on the data members of each item, in this case `container` and `external`. In the above example, the administrator has limited the value of `container` to 30000-32767, and the value of `external` to a maximum of 32767.
- Items: Specifies a list of default ports. Each port is an item in the ports list and is given a label (e.g. `admin-port-b`). The administrator can also specify whether a researcher can change or delete ports from the submitted workload. In the above example, `admin-port-a` is hardwired and cannot be changed or deleted, while `admin-port-b` can be changed or deleted by the researcher when submitting the Workload. A label can also use the reserved name `DEFAULTS`; this item provides the defaults for all other items.
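To illustrate how the three parts interact, here is a sketch of a stricter variant of the ports policy above. It reuses only keys that appear in that example; the combination itself is an illustration, not a recommended configuration.

```yaml
# Sketch: a locked-down variant built from the keys used in the example above.
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy
  namespace: runai
spec:
  ports:
    rules:
      canAdd: false          # researchers cannot add ports of their own
    itemRules:
      container:
        min: 30000           # allowed range for the container port
        max: 32767
    items:
      admin-port-a:
        rules:
          canRemove: false   # cannot be deleted by the researcher
          canEdit: false     # cannot be changed by the researcher
        value:
          container: 30100
          external: 8080
```

With `canAdd: false` and a single non-editable, non-removable item, every workload covered by this policy would be submitted with exactly the one port mapping the administrator defined.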
The following is an example of a complex policy for PVCs which contains
The complete syntax of the policy YAML can be obtained using the `explain` command of kubectl. For example, running `kubectl explain trainingpolicy.spec` returns:

```
KIND:     TrainingPolicy
VERSION:  run.ai/v2alpha1

RESOURCE: spec <Object>

DESCRIPTION:
     The specifications of this TrainingPolicy

FIELDS:
   annotations  <Object>
     Specifies annotations to be set in the container running the created
     workload.

   arguments    <Object>
     If set, the arguments are sent along with the command which overrides
     the image's entry point of the created workload.

   command      <Object>
     If set, overrides the image's entry point with the supplied command.
   ...
```
You can further drill down to get the syntax for `ports` by running `kubectl explain trainingpolicy.spec.ports`:

```
KIND:     TrainingPolicy
VERSION:  run.ai/v2alpha1

RESOURCE: ports <Object>

DESCRIPTION:
     Specify the set of ports exposed from the container running the created
     workload. Used together with --service-type.

FIELDS:
   itemRules    <Object>

   items        <map[string]Object>

   rules        <Object>
     these rules apply to a value of type map (=non primitive) as a whole
     additionally there are rules which apply for specific items of the map
```
Drill down into the `ports.rules` object by running `kubectl explain trainingpolicy.spec.ports.rules`:

```
KIND:     TrainingPolicy
VERSION:  run.ai/

RESOURCE: rules <Object>

DESCRIPTION:
     these rules apply to a value of type map (=non primitive) as a whole
     additionally there are rules which apply for specific items of the map

FIELDS:
   canAdd       <boolean>
     is it allowed for a workload to add items to this map

   required     <boolean>
     if the map as a whole is required
```
Note that each kind of policy has a slightly different set of parameters. For example, an `InteractivePolicy` has a `jupyter` parameter that is not available under
Using Secrets for Environment Variables¶
It is possible to add values from Kubernetes secrets as the value of environment variables included in the policy. The secret will be extracted from the secret object when the Job is created. For example:
When submitting a workload that is affected by this policy, the created container will have an environment variable called `MYPASSWORD` whose value is taken from the key `password` in the Kubernetes secret `my-secret`, which must already exist in the namespace where the workload runs.
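A policy matching this description might be sketched as follows. The `environment` field, the `SECRET:<secret-name>,<key>` value format, the policy name, and the namespace are all assumptions here and are not confirmed by the text above; verify them with kubectl explain before relying on them.

```yaml
# Hypothetical sketch: field name and SECRET value format are assumptions.
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai-team-a                    # assumed project namespace
spec:
  environment:                               # assumed field for environment variables
    items:
      MYPASSWORD:                            # environment variable name
        value: "SECRET:my-secret,password"   # assumed format: secret name, then key
```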
Run:ai provides a secret propagation mechanism from the `runai` namespace to all project namespaces. For further information, see secret propagation.
Terminate Run:ai training Jobs after preemption policy¶
Administrators can set a ‘termination after preemption’ policy for Run:ai training jobs. With this policy applied, a training job is terminated once it has been preempted for any reason. For example, when a training job is using over-quota resources (e.g. GPUs) and the owner of those GPUs wants to reclaim them, the training job is preempted and typically returns to the pending queue. If the termination policy is applied, however, the job is terminated instead of being reinstated as pending. The Termination after Preemption Policy can be set as a cluster-wide policy (applicable to all namespaces/projects) or per project/namespace.
To use this feature, the administrator should configure either a cluster-wide or a namespace policy.
For a cluster-wide policy (all namespaces/projects), use this YAML-based policy:
For a per-namespace (project) policy, use this YAML-based policy:
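The two variants might be sketched as follows, written as two YAML documents in one file. The field name `terminateAfterPreemption`, the policy names, and the project namespace `runai-team-a` are assumptions for illustration; the only difference between the two documents is the namespace in which the policy is created.

```yaml
# Hypothetical sketch: the field name terminateAfterPreemption is an assumption.
# Cluster-wide: created in the runai namespace, applies to all projects.
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai
spec:
  terminateAfterPreemption:
    value: true
---
# Per project: created in the namespace of a single project.
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai-team-a   # assumed project namespace
spec:
  terminateAfterPreemption:
    value: true
```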
Use the standard kubectl get/apply/delete commands to modify and delete policies.
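For example, to view the global interactive policy, run `kubectl get interactivepolicy -n runai`.
Should return the following:
To delete this policy, run `kubectl delete interactivepolicy <policy-name> -n runai`.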
To access project-specific policies, replace the `-n runai` parameter with the namespace of the relevant project.
- For creating workloads based on policies, see the Run:ai submitting workloads