Configure Policies¶
What are Policies?¶
Policies allow administrators to impose restrictions and set default values for Researcher Workloads. For example:
- Restrict researchers from requesting more than 2 GPUs, or less than 1GB of memory for an interactive workload.
- Set the default memory of each training job to 1GB, or mount a default volume to be used by any submitted Workload.
Policies are stored as Kubernetes custom resources.
Policies are specific to a Workload type, so there are several kinds of Policies:
Workload Type | Kubernetes Workload Name | Kubernetes Policy Name |
---|---|---|
Interactive | InteractiveWorkload | InteractivePolicy |
Training | TrainingWorkload | TrainingPolicy |
Distributed Training | DistributedWorkload | DistributedPolicy |
Inference | InferenceWorkload | InferencePolicy |
A Policy can be created per Run:ai Project (Kubernetes namespace). Additionally, a Policy resource can be created in the runai namespace. This special Policy takes effect when there is no project-specific Policy for the relevant workload kind.
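As an illustration of the fallback behavior described above, the following sketch defines a cluster-wide InteractivePolicy by placing it in the runai namespace. It reuses the gpu rule structure from the sample policy shown later on this page. The policy name and the specific limits are hypothetical.

```yaml
# Sketch of a fallback policy; the name and the limits are illustrative assumptions.
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: fallback-interactive-policy  # hypothetical name
  namespace: runai                   # special namespace: used where no project-specific policy exists
spec:
  gpu:
    rules:
      required: true
      min: "1"   # GPU values are quoted as they can contain non-integer values
      max: "2"
    value: "1"   # default GPU allocation
```

A Project such as runai-team-a that defines its own InteractivePolicy would not be governed by this resource, since the fallback only applies when no project-specific Policy of the same kind exists.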
When researchers create a new interactive workload or workspace, they see a list of available node pools and their priority. Priority is set by dragging and dropping the node pools into the desired order. When the node pool priority list is locked by an administrator policy, the Researcher cannot edit it, even if the workspace is created from a template or copied from another workspace.
Creating a Policy¶
Creating your First Policy¶
To create a sample InteractivePolicy, prepare a file (e.g. policy.yaml) containing the following YAML:
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy1
  namespace: runai-team-a # (1)
spec:
  gpu:
    rules:
      required: true
      min: "1" # (2)
      max: "4"
    value: "1"
- (1) Set the Project namespace here.
- (2) GPU values are quoted as they can contain non-integer values.
The policy places a default and limit on the available values for GPU allocation. To apply this policy, run:
kubectl apply -f policy.yaml
Now try submitting an interactive workload that requests a number of GPUs outside this range, for example 5. A message reporting the violation will appear. A similar message will appear in the New Job form of the Run:ai user interface when you enter a number of GPUs that is out of range in the Interactive tab.
Read-only values¶
When you do not want the user to be able to change a value, you can force the corresponding user interface control to become read-only by using the canEdit key. For example:
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: train-policy1
  namespace: runai-team-a # (1)
spec:
  runAsUser:
    rules:
      required: true # (2)
      canEdit: false # (3)
    value: true # (4)
- (1) Set the Project namespace here.
- (2) The field is required.
- (3) The field will be shown as read-only in the user interface.
- (4) The field value is true.
Complex Values¶
The example above illustrated rules for parameters of "primitive" types, such as GPU allocation, CPU memory, working directory, etc. These parameters contain a single value.
Other workload parameters, such as ports or volumes, are "complex", in the sense that they may contain multiple values: a workload may contain multiple ports and multiple volumes.
The following is an example of a policy for the ports parameter, which is complex. A ports entry typically contains two values: an external port and the internal container port it is mapped to. Multiple port tuples can be defined for a single Workload:
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy
  namespace: runai
spec:
  ports:
    rules:
      canAdd: true
    itemRules:
      container:
        min: 30000
        max: 32767
      external:
        max: 32767
    items:
      admin-port-a:
        rules:
          canRemove: false
          canEdit: false
        value:
          container: 30100
          external: 8080
      admin-port-b:
        value:
          container: 30101
          external: 8081
A policy for a complex field is composed of three parts:
- Rules: Rules apply to the ports parameter as a whole. In this example, the administrator specifies the canAdd rule with the value true, indicating that a researcher submitting an interactive job can add ports to those listed by the policy (true is the default for canAdd, so it could have been omitted from the policy above). When canAdd is set to false, the researcher will not be able to add any port beyond those already specified by the policy.
- itemRules: itemRules impose restrictions on the data members of each item, in this case container and external. In the above example, the administrator has limited the value of container to 30000-32767, and the value of external to a maximum of 32767.
- Items: Specifies a list of default ports. Each port is an item in the ports list and is given a label (e.g. admin-port-b). The administrator can also specify whether a researcher can change or delete ports from the submitted workload. In the above example, admin-port-a is hardwired and cannot be changed or deleted, while admin-port-b can be changed or deleted by the researcher when submitting the Workload. It is possible to specify a label using the reserved name DEFAULTS. This item provides the defaults for all other items.
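To make the effect of canAdd concrete, the sketch below is a variant of the ports policy above in which researchers cannot add ports of their own and the single listed port is fixed. It only rearranges keys already shown in this section. The policy name and the port numbers are hypothetical.

```yaml
# Variant of the ports policy above; the name and port values are illustrative assumptions.
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: locked-ports-policy   # hypothetical name
  namespace: runai-team-a     # set the Project namespace here
spec:
  ports:
    rules:
      canAdd: false           # researchers cannot add ports beyond the items below
    items:
      admin-port-a:
        rules:
          canRemove: false    # the port cannot be deleted
          canEdit: false      # the port cannot be changed
        value:
          container: 30100
          external: 8080
```

Under such a policy, interactive workloads in the Project would be expected to expose exactly the port defined by admin-port-a.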
The following is an example of a complex policy for PVCs which contains DEFAULTS.
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: tp # use your name.
  namespace: runai-team-a # use your namespace
spec:
  pvcs:
    itemRules:
      existingPvc:
        canEdit: false
      claimName:
        required: true
    items:
      DEFAULTS:
        value:
          existingPvc: true
          path: nil
Syntax¶
The complete syntax of the policy YAML can be obtained using the explain command of kubectl. For example:
kubectl explain trainingpolicy.spec
KIND:     TrainingPolicy
VERSION:  run.ai/v2alpha1
RESOURCE: spec <Object>
DESCRIPTION:
     The specifications of this TrainingPolicy
FIELDS:
   annotations <Object>
     Specifies annotations to be set in the container running the created
     workload.
   arguments <Object>
     If set, the arguments are sent along with the command which overrides the
     image's entry point of the created workload.
   command <Object>
     If set, overrides the image's entry point with the supplied command.
   ...
You can further drill down to get the syntax for ports by running:
kubectl explain trainingpolicy.spec.ports
KIND:     TrainingPolicy
VERSION:  run.ai/v2alpha1
RESOURCE: ports <Object>
DESCRIPTION:
     Specify the set of ports exposed from the container running the created
     workload. Used together with --service-type.
FIELDS:
   itemRules <Object>
   items <map[string]Object>
   rules <Object>
     these rules apply to a value of type map (=non primitive) as a whole
     additionally there are rules which apply for specific items of the map
Drill down into the ports.rules object by running:
kubectl explain trainingpolicy.spec.ports.rules
KIND:     TrainingPolicy
VERSION:  run.ai/
RESOURCE: rules <Object>
DESCRIPTION:
     these rules apply to a value of type map (=non primitive) as a whole
     additionally there are rules which apply for specific items of the map
FIELDS:
   canAdd <boolean>
     is it allowed for a workload to add items to this map
   required <boolean>
     if the map as a whole is required
Note that each kind of policy has a slightly different set of parameters. For example, an InteractivePolicy has a jupyter parameter that is not available under TrainingPolicy.
Using Secrets for Environment Variables¶
It is possible to add values from Kubernetes secrets as the value of environment variables included in the policy. The secret will be extracted from the secret object when the Job is created. For example:
When submitting a workload that is affected by this policy, the created container will have an environment variable called MYPASSWORD whose value is taken from the key password in the Kubernetes secret my-secret, which has been pre-created in the namespace where the workload runs.
Note
Run:ai provides a secret propagation mechanism from the runai namespace to all project namespaces. For further information, see secret propagation.
Terminate Run:ai training Jobs after preemption policy¶
Administrators can set a ‘termination after preemption’ policy for Run:ai training jobs. Once this policy is applied, a training job is terminated when it is preempted for any reason. For example, when a training job is using over-quota resources (e.g. GPUs) and the owner of those GPUs wants to reclaim them, the training job is preempted and typically goes back to the pending queue. However, if the termination policy is applied, the job is terminated instead of being returned to the pending queue. The Termination after Preemption Policy can be set as a cluster-wide policy (applicable to all namespaces/projects) or per project/namespace.
To use this feature, the administrator should configure either a cluster-wide policy or a namespace policy.
For a cluster-wide policy (all namespaces/projects), use this YAML-based policy:
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai
spec:
  terminateAfterPreemption:
    value: true
For a single namespace (project), use this YAML-based policy:
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai-<PROJECT_NAME>
spec:
  terminateAfterPreemption:
    value: false
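The terminateAfterPreemption setting sits under spec like any other TrainingPolicy parameter. As a sketch, assuming the two fields can be combined in one resource (both appear as TrainingPolicy fields in the examples on this page), a project policy could enable termination after preemption and also fix runAsUser. The policy name and the namespace are hypothetical.

```yaml
# Sketch only: combining two documented TrainingPolicy fields is an assumption.
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy-team-a   # hypothetical name
  namespace: runai-team-a        # the Project namespace
spec:
  terminateAfterPreemption:
    value: true                  # preempted training jobs are terminated instead of returning to pending
  runAsUser:
    rules:
      required: true
      canEdit: false             # shown as read-only in the user interface
    value: true
```

The full list of fields that a TrainingPolicy spec accepts can be checked with the kubectl explain command described in the Syntax section.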
Modifying/Deleting Policies¶
Use the standard kubectl get, apply and delete commands to view, modify and delete policies.
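For example, to view the global interactive policy:
kubectl get interactivepolicy -n runai
This lists the InteractivePolicy resources in the runai namespace. To delete the policy named interactive-policy:
kubectl delete interactivepolicy interactive-policy -n runai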
To access project-specific policies, replace runai in the -n runai parameter with the namespace of the relevant project.
See Also¶
- For creating workloads based on policies, see the Run:ai submitting workloads