Policies V1

Warning

This page describes the old V1 Policies. While these still work, they have been replaced by control-plane-based V2 policies, which are accessible via API and user interface. For a description of the new policies, see API-based Policies.

What are Policies?

Policies allow administrators to impose restrictions and set default values for Researcher Workloads. For example:

  1. Restrict researchers from requesting more than 2 GPUs, or less than 1GB of memory for an interactive workload.
  2. Set the default memory of each training job to 1GB, or mount a default volume to be used by any submitted Workload.

Policies are stored as Kubernetes custom resources.
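
Because Policies are plain Kubernetes custom resources, you can list the policy kinds available in your cluster with kubectl; the command below assumes the run.ai API group used throughout the examples on this page:

kubectl api-resources --api-group=run.ai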

Policies are specific to a Workload type. As such, there are several kinds of Policies:

Workload Type        | Kubernetes Workload Name | Kubernetes Policy Name
---------------------|--------------------------|-----------------------
Interactive          | InteractiveWorkload      | InteractivePolicy
Training             | TrainingWorkload         | TrainingPolicy
Distributed Training | DistributedWorkload      | DistributedPolicy
Inference            | InferenceWorkload        | InferencePolicy

A Policy can be created per Run:ai Project (Kubernetes namespace). Additionally, a Policy resource can be created in the runai namespace. This special Policy will take effect when there is no project-specific Policy for the relevant workload kind.
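
To see which Projects already have a policy of a given kind, and whether a global fallback exists in the runai namespace, list the resources across all namespaces. The plural resource name below mirrors the interactivepolicies form used later on this page and is assumed to follow the same pattern for the other kinds:

kubectl get trainingpolicies --all-namespaces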

When researchers create a new interactive workload or workspace, they see a list of available node pools and their priority. Priority is set by dragging and dropping the node pools into the desired order. When the node pool priority list is locked by an administrator policy, the Researcher cannot edit it, even if the workspace is created from a template or copied from another workspace.
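
The exact spec field that controls the node pool priority list is not shown on this page. If you need to lock it, one way to locate the field (assuming it is exposed in the policy spec of your cluster version) is to search the output of kubectl explain:

kubectl explain interactivepolicy.spec | grep -i -A 3 nodepool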

Note

The Policies on this page cannot be added to clusters of version 2.16 or higher that have the New Policy Manager enabled.

Creating a Policy

Creating your First Policy

To create a sample InteractivePolicy, prepare a file (e.g. gpupolicy.yaml) containing the following YAML:

gpupolicy.yaml
apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy1
  namespace: runai-team-a # (1)
spec:
  gpu:
    rules:
      required: true
      min: "1"  # (2)
      max: "4"  
    value: "1"
  1. Set the Project namespace here.
  2. GPU values are quoted as they can contain non-integer values.

The policy places a default and limit on the available values for GPU allocation. To apply this policy, run:

kubectl apply -f gpupolicy.yaml 

Now, try the following command:

runai submit --gpu 5 --interactive -p team-a

The following message will appear:

gpu: must be no greater than 4

A similar message will appear in the New Job form of the Run:ai user interface when attempting to enter a number of GPUs that is out of range.
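
The value: "1" line in the policy also sets a default, so a submission that omits the --gpu flag should be admitted with one GPU allocated. A quick way to check this, using the same project as above:

runai submit --interactive -p team-a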

GPU and CPU memory limits

The following policy places a default and limit on the available values for CPU and GPU memory allocation.

gpumemorypolicy.yaml
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai
spec:
  gpuMemory:
    rules:
      min: 100M
      max: 2G
  memory:
    rules:
      min: 100M
      max: 2G
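
Because this policy is created in the runai namespace, it acts as the fallback for every Project that has no TrainingPolicy of its own. Apply it the same way as the previous example:

kubectl apply -f gpumemorypolicy.yaml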

Read-only values

When you do not want the user to be able to change a value, you can force the corresponding user interface control to become read-only by using the canEdit key. For example,

runasuserpolicy.yaml
apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: train-policy1
  namespace: runai-team-a # (1) 

spec:
  runAsUser:
    rules:
      required: true  # (2)
      canEdit: false  # (3)
    value: true # (4)
  1. Set the Project namespace here.
  2. The field is required.
  3. The field will be shown as read-only in the user interface.
  4. The field value is true.
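
Apply this policy with kubectl apply -f runasuserpolicy.yaml. To confirm which rules the runAsUser parameter accepts in your cluster version, the kubectl explain technique described in the Syntax section below also works at the field level:

kubectl explain trainingpolicy.spec.runAsUser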

Complex Values

The examples above illustrate rules for parameters of "primitive" types, such as GPU allocation, CPU memory, working directory, etc. These parameters contain a single value.

Other workload parameters, such as ports or volumes, are "complex", in the sense that they may contain multiple values: a workload may contain multiple ports and multiple volumes.

The following is an example of a policy for ports, which is a complex parameter. Each port entry typically contains two values: an external port that is mapped to an internal container port. A single Workload may define multiple such port tuples:

apiVersion: run.ai/v2alpha1
kind: InteractivePolicy
metadata:
  name: interactive-policy
  namespace: runai
spec:
  ports:
    rules:
      canAdd: true
    itemRules:
      container:
        min: 30000
        max: 32767
      external:
        max: 32767
    items:
      admin-port-a:
        rules:
          canRemove: false
          canEdit: false
        value:
          container: 30100
          external: 8080
      admin-port-b:
        value:
          container: 30101
          external: 8081

A policy for a complex field is composed of three parts:

  • Rules: Rules apply to the ports parameter as a whole. In this example, the administrator specifies the canAdd rule with a value of true, indicating that a researcher submitting an interactive job can add ports beyond those listed by the policy (true is the default for canAdd, so it could have been omitted from the policy above). When canAdd is set to false, the researcher cannot add any port other than those already specified by the policy.
  • itemRules: itemRules impose restrictions on the data members of each item, in this case container and external. In the above example, the administrator has limited the value of container to 30000-32767, and the value of external to a maximum of 32767.
  • Items: Specifies a list of default ports. Each port is an item in the ports list and is given a label (e.g. admin-port-b). The administrator can also specify whether a researcher can change or delete ports from the submitted workload. In the above example, admin-port-a is hardwired and cannot be changed or deleted, while admin-port-b can be changed or deleted by the researcher when submitting the Workload. It is also possible to specify an item using the reserved label DEFAULTS; this item provides the defaults for all other items.

The following is an example of a complex policy for PVCs which contains DEFAULTS.

apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: tp # use your name.
  namespace: runai-team-a # use your namespace
spec:
  pvcs:
    itemRules:
      existingPvc:
        canEdit: false
      claimName:
        required: true
    items:
      DEFAULTS:
        value:
          existingPvc: true
          path: nil

Syntax

The complete syntax of the policy YAML can be obtained using the explain command of kubectl. For example:

kubectl explain trainingpolicy.spec

This should provide the list of all possible fields in the spec of training policies:

KIND:     TrainingPolicy
VERSION:  run.ai/v2alpha1

RESOURCE: spec <Object>

DESCRIPTION:
The specifications of this TrainingPolicy

FIELDS:
annotations <Object>
Specifies annotations to be set in the container running the created
workload.

arguments   <Object>
If set, the arguments are sent along with the command which overrides the
image's entry point of the created workload.

command <Object>
If set, overrides the image's entry point with the supplied command.
...

You can further drill down to get the syntax for ports by running:

kubectl explain trainingpolicy.spec.ports

KIND:     TrainingPolicy
VERSION:  run.ai/v2alpha1

RESOURCE: ports <Object>

DESCRIPTION:
     Specify the set of ports exposed from the container running the created
     workload. Used together with --service-type.

FIELDS:
   itemRules    <Object>

   items    <map[string]Object>

   rules    <Object>
     these rules apply to a value of type map (=non primitive) as a whole
     additionally there are rules which apply for specific items of the map

Drill down into the ports.rules object by running:

kubectl explain trainingpolicy.spec.ports.rules

KIND:     TrainingPolicy
VERSION:  run.ai/v2alpha1

RESOURCE: rules <Object>

DESCRIPTION:
     these rules apply to a value of type map (=non primitive) as a whole
     additionally there are rules which apply for specific items of the map

FIELDS:
   canAdd   <boolean>
     is it allowed for a workload to add items to this map

   required <boolean>
     if the map as a whole is required

Note that each kind of policy has a slightly different set of parameters. For example, an InteractivePolicy has a jupyter parameter that is not available under TrainingPolicy.
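
A simple way to compare the parameter sets of two policy kinds is to run kubectl explain against each and compare the field lists. For example, the jupyter parameter should appear only for the interactive kind:

kubectl explain interactivepolicy.spec.jupyter
kubectl explain trainingpolicy.spec.jupyter   # expected to fail: jupyter is not a TrainingPolicy field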

Using Secrets for Environment Variables

It is possible to use values from Kubernetes secrets as the value of environment variables included in the policy. The value will be extracted from the secret object when the Job is created. For example:

  environment:
    items:
      MYPASSWORD:
        value: "SECRET:my-secret,password"

When submitting a workload that is affected by this policy, the created container will have an environment variable called MYPASSWORD whose value is taken from the key password of the Kubernetes secret my-secret, which must be pre-created in the namespace where the workload runs.
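
The secret itself is created with standard kubectl commands. For example, assuming the workload runs in Project team-a (namespace runai-team-a):

kubectl create secret generic my-secret -n runai-team-a --from-literal=password='<your-password>'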

Prevent Data Storage on the Node

You can configure policies to prevent the submission of workloads that use data sources consisting of a host path. This prevents data from being stored on the node itself, where it would be lost if the node is deleted.

Example of rejecting workloads that request a host path:

spec:
  volumes:
    itemRules:
      nfsServer:
        required: true
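
With this policy in place, a submission whose volume is a host path (and therefore carries no nfsServer value) should be rejected. A sketch of such a submission, assuming the CLI's --volume flag is used for host-path mounts:

runai submit train1 -p team-a --volume /data-on-host:/data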

Terminate Run:ai training Jobs after preemption policy

Administrators can set a 'terminate after preemption' policy for Run:ai training jobs. After this policy is applied, a training job is terminated once it has been preempted for any reason. For example, when a training job that is using over-quota resources (e.g. GPUs) is preempted because the owner of those GPUs reclaims them, the job typically goes back to the pending queue; with the termination policy applied, the job is terminated instead of returned to pending. The Termination after Preemption Policy can be set as a cluster-wide policy (applicable to all namespaces/projects) or per project/namespace.

To use this feature, the administrator should configure either a cluster-wide or a per-namespace policy.

For a cluster-wide policy (all namespaces/projects), use this YAML-based policy:

apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai
spec:
  terminateAfterPreemption:
    value: true

For a per-namespace (project) policy, use this YAML-based policy:

apiVersion: run.ai/v2alpha1
kind: TrainingPolicy
metadata:
  name: training-policy
  namespace: runai-<PROJECT_NAME>
spec:
  terminateAfterPreemption:
    value: false
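
Apply whichever variant you need with kubectl, as with the earlier examples. For instance, if the cluster-wide variant above is saved as terminate-after-preemption.yaml (a file name chosen here for illustration):

kubectl apply -f terminate-after-preemption.yaml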

Modifying/Deleting Policies

Use the standard kubectl get/apply/delete commands to modify and delete policies.

For example, to view the global interactive policy:

kubectl get interactivepolicies -n runai

This should return the following:

NAME                 AGE
interactive-policy   2d3h

To delete this policy:

kubectl delete InteractivePolicy interactive-policy -n runai

To access project-specific policies, replace the -n runai parameter with the namespace of the relevant project.
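
To inspect or edit the full spec of a policy in place, the usual kubectl verbs apply as well. For example:

kubectl get interactivepolicies interactive-policy -n runai -o yaml
kubectl edit interactivepolicies interactive-policy -n runai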

See Also