
Integrate Run:ai with Ray

Ray is an open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert.

Install Ray operator

You must install KubeRay version 0.5.0 or later to work with the different types of Ray workloads.

Use the following commands:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/

helm install kuberay-operator kuberay/kuberay-operator -n kuberay-operator --version 0.5.0 --create-namespace

For more information, see Deploying the KubeRay operator.
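
Before submitting workloads, you can verify that the operator is running; a quick check (the namespace matches the helm install command above):

# The KubeRay operator pod should be in the Running state
kubectl get pods -n kuberay-operator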

Submit a Ray job

Run:ai integrates with Ray by interacting with the KubeRay CRDs (RayJob, RayService, and RayCluster). Note that the pod templates in the examples below set schedulerName: runai-scheduler, so the Ray pods are scheduled by the Run:ai scheduler. The following is an example of a RayJob scheduled by Run:ai. Use the following command to submit your Ray job:

kubectl apply -f <path/example-file-name.yaml>

For more information, see Run an example job.

Example:

apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: <name>
  namespace: <your_project_namespace>
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # runtimeEnv decoded to '{
  #    "pip": [
  #        "requests==2.26.0",
  #        "pendulum==2.1.2"
  #    ],
  #    "env_vars": {
  #        "counter_name": "test_counter"
  #    }
  #}'
  shutdownAfterJobFinishes: true
  #ttlSecondsAfterFinished: 30
  runtimeEnv: ewogICAgInBpcCI6IFsKICAgICAgICAicmVxdWVzdHM9PTIuMjYuMCIsCiAgICAgICAgInBlbmR1bHVtPT0yLjEuMiIKICAgIF0sCiAgICAiZW52X3ZhcnMiOiB7ImNvdW50ZXJfbmFtZSI6ICJ0ZXN0X2NvdW50ZXIifQp9Cg==
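  # The runtimeEnv value above is the base64 encoding of the JSON shown in the comment.
  # One way to generate it (a sketch, assuming the JSON is saved as runtime-env.json):
  #   base64 -w0 runtime-env.json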
  rayClusterSpec:
    rayVersion: '2.3.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      serviceType: ClusterIP # optional
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: '2' # can be auto-completed from the limits
        block: 'true'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.3.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          schedulerName: runai-scheduler
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the number of pod replicas in this worker group
      - replicas: 1 # example
        minReplicas: 1 # example
        maxReplicas: 5 # example
        # logical group name; in this example it is called small-group (the name can also be functional)
        groupName: small-group
        rayStartParams:
          block: 'true'
        #pod template
        template:
          spec:
            initContainers:
              # the operator sets the env var $RAY_IP to the name of the head service; the init container waits until it resolves
              - name: init
                image: busybox:1.28
                command: [ 'sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done" ]
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
                image: rayproject/ray:2.3.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
            schedulerName: runai-scheduler
######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    print(requests.__version__)
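
After the manifest is applied, you can follow the job with kubectl; a minimal sketch (the name and namespace are the ones set in the RayJob metadata above):

# Check the RayJob and the pods it creates
kubectl get rayjob -n <your_project_namespace>
kubectl describe rayjob <name> -n <your_project_namespace>
kubectl get pods -n <your_project_namespace>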

Ray autoscaling cluster

For more information, see Ray autoscaling.

Use the following command to submit your Ray autoscaling cluster:

kubectl apply -f <path/example-file-name.yaml>

Example:

# This config demonstrates KubeRay's Ray autoscaler integration.
# The resource requests and limits in this config are too small for production!
# For an example with more realistic resource configuration, see
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
    # A unique identifier for the head node and workers of this cluster.
  name: <name>
  namespace: <your_project_namespace>
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  rayVersion: '2.3.0'
  # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
  # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
  # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
  enableInTreeAutoscaling: true
  # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
  # The example configuration shown below represents the DEFAULT values.
  # (You may delete autoscalerOptions if the defaults are suitable.)
  autoscalerOptions:
    # upscalingMode is "Conservative", "Default", or "Aggressive."
    # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
    # Default: Upscaling is not rate-limited.
    # Aggressive: An alias for Default; upscaling is not rate-limited.
    upscalingMode: Default
    # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
    idleTimeoutSeconds: 60
    # image optionally overrides the autoscaler's container image.
    # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
    # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
    ## image: "my-repo/my-custom-autoscaler-image:tag"
    # imagePullPolicy optionally overrides the autoscaler container's default image pull policy (IfNotPresent).
    imagePullPolicy: IfNotPresent
    # Optionally specify the autoscaler container's securityContext.
    securityContext: {}
    env: []
    envFrom: []
    # resources specifies optional resource request and limit overrides for the autoscaler container.
    # The default autoscaler resource limits and requests should be sufficient for production use-cases.
    # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod template
  headGroupSpec:
    serviceType: ClusterIP # optional
    # the following params are used to complete the ray start: ray start --head --block ...
    rayStartParams:
      dashboard-host: '0.0.0.0'
      block: 'true'
      # num-cpus: '1' # can be auto-completed from the limits
      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the specific format demonstrated below:
      # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
    #pod template
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: rayproject/ray:2.3.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # The resource requests and limits in this config are too small for production!
          # For an example with more realistic resource configuration, see
          # ray-cluster.autoscaler.large.yaml.
          # It is better to use a few large Ray pods than many small ones.
          # For production, it is ideal to size each Ray pod to take up the
          # entire Kubernetes node on which it is scheduled.
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              # For production use-cases, we recommend specifying integer CPU requests and limits.
              # We also recommend setting requests equal to limits for both CPU and memory.
              # For this example, we use a 500m CPU request to accommodate resource-constrained local
              # Kubernetes testing environments such as KinD and minikube.
              cpu: "500m"
              # The at-rest memory usage of the Ray head node is around 1 GB. We do not
              # recommend allocating less than 2 GB of memory for the Ray head pod.
              # For production use-cases, we recommend allocating at least 8 GB of memory for each Ray container.
              memory: "2G"
        schedulerName: runai-scheduler
  workerGroupSpecs:
  # the number of pod replicas in this worker group
  - replicas: 1 # example
    minReplicas: 1 # example
    maxReplicas: 10 # example
    # logical group name; in this example it is called small-group (the name can also be functional)
    groupName: small-group
    # If worker pods need to be added, we can increment the replicas.
    # If worker pods need to be removed, we decrement the replicas, and populate the workersToDelete list.
    # The operator will remove pods from the list until the desired number of replicas is satisfied.
    # If the difference between the current replica count and the desired replicas is greater than the
    # number of entries in workersToDelete, random worker pods will be deleted.
    #scaleStrategy:
    #  workersToDelete:
    #  - raycluster-complete-worker-small-group-bdtwh
    #  - raycluster-complete-worker-small-group-hv457
    #  - raycluster-complete-worker-small-group-k8tj7
    # the following params are used to complete the ray start: ray start --block ...
    rayStartParams:
      block: 'true'
    #pod template
    template:
      spec:
        initContainers:
        # the operator sets the env var $RAY_IP to the name of the head service; the init container waits until it resolves
        - name: init
          image: busybox:1.28
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]
        containers:
        - name: ray-worker
          image: rayproject/ray:2.3.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          # The resource requests and limits in this config are too small for production!
          # For an example with more realistic resource configuration, see
          # ray-cluster.autoscaler.large.yaml.
          # It is better to use a few large Ray pods than many small ones.
          # For production, it is ideal to size each Ray pod to take up the
          # entire Kubernetes node on which it is scheduled.
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            # For production use-cases, we recommend specifying integer CPU requests and limits.
            # We also recommend setting requests equal to limits for both CPU and memory.
            # For this example, we use a 500m CPU request to accommodate resource-constrained local
            # Kubernetes testing environments such as KinD and minikube.
            requests:
              cpu: "500m"
              memory: "1G"
        schedulerName: runai-scheduler
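
Once the cluster is created, you can watch the autoscaler add and remove worker pods, and reach the Ray dashboard through the head service; a minimal sketch (the head service name assumes KubeRay's default <name>-head-svc naming convention):

# Watch the cluster and its pods while the autoscaler scales the worker group
kubectl get raycluster -n <your_project_namespace>
kubectl get pods -n <your_project_namespace> -w

# Forward the Ray dashboard (port 8265 on the head) to your workstation
kubectl port-forward -n <your_project_namespace> service/<name>-head-svc 8265:8265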