Cluster Monitoring

Introduction

Organizations typically want to automatically detect critical issues and escalate them to IT/DevOps personnel. The standard practice is to install an alert-management tool and connect it to critical systems.

Run:AI comprises two parts:

  • A control plane (or "backend"). The control plane typically resides on the cloud; the health of this cloud portion can be viewed at status.run.ai. In Self-hosted installations, the control plane is installed on-prem.
  • One or more GPU clusters.

The purpose of this document is to configure Run:AI to emit health alerts and to connect these alerts to alert-management systems within the organization.

Alerts are emitted for Run:AI clusters, as well as for the Run:AI backend in Self-hosted installations where the backend resides on the same Kubernetes cluster as one of the Run:AI clusters.

Alert Infrastructure

Run:AI uses Prometheus to externalize metrics. The Run:AI cluster installation either installs Prometheus or connects to an existing Prometheus instance used in the organization. Run:AI cluster alerts are based on the Prometheus Alert Manager, which is enabled by default.
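This default is controlled from the kube-prometheus-stack section of the values file, which the fuller examples below also extend. A minimal sketch:

kube-prometheus-stack:
  alertmanager:
    enabled: true    # set to false to disable the bundled Alert Manager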

This document explains how to:

  • Configure alert destinations. Triggered alerts send their data to these destinations.
  • Understand the out-of-the-box cluster alerts.
  • Advanced: add additional custom alerts.

Configure Alert Destinations

Prometheus Alert Manager provides a structured way to connect to alert-management systems. For configuration details, see the Prometheus Alertmanager documentation. There are built-in plug-ins for popular systems such as PagerDuty and OpsGenie, as well as a generic webhook.
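As an illustrative sketch, a PagerDuty receiver uses the same values-file structure as the webhook example below. The receiver name pagerduty-notifications and the <PAGERDUTY-INTEGRATION-KEY> placeholder are assumptions for this example; the key is issued by PagerDuty:

kube-prometheus-stack:
  alertmanager:
    config:
      receivers:
      - name: pagerduty-notifications
        pagerduty_configs:
          - routing_key: <PAGERDUTY-INTEGRATION-KEY>
            send_resolved: true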

The following example shows how to integrate Run:AI with a generic webhook:

kube-prometheus-stack:
  ...
  alertmanager:
    enabled: true
    config:
      global:
        resolve_timeout: 5m
      receivers:
      - name: "null"
      - name: webhook-notifications
        webhook_configs:
          - url: <WEB-HOOK-URL>
            send_resolved: true
      route:
        group_by:
        - alertname
        group_interval: 5m
        group_wait: 30s
        receiver: 'null'
        repeat_interval: 10m
        routes:
        - receiver: webhook-notifications

(Replace <WEB-HOOK-URL> with the URL of your webhook endpoint. For testing, you can obtain a temporary URL at https://webhook.site/.)

  • On an existing installation, use the upgrade cluster instructions to modify the values file.
  • Verify that alerts arrive at your webhook destination (for example, at https://webhook.site/ if you used it as a test URL).

Out-of-the-box Alerts

A Run:AI cluster comes with several built-in alerts. Each alert tests a specific aspect of Run:AI functionality. In addition, there is a single, inclusive alert that aggregates all component-based alerts into a single cluster health test.

The aggregated alert is named RunaiCriticalProblem. It is categorized as "critical".
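To page on-call personnel only for critical issues, Alert Manager routes can match on the severity label. This is a sketch, not Run:AI-specific configuration; it assumes a receiver named pagerduty-notifications has already been defined (as in the earlier example), and the matchers syntax requires Alert Manager v0.22 or later (older versions use the match field):

kube-prometheus-stack:
  alertmanager:
    config:
      route:
        routes:
        - receiver: pagerduty-notifications
          matchers:
          - severity = critical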

Add a Custom Alert

You can add custom alerts to Run:AI. Alerts are triggered using the Prometheus query language (PromQL) with any Run:AI metric. To add a new alert:

  • When installing the Run:AI cluster, edit the values file.
  • On an existing installation, use the upgrade cluster instructions to modify the values file.
  • Add an alert according to the following structure:


kube-prometheus-stack:
  additionalPrometheusRulesMap:
    custom-runai:
      groups:
      - name: custom-runai-rules
        rules:
        - alert: <ALERT-NAME>
          annotations:
            summary: <ALERT-SUMMARY-TEXT>
          expr: <PROMQL-EXPRESSION>
          for: <optional: duration in s/m/h>
          labels:
            severity: <critical/warning>

You can find an example in the Prometheus alerting rules documentation.
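For instance, the following sketch fills in the placeholders above. The metric name runai_node_gpu_utilization and the 95% threshold are hypothetical; substitute a real Run:AI metric and a threshold that fits your environment:

kube-prometheus-stack:
  additionalPrometheusRulesMap:
    custom-runai:
      groups:
      - name: custom-runai-rules
        rules:
        - alert: NodeGpuSaturated
          annotations:
            summary: GPU utilization has exceeded 95% for 15 minutes
          expr: runai_node_gpu_utilization > 95   # hypothetical metric name
          for: 15m
          labels:
            severity: warning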


Last update: November 10, 2021