Planning for Disaster Recovery

With the SaaS version of Run:ai, most of the disaster recovery burden shifts to Run:ai, so data backup is not a concern in such environments.

With the self-hosted version, the IT organization is responsible for backing up data in preparation for a possible disaster and for learning how to recover when needed.

Backup

Database

Run:ai uses an internal PostgreSQL database. The database is stored on a Kubernetes Persistent Volume (PV). You must provide a backup solution for the database.

Alternatives:

  • (Recommended) Back up the PV.
  • Use the company's enterprise PostgreSQL solution, if one exists, instead of the in-place instance that Run:ai spawns.

Metrics

Run:ai stores metric history using Thanos. Thanos is configured to store data on a persistent volume. The recommendation is to back up the PV.
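
One possible way to back up the database and metrics PVs is with Kubernetes volume snapshots. The sketch below only builds the manifests; it assumes the cluster's storage driver supports the `snapshot.storage.k8s.io/v1` API, and the claim names, namespace, and snapshot class shown are hypothetical placeholders rather than names taken from a Run:ai installation.

```python
import json


def volume_snapshot_manifest(pvc_name, namespace, snapshot_class):
    """Build a VolumeSnapshot manifest for a single PersistentVolumeClaim."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": f"{pvc_name}-backup", "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def snapshot_list(pvc_names, namespace, snapshot_class):
    """Wrap one VolumeSnapshot per claim in a single List object."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            volume_snapshot_manifest(name, namespace, snapshot_class)
            for name in pvc_names
        ],
    }


if __name__ == "__main__":
    # Hypothetical names: look up the real claims with `kubectl get pvc -A`.
    claims = ["example-postgresql-pvc", "example-thanos-pvc"]
    manifests = snapshot_list(claims, "example-namespace", "example-snapshot-class")
    print(json.dumps(manifests, indent=2))
```

The printed JSON can be saved to a file, reviewed, and applied with `kubectl apply -f`. A snapshot normally lives in the same storage system as the volume, so copying the data off-cluster may still be needed to survive the loss of the whole cluster.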

Additional Configuration

During the installation of Run:ai, you created two values files: one for the control plane and one for the cluster.

Save these files, or extract a current version of a file by using the upgrade script.

Administrators may also create templates. Templates are stored as ConfigMaps in the runai namespace.
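
Because templates are ConfigMaps, they can be exported with `kubectl`. The following is a minimal sketch, assuming `kubectl` is installed and pointed at the cluster. It exports every ConfigMap in the `runai` namespace, not only templates, since how templates are distinguished from other ConfigMaps is not specified here. The function names are illustrative.

```python
import subprocess
from pathlib import Path


def export_command(namespace="runai"):
    """Return the kubectl command that dumps all ConfigMaps in a namespace as YAML."""
    return ["kubectl", "get", "configmaps", "--namespace", namespace, "--output", "yaml"]


def export_configmaps(target, namespace="runai"):
    """Run the export and write the YAML to `target`; raises if kubectl fails."""
    result = subprocess.run(
        export_command(namespace), check=True, capture_output=True, text=True
    )
    Path(target).write_text(result.stdout)
```

For example, `export_configmaps("runai-configmaps-backup.yaml")` writes the export to a file (the file name is a placeholder) that can be stored alongside the values files.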

Recovery

To recover Run:ai:

  • Re-create the Kubernetes/OpenShift cluster.
  • Recover the persistent volumes for metrics and database.
  • Re-install the Run:ai control plane using the stored values file. If needed, modify the values file to connect to the restored PostgreSQL PV, and connect Prometheus to the restored metrics PV.
  • Re-install the cluster. Use the stored values file or download a new file from the Administration UI.
  • If the cluster is configured such that Projects do not create namespaces automatically, you will need to re-create the namespaces and apply role bindings as described for Kubernetes or OpenShift.
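
Before starting the steps above, it can help to confirm that the saved artifacts are actually available. The sketch below is one way to do that, assuming the backups were collected in a single directory. The helper name and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def missing_artifacts(backup_dir, required):
    """Return the required file names that are not present in backup_dir."""
    base = Path(backup_dir)
    return [name for name in required if not (base / name).is_file()]
```

For example, `missing_artifacts("/backups/runai", ["control-plane-values.yaml", "cluster-values.yaml"])` returns an empty list when both placeholder files are present, and otherwise lists the files that still need to be located or regenerated.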

Last update: March 23, 2022