Planning for Disaster Recovery¶
The SaaS version of Run:ai moves the bulk of the burden of disaster recovery to Run:ai. Backup of data is hence not an issue in such environments.
With the self-hosted version, it is the responsibility of the IT organization to back up data for a possible disaster and to learn how to recover when needed.
Backup¶
Database¶
Run:ai uses an internal PostgreSQL database. The database is stored on a Kubernetes Persistent Volume (PV). You must provide a backup solution for the database. Typically by backing up the persistent volume holding the database storage.
Metrics¶
Run:ai stores metric history using Thanos. Thanos is configured to store data on a persistent volume. The recommendation is to back up the PV.
Additional Configuration¶
During the installation of Run:ai you have created two value files:
- One for the Run:ai control plane. See Kubernetes or OpenShift,
- One for the cluster (see Kubernetes or OpenShift).
You will want to save these files or extract a current version of the file by using the upgrade script.
Recovery¶
To recover Run:ai
- Re-create the Kubernetes/OpenShift cluster.
- Recover the persistent volumes for metrics and database.
- Re-install the Run:ai control plane. Use the stored values file. If needed, modify the values file to connect to the restored PostgreSQL PV. Connect Prometheus to the stored metrics PV.
- Re-install the cluster. Use the stored values file or download a new file from the Administration UI.
- If the cluster is configured such that Projects do not create a namespace automatically, you will need to re-create namespaces and apply role bindings as discussed in Kubernetes or OpenShift.