
Disaster recovery

This runbook covers backing up and restoring a plexsphere control plane after a disaster, and the change-freeze invariant that bounds the blast radius. The mesh data plane outlives a control-plane disaster: existing plexds keep running on their last-reconciled state, so a DR event is a change-freeze window, not a mesh outage. Peers stay connected, already-issued sessions keep working to their expiry, and traffic between nodes is unaffected — what pauses is the ability to make new changes through the control plane.

Every durable artefact is backed up, and every plane crossing (Postgres → OpenBao → Management Fleet) has a defined restore order so a restore never leaves a dangling pointer or writes stale state. The baseline below is a starting point; it may be hardened or relaxed per operator SLA.

Coverage

| Store | Method | Cadence | Retention | Priority |
| --- | --- | --- | --- | --- |
| PostgreSQL | Continuous WAL archiving to S3-compatible object store + nightly base backups. Point-in-time recovery (PITR) supported. | Continuous WAL; base every 24 h. | ≥ 35 days (compliance) + 7 years cold for audit partitions. | P0 |
| OpenBao | bao operator raft snapshot to an encrypted, access-controlled bucket independent of the primary blob store. | Every 4 h + on rotation events. | ≥ 35 days. Older snapshots cascade to cold storage. | P0 |
| Management-fleet clusters | etcd snapshots per cluster; Crossplane XR / Composition manifests also re-derivable from the plexsphere DB via the Provisioning Broker. | etcd every 15 min. | 7 days of etcd snapshots. | P1 |
| Blueprint catalogue | External to plexsphere: lives in the source Git host (GitHub / GitLab / self-hosted). plexsphere stores the pinned revision per Resource. | As per the Git host's own backup. | Indefinite. | P2 |
| Object store | AWS S3 cross-region replication (SaaS); SeaweedFS with cross-datacenter replication (self-hosted). | Continuous. | Per-Domain configured (default 90 days for action outputs, 1 year for diagnostics, 7 years for audit archive). | P0 (P2 for artefacts, P0 for the audit archive) |
| Grafana Mimir | Mimir's object-store backend (AWS S3 / SeaweedFS via S3 API) is the durability layer; Mimir replicas are stateless and rebuildable from it. | Continuous (tenant object-store writes). | Per-Domain (default 13 months). | P3 |
| Grafana Loki | Loki's object-store backend (AWS S3 / SeaweedFS via S3 API), same as Mimir. | Continuous. | Per-Domain (logs default 90 days; node-side audit default 7 years). | P0 (P2 for logs, P0 for node-side audit) |
| NATS JetStream | Raft-replicated persistent streams with replica factor ≥ 3; max_bytes and max_age tuned per Domain. No external archival: the SSE replay window is bounded and plexd's reconcile-pull covers any loss. | In-cluster continuous. | ≥ 24 h replay window. | P2 |
| Configuration-as-code | External to plexsphere: Helm values, kustomize overlays, and deployment manifests live in the operator's Git host and are re-applied on restore. | As per the Git host's own backup. | Indefinite. | P2 |

The priority column grades restore order, from P0 (restore first, nothing else starts healthily without it) down to P3 (gaps self-heal after restore). The two compound cells — object store and Loki — carry a single dominant priority for ordering because their audit-archive obligation is the most critical content they hold.
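
As a rough illustration of how the priority column can be read, the sketch below groups the stores from the coverage table by their dominant priority. The list and the helper name are hypothetical and exist only for this example; they are not a plexsphere API, and the step-by-step sequence in the restore checklist remains the authoritative order.

```python
# Store names and dominant priorities are copied from the coverage table.
# COVERAGE and by_priority are hypothetical names used only in this sketch.
COVERAGE = [
    ("PostgreSQL", "P0"),
    ("OpenBao", "P0"),
    ("Management-fleet clusters", "P1"),
    ("Blueprint catalogue", "P2"),
    ("Object store", "P0"),
    ("Grafana Mimir", "P3"),
    ("Grafana Loki", "P0"),
    ("NATS JetStream", "P2"),
    ("Configuration-as-code", "P2"),
]


def by_priority(rows):
    """Group store names by priority tier, most urgent tier (P0) first."""
    tiers = {}
    for store, priority in rows:
        tiers.setdefault(priority, []).append(store)
    # "P0" -> 0, "P1" -> 1, ... so tiers sort numerically, not lexically.
    return dict(sorted(tiers.items(), key=lambda item: int(item[0][1:])))


TIERS = by_priority(COVERAGE)
for priority, stores in TIERS.items():
    print(priority, ", ".join(stores))
```

Within a tier the table order is preserved, which keeps the output easy to compare against the table itself.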

Recovery objectives

| Plane / Function | RPO | RTO |
| --- | --- | --- |
| Data plane (mesh traffic between plexds) | n/a (not affected by plexsphere DR) | n/a |
| Control-plane reads | ≤ 5 min | ≤ 15 min |
| Control-plane writes | ≤ 5 min | ≤ 30 min |
| Session-plane issuance | Already-issued sessions continue to their JWT expiry; new issuance depends on Signing Service restore. | ≤ 30 min |
| Observability ingest | ≤ 15 min (plexd local buffering backfills short gaps) | ≤ 1 h |
| Audit archival | ≤ 1 h | ≤ 4 h (regulatory reads) |

These targets are per region in a SaaS deployment; cross-region failover is a separate procedure with its own, weaker RTO.
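
One way to use these targets during a drill is to compare measured recovery times against them. The following is a minimal sketch under stated assumptions: the RTO values are taken from the table above, while the dictionary keys, the function name, and the drill timings are invented for illustration.

```python
from datetime import timedelta

# RTO targets copied from the recovery-objectives table.
# The keys are hypothetical labels chosen for this sketch.
RTO_TARGETS = {
    "control-plane reads": timedelta(minutes=15),
    "control-plane writes": timedelta(minutes=30),
    "session-plane issuance": timedelta(minutes=30),
    "observability ingest": timedelta(hours=1),
    "audit archival": timedelta(hours=4),
}


def rto_breaches(measured):
    """Return {function: overshoot} for each measured time above its RTO."""
    breaches = {}
    for function, elapsed in measured.items():
        target = RTO_TARGETS[function]
        if elapsed > target:
            breaches[function] = elapsed - target
    return breaches


# Made-up drill timings, for illustration only.
drill = {
    "control-plane reads": timedelta(minutes=12),
    "control-plane writes": timedelta(minutes=41),
    "observability ingest": timedelta(minutes=55),
}
print(rto_breaches(drill))
```

With these sample numbers only control-plane writes miss their target, by 11 minutes.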

Restore checklist

Run these steps in order. Running them out of order leaves dangling pointers (Postgres rows that reference OpenBao paths that are not yet restored) or writes stale state.

  1. OpenBao. Restore the most recent Raft snapshot into a fresh OpenBao cluster and unseal via the configured auto-unseal root. Verify: seals are healthy and known paths read back (platform-level wrap keys, a sample Cloud Credential).
  2. PostgreSQL. Restore to the latest consistent recovery point (latest base backup + WAL up to the chosen target time); the core schema and the spicedb logical database restore in one operation. Verify: core and spicedb databases are present; if OpenBao was restored to an earlier snapshot, Postgres is rewound to that snapshot's wall-clock time so that no row references an OpenBao path written after the snapshot.
  3. Signing Service. Attach the HSM / KMS slot holding the relevant signing keys; if the original HSM is lost, rotate via the transition-window mechanism. Verify: the signing slot is reachable and serves the active key.
  4. NATS JetStream. Bring the NATS cluster back; replay data older than the retention window is lost but handled by reconcile-pull. Verify: the cluster is up; no client-side action is required.
  5. Management-fleet clusters. Restore etcd from the most recent snapshot per cluster, or rebuild clusters fresh; Crossplane + ESO reconcile the namespaces for every assigned Project. Verify: ProviderConfigs, ExternalSecrets, and XRs are re-emitted by the Provisioning Broker.
  6. SpiceDB. Start SpiceDB replicas against the restored spicedb datastore. Verify: schema read shows the schema and a sanity check against a known tuple passes; SpiceDB is green before the core binary passes its readiness probe.
  7. Core plexsphere binary + Signing Service replicas. Start with the readiness probe gating on the above dependencies. Verify: plexds reconnect the SSE stream and run reconcile-pull; peer deltas, policy, and secret deliveries catch up automatically.
  8. Observability backends. Bring up Grafana Mimir and Grafana Loki last, because a plexsphere that serves writes without metrics is still operational. Verify: metrics and logs ingest resume.
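
The ordering rule above can be sketched as a small runner that executes each step, checks its verification, and refuses to start later steps once a verification fails. This is an illustrative skeleton only: the step names follow the checklist, but `run_restore` and the placeholder callables are hypothetical and would need to be replaced with real restore and verification logic.

```python
from typing import Callable, List, Tuple

# (step name, restore action, verification check); all callables are
# hypothetical placeholders in this sketch.
Step = Tuple[str, Callable[[], None], Callable[[], bool]]

CHECKLIST = [
    "OpenBao",
    "PostgreSQL",
    "Signing Service",
    "NATS JetStream",
    "Management-fleet clusters",
    "SpiceDB",
    "Core plexsphere binary + Signing Service replicas",
    "Observability backends",
]


def run_restore(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first step whose verification fails."""
    completed: List[str] = []
    for name, restore, verify in steps:
        restore()
        if not verify():
            raise RuntimeError(
                f"verification failed at step {len(completed) + 1} ({name}); "
                "later steps were not started"
            )
        completed.append(name)
    return completed


def noop() -> None:
    """Stand-in for a real restore action."""


steps = [(name, noop, lambda: True) for name in CHECKLIST]
print(run_restore(steps))
```

Failing fast keeps a half-restored dependency from being used by the steps that follow it, which is the situation the ordering is meant to prevent.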

Seal-key custody

OpenBao's seal key is the single point that, if lost, renders every plexsphere secret unreadable. Treat its custody as the most sensitive part of the DR plan.

  • Steady-state auto-unseal. In normal operation OpenBao auto-unseals via a cloud KMS (AWS KMS / GCP Cloud KMS / Azure Key Vault), so no human handles the seal key during a routine restart.
  • Shamir break-glass. The recovery key is split with Shamir's scheme: five shares held by distinct operators, three required to reconstruct, each on physically segregated media. No single operator may unseal alone.
  • Drills. Run a restore drill every quarter into an isolated environment using the Shamir path, so the break-glass procedure is exercised before it is needed in anger.
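
To make the three-of-five threshold concrete, here is a textbook sketch of Shamir's scheme over a prime field. It is not OpenBao's implementation and must not be used to handle real key material; the prime, the function names, and the integer encoding of the secret are assumptions of this example.

```python
import secrets

# A Mersenne prime used as the field modulus; an assumption of this sketch.
PRIME = 2**127 - 1


def split(secret: int, shares: int = 5, threshold: int = 3):
    """Split `secret` into points on a random polynomial of degree threshold-1."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the chosen field")
    # The constant term is the secret; higher coefficients are random.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        # Horner evaluation of the polynomial modulo PRIME.
        acc = 0
        for coeff in reversed(coeffs):
            acc = (acc * x + coeff) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def combine(points):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = secrets.randbelow(PRIME)
parts = split(key)
print(combine(parts[:3]) == key)
```

Any three of the five shares reconstruct the key, while two shares interpolate a different polynomial and reveal nothing useful about it.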

Snapshot-force caveat. A bao operator raft snapshot restore -force writes the source cluster's encryption keyring into the target cluster. The restored cluster must therefore be unsealed with the source snapshot's seal/unseal key, not the target cluster's own key. An unseal attempt with the wrong key fails and leaves the restored cluster sealed; if the source key has been lost, it stays sealed for good. Custody of the source key must travel with the snapshot.

Regional failover

Failover is a promotion, not a live migration. Domains and Projects are region-pinned at creation, so each region runs its own core replicas, Signing Service, pub/sub, Postgres primary, OpenBao cluster, and management-fleet cluster. Bringing a region online means promoting its standby, not streaming live traffic across region boundaries.

  • Warm standby. Postgres and OpenBao replicate to the standby region asynchronously, giving a cross-region RPO of roughly ≤ 5 min for Postgres and ≤ 15 min for OpenBao snapshots.
  • Promotion. The standby region brings up an empty management cluster and re-applies plexsphere state for the promoted Projects; Crossplane and ESO then reconcile their namespaces.
  • Residency opt-out. Residency-restricted Domains can opt out of cross-region standby and run in-region DR only, keeping their data within a single region's boundary.
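
The promotion and opt-out rules above can be illustrated with a small planning helper that decides, for a failed region, which Domains are promoted in the standby region and which wait for in-region DR. The `Domain` record, its field names, and the region names are hypothetical and only serve this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Domain:
    # Hypothetical record; field names are assumptions of this sketch.
    name: str
    home_region: str
    cross_region_standby: bool  # False when a residency-restricted Domain opted out


def plan_failover(
    domains: Iterable[Domain], failed_region: str
) -> Tuple[List[str], List[str]]:
    """Return (promote_in_standby, in_region_dr_only) for the failed region."""
    promote: List[str] = []
    in_region_only: List[str] = []
    for domain in domains:
        if domain.home_region != failed_region:
            continue  # Domains pinned to other regions are not touched.
        if domain.cross_region_standby:
            promote.append(domain.name)
        else:
            in_region_only.append(domain.name)
    return promote, in_region_only


domains = [
    Domain("acme", "eu-west", True),
    Domain("health-eu", "eu-west", False),
    Domain("globex", "us-east", True),
]
print(plan_failover(domains, "eu-west"))
```

Because Domains are region-pinned, the helper never moves a Domain whose home region is still healthy.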

For the deployment model behind region pinning and per-region ingress, see the multi-region runbook. The per-region overlays live under deploy/local/overlays/region-reference/.

What plexsphere does not back up

Some state is deliberately not backed up because it is reconstructible or short-lived. Each item below names its recovery path.

  1. Plexd agent state on nodes: plexd is deterministic given plexsphere's source of truth. Recovery: after a control-plane restore it reconciles back to consistency via SSE replay and reconcile-pull.
  2. Customer-side secret stores: the customer Vault / AWS Secrets Manager used for Adopted-K8s bootstrap-token delivery belongs to the customer. Recovery: covered by the customer's own DR plan.
  3. WireGuard keys on plexd nodes: mesh keys are rotated when needed rather than restored. Recovery: re-established via rotate_keys events on reconnect.
  4. Bootstrap-token plaintext: bootstrap tokens are short-TTL and one-time-use, and only their hashes are persisted server-side. Recovery: mint a fresh bootstrap token for any pending enrolment.
  5. Transient SSE state: the signed SSE replay window is bounded and intentionally not archived. Recovery: plexd reconcile-pull reconstructs current state after the window is lost.

Automation reference

This branch ships reference automation that exercises the restore sequence and integrity end to end:

  • Reference overlay. A four-hourly OpenBao raft-snapshot CronJob and a nightly Postgres-dump CronJob live under deploy/local/overlays/dr-reference/. This is a reference overlay — it is not applied by make dev, and the dev OpenBao runs -dev/in-memory so it cannot snapshot. Treat it as a worked example to adapt, not a turnkey production deployment.
  • CLI. The plexsphere-backup CLI exposes the catalog, sequence, put, get, openbao-snapshot, openbao-restore, and verify subcommands for inspecting the coverage matrix and driving snapshot/restore round-trips.
  • Read-only inspection surfaces. The same coverage matrix and ordered restore plan are also exposed read-only through the platform-scoped GET /v1/platform/backup/catalog and GET /v1/platform/backup/restore-plan operations, and over the plexctl backup catalog and plexctl restore plan command surfaces — so an operator can read the catalogue and the restore sequence without the plexsphere-backup binary on the box. These surfaces are inspection-only: they project the plan, they do not stream artefacts or drive a restore.
  • Tests. The DR integration suite (tests/integration/dr_*) and the regional-failover e2e suite (tests/e2e/dr/) prove the sequence and integrity against real Postgres and OpenBao round-trips.
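
For the read-only inspection surfaces, a client could be sketched as follows. Only the two paths come from this runbook; the base URL, the bearer-token authentication, and the helper names are assumptions, and the response schema is deliberately not interpreted because it is not specified here.

```python
import json
import urllib.request

# Paths are taken from the runbook. BASE_URL and the bearer-token header
# are assumptions of this sketch, not documented plexsphere behaviour.
BASE_URL = "https://plexsphere.example.internal"  # hypothetical host
CATALOG_PATH = "/v1/platform/backup/catalog"
RESTORE_PLAN_PATH = "/v1/platform/backup/restore-plan"


def build_request(
    path: str, token: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a read-only GET request; nothing is sent until fetch_json runs."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        method="GET",
    )


def fetch_json(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the body as JSON without assuming a schema."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


plan_request = build_request(RESTORE_PLAN_PATH, token="REPLACE_ME")
print(plan_request.get_method(), plan_request.full_url)
```

Separating request construction from sending makes it possible to review exactly what would be called before touching a live control plane.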

DECISION: the repo's automation proves the restore sequence and integrity via logical pg_dumpall + raft-snapshot round-trips, and deliberately does not encode production-grade point-in-time recovery (continuous WAL archiving). Rejected alternative: ship a pinned WAL-archiver deployment (a configured archive command + base-backup + recovery-target tooling) in this reference overlay. It was rejected because continuous WAL archiving is deployment-specific — the object store, retention policy, and recovery-target tooling are operator choices — so a pinned archiver would either hard-code one operator's substrate (misleading every other deployment) or carry so many placeholders it stops being runnable. The portable, verifiable contract the automation can prove is the restore ordering and integrity; PITR wiring stays documented as an operator responsibility above.