
Failure modes and degradation

This runbook describes how a plexsphere API replica behaves when one of its backing dependencies is briefly unavailable, and how to detect, interpret, and recover from each case. The governing principle is "plexsphere down ≠ mesh down": a transient SpiceDB, Postgres, NATS, OpenBao, or Signing-Service outage must degrade a single surface, not pull every replica out of the Kubernetes Service and convert a partial outage into a total control-plane outage.

Two design rules make that hold:

  1. Boot-gate-then-degrade readiness. A dependency probe keeps a fresh replica out of rotation until that dependency is first reachable. After that first success the probe never fails the replica again: a later outage is reported as informational, and the replica keeps serving every surface that does not depend on the failed dependency.
  2. Structured degradation, never a generic 5xx. Every dependency outage surfaces as an RFC 9457 problem document with a stable, machine-readable code, so a client can branch on the code (back off, fail over, page) instead of string-matching opaque 500s. The underlying error never reaches the response body — the cause is in the server's structured log record only, and the problem detail is a fixed, generic, retry-oriented string.
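
A minimal sketch of how a client might branch on the problem code rather than on response text. The media type application/problem+json comes from RFC 9457; the extension member name "code", the Decision type, and the classify helper are assumptions made for illustration, not the shipped client.

```python
from dataclasses import dataclass

# Problem codes named in this runbook; all of them signal a transient outage.
RETRYABLE_CODES = {
    "authz_unavailable",
    "datastore_unavailable",
    "event_stream_unavailable",
    "openbao_unavailable",
    "signing_unavailable",
}


@dataclass
class Decision:
    action: str  # "ok", "retry", or "fail"
    reason: str


def classify(status: int, content_type: str, body: dict) -> Decision:
    """Decide how to react to a response by branching on the problem code."""
    media_type = content_type.split(";")[0].strip().lower()
    is_problem = media_type == "application/problem+json"
    # Assumption: the stable code is carried in a "code" extension member.
    code = body.get("code") if is_problem else None
    if status == 503 and code in RETRYABLE_CODES:
        return Decision("retry", code)
    if status >= 500:
        return Decision("fail", code or "unstructured_5xx")
    return Decision("ok", "")
```

A caller would typically back off and retry on "retry", and page or fail over on "fail"; the detail string is never parsed.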

The readiness model

GET /readyz aggregates the registered probes into one of three top-level states:

| status | HTTP | Meaning |
|---|---|---|
| ready | 200 | Every probe passes. |
| degraded | 200 | Every non-latched probe passes; one or more boot-latched probes failed after their first success. The replica stays in the Service. |
| unhealthy | 503 | A non-latched probe failed, or a boot-latched probe has not yet succeeded once. The replica is pulled from the Service. |

A boot-latched probe is registered with the platform health registry's boot-latch option. Before its first success it gates hard (/readyz returns 503 unhealthy); after its first success a later failure is reported with "ok": false, "informational": true in that probe's entry, and the top-level status drops to degraded (still HTTP 200) rather than unhealthy.
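
The aggregation rule can be sketched as follows. This is an illustrative model only: the Probe type, the has_succeeded flag, and the aggregate function are hypothetical names, not the platform health registry's API.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    name: str
    boot_latched: bool = False
    has_succeeded: bool = False  # set on first success, never cleared


def aggregate(results):
    """Fold (probe, ok) pairs into (http_status, top_level_status, entries)."""
    status = "ready"
    entries = []
    for probe, ok in results:
        if ok:
            probe.has_succeeded = True
        # A failure is informational only for a latched probe that has
        # already succeeded at least once.
        informational = (not ok) and probe.boot_latched and probe.has_succeeded
        entry = {"name": probe.name, "ok": ok}
        if informational:
            entry["informational"] = True
        entries.append(entry)
        if ok:
            continue
        if informational:
            if status == "ready":
                status = "degraded"
        else:
            status = "unhealthy"  # hard failure always wins
    http_status = 503 if status == "unhealthy" else 200
    return http_status, status, entries
```

Calling aggregate three times for the same latched probe with failing, passing, then failing results walks through unhealthy, ready, and degraded in that order.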

The boot-latched dependency probes on a freshly booted dev stack are:

| Probe name | Dependency |
|---|---|
| db-primary | Postgres (primary datastore) |
| jetstream | NATS JetStream (signed event bus) |
| spicedb-readiness | SpiceDB (ReBAC engine) |
| secrets-reconcile | OpenBao (secret backend) |

No probe is registered for the Signing Service. The signer is dialed only on the issuance hot path; its health is alertable from the issuance error counters and the signing_unavailable problem code rather than from a boot probe — see Signing Service below.

Each probe's pass/fail state is also published as a Prometheus gauge, health_probe_up{probe="<name>"}, which reads 1 while the probe passes and 0 while it fails (including the post-latch informational case). Alert on health_probe_up == 0 for any probe to surface a degraded dependency, and join it with the per-dependency counters below to scope the blast radius.

curl http://<replica>:8080/readyz on a healthy replica lists entries for db-primary, jetstream, spicedb-readiness, and secrets-reconcile (among the non-dependency probes); the absence of one of those four is itself a wiring regression worth investigating.
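
A small audit of a parsed /readyz body might look like the sketch below. The payload shape (a top-level "probes" list whose entries carry "name", "ok", and "informational") is an assumption for illustration; only the four probe names and the two per-entry flags come from this runbook.

```python
EXPECTED_PROBES = (
    "db-primary",
    "jetstream",
    "spicedb-readiness",
    "secrets-reconcile",
)


def audit_readyz(payload: dict) -> dict:
    """Map each expected dependency probe to missing/up/degraded/down."""
    # Assumed shape: {"status": "...", "probes": [{"name": ..., "ok": ...}]}
    by_name = {entry["name"]: entry for entry in payload.get("probes", [])}
    report = {}
    for name in EXPECTED_PROBES:
        entry = by_name.get(name)
        if entry is None:
            report[name] = "missing"  # wiring regression
        elif entry.get("ok"):
            report[name] = "up"
        elif entry.get("informational"):
            report[name] = "degraded"  # post-latch failure
        else:
            report[name] = "down"  # still gating the replica
    return report
```

Any "missing" value in the report corresponds to the wiring regression described above.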

SpiceDB (authorization)

SpiceDB is the ReBAC engine every gated operation consults.

Detection. health_probe_up{probe="spicedb-readiness"} drops to 0 and that probe reports informational on /readyz. Any ReBAC-gated operation that cannot be served from the stale cache returns HTTP 503 with problem code authz_unavailable (never a generic 500 — no handler fails open). The stale-serve counter authz_stale_check_served_total{outcome="allow|deny"} and the gauge authz_stale_check_cache_size show how much load the last-known-good cache is absorbing during the outage.

Degraded behaviour. A read-mostly Check whose (subject, relation, object) tuple is in the last-known-good cache, was cached within SPICEDB_CHECK_STALE_TTL, and carries an empty caveat context is served from cache and audited with the distinct reason stale_cache. Three guards keep this safe:

  • Caveated checks (a non-empty caveat context — e.g. a time- or CIDR-sensitive grant) are never served stale, because a stale decision could silently become a fail-open on a condition that has since changed.
  • Write, Delete, LookupResources, and LookupSubjects never consult the cache — they return authz_unavailable immediately, so a mutation never proceeds against a stale view.
  • The cache is TTL-bounded and capped at SPICEDB_CHECK_STALE_MAX_ENTRIES. Once full, warming a new triple evicts the oldest cached decision (a least-recently-inserted drop), so the cache size stays bounded and authz_stale_check_cache_size holds at the cap. Evicting the oldest keeps the freshest decisions available during an outage; either way an evicted triple simply misses and surfaces authz_unavailable, never a fail-open.
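
The three guards can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the shipped cache: the class and method names are hypothetical, caveated checks are assumed never to be warmed, and re-warming an existing triple is assumed to count as a fresh insertion.

```python
from collections import OrderedDict


class StaleCheckCache:
    """Last-known-good Check decisions, TTL-bounded and size-capped."""

    def __init__(self, ttl_seconds: float, max_entries: int):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # triple -> (allowed, cached_at)

    def __len__(self):
        return len(self._entries)

    def warm(self, triple, allowed, caveat_context, now):
        """Record a fresh decision; a TTL of 0 disables the cache."""
        if self.ttl <= 0 or self.max_entries <= 0 or caveat_context:
            return
        self._entries.pop(triple, None)  # assumption: refresh re-inserts
        while len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest insertion
        self._entries[triple] = (allowed, now)

    def serve_stale(self, triple, caveat_context, now):
        """Return the cached decision, or None for authz_unavailable."""
        if caveat_context:
            return None  # caveated checks are never served stale
        hit = self._entries.get(triple)
        if hit is None:
            return None  # cold or evicted triple
        allowed, cached_at = hit
        if now - cached_at > self.ttl:
            return None  # older than the stale TTL
        return allowed
```

Write, Delete, LookupResources, and LookupSubjects would simply never call serve_stale in this model; every None result maps to authz_unavailable rather than to an allow.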

Tuning. SPICEDB_CHECK_STALE_TTL (default 30s; 0 disables the stale cache entirely) bounds how old a served decision may be. SPICEDB_CHECK_STALE_MAX_ENTRIES (default 10000) bounds the cache size. Shorten the TTL to tighten the staleness window at the cost of more authz_unavailable responses during an outage; lengthen it to ride out longer SpiceDB blips at the cost of a longer stale window.

Recovery. When SpiceDB returns, the next fresh Check repopulates the cache and authz_stale_check_served_total stops incrementing; health_probe_up{probe="spicedb-readiness"} returns to 1 and /readyz returns to ready.

Postgres (datastore / node state)

Postgres is the primary datastore. The read-mostly node-state path survives a brief outage via a per-node stale-snapshot cache; write funnels do not.

Detection. health_probe_up{probe="db-primary"} drops to 0 and that probe reports informational on /readyz (which stays HTTP 200 degraded). GET /v1/nodes/{id}/state for a Node with no fresh-enough cached snapshot returns HTTP 503 with problem code datastore_unavailable. The counter state_stale_snapshot_served_total and the gauge state_stale_snapshot_cache_size show how many reads are being served from cache.

Degraded behaviour. A Node whose last successful snapshot is cached and fresher than PLEXSPHERE_STATE_STALE_TTL keeps getting HTTP 200 with that snapshot, so an already-enrolled node keeps its desired state instead of seeing a generic 500. A cold or expired cache entry returns 503 datastore_unavailable rather than a stale serve. Writes keep returning 5xx during a Postgres outage by design — the contract for write funnels is only that they fail loudly, not that they degrade to a structured outage code.

Tuning. PLEXSPHERE_STATE_STALE_TTL (default 5m; 0 disables the stale cache) bounds how old a served snapshot may be. PLEXSPHERE_STATE_STALE_MAX_ENTRIES (default 10000) caps the cache size; when the cache is full, the oldest cached snapshot is evicted, as in the authz cache.

Recovery. When Postgres returns, the next successful snapshot read refreshes the cache, state_stale_snapshot_served_total stops incrementing, and fresh (non-cached) snapshots resume.

NATS JetStream (signed event bus)

JetStream backs the Server-Sent-Events stream that pushes signed state deltas to node agents. The state pull path does not depend on it.

Detection. health_probe_up{probe="jetstream"} drops to 0 and that probe reports informational on /readyz (HTTP 200 degraded). A fresh GET /v1/nodes/{id}/events subscribe returns HTTP 503 with problem code event_stream_unavailable and a fixed generic detail — the underlying subscribe error is in the server log record only, never in the response body.

Degraded behaviour. While JetStream is down, the SSE subscribe fails fast with event_stream_unavailable, but GET /v1/nodes/{id}/state keeps returning HTTP 200 — the state path reads the snapshot and never touches JetStream, so a node agent that polls /state keeps its desired configuration. "plexsphere down ≠ mesh down" holds: existing peer connections and already-pushed deltas are unaffected.
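
From the node agent's side, the fallback could be sketched as below. The function name and the injected subscribe and fetch_state callables are hypothetical stand-ins for the two HTTP calls; the real agent's logic may differ.

```python
from typing import Callable, Optional, Tuple


def choose_sync_path(
    subscribe: Callable[[], Tuple[int, Optional[str]]],
    fetch_state: Callable[[], Tuple[int, Optional[dict]]],
) -> Tuple[str, Optional[dict]]:
    """Pick "stream", "poll", or "backoff" for one sync attempt.

    subscribe() returns (http_status, problem_code) for the events endpoint;
    fetch_state() returns (http_status, snapshot) for the state endpoint.
    """
    status, code = subscribe()
    if status == 200:
        return "stream", None
    if status == 503 and code == "event_stream_unavailable":
        # The state path never touches JetStream, so try it directly.
        state_status, snapshot = fetch_state()
        if state_status == 200:
            return "poll", snapshot
    return "backoff", None
```

The point of the sketch is that event_stream_unavailable selects the polling path instead of being treated as a total control-plane failure.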

Recovery. When NATS returns, the messaging client reconnects, a fresh subscribe succeeds, and the relay drains any outbox rows that accumulated during the outage so replay resumes.

OpenBao (secret backend)

OpenBao stores Cloud Credentials and Project secrets. A sealed or unreachable backend degrades the secret-read edge and stalls the credential sweepers.

Detection. A node-secret read against a sealed or unreachable OpenBao returns HTTP 503 with problem code openbao_unavailable. The credential-broker sweep ticks that are blocked by the outage increment secretstore_unavailable_errors_total{consumer="<broker>"} and each logs a WARN naming the affected subsystem and the operator action. After the secrets-reconcile probe's first success, an OpenBao outage leaves /readyz at HTTP 200 (the probe reports informational), so a sealed backend does not pull the replica out of rotation.

Degraded behaviour. New secret reads fail with openbao_unavailable, so Cloud Credential materialisation into Kubernetes Secrets and Project secret reads from nodes stall and new provisioning cannot start. Already-delivered, NSK-wrapped secrets on nodes remain usable — the ciphertext is local to plexd — and already-issued sessions are unaffected.

Divergence note. The project README's degradation table uses the illustrative string secret-backend-unavailable for this case. The shipped wire code is openbao_unavailable, and the wire code is authoritative: renaming it would be a breaking change for zero behavioural gain, so the divergence is documented here and in the OpenAPI Problem.code description rather than reconciled. Operators and clients branch on openbao_unavailable.

Recovery. Unseal (or restore reachability to) OpenBao; the next secret read succeeds, secretstore_unavailable_errors_total stops incrementing, and the credential sweepers resume on their next tick. See the secrets context for the secret-read contract.

Signing Service

The Signing Service mints session tokens and signs SSE event frames. Unlike the four dependencies above it has no boot probe — it is dialed only on the issuance and publish hot paths.

Detection. POST /v1/projects/{project_id}/sessions during a signer outage returns HTTP 503 with problem code signing_unavailable; no Session is persisted and no token is issued, so a retry after recovery is safe (idempotent). On the publish side, the SSE relay increments two counters that are the load-bearing alerting signals for a signer outage:

  • sse_publisher_sign_failures_total{reason} — a publish attempt could not sign an event frame.
  • sse_relay_skip_total{reason} — the relay loop skipped an outbox row (e.g. reason="no_signing_key") and left it unacked for retry.

Alert on a sustained increase in either counter together with the signing_unavailable 503 rate, rather than on a probe. The signer outage deliberately does not raise a tenancy integrity_alert outbox event — that aggregate is node-agent-scoped (binary/hook checksums, SSH host keys), and operator paging for a control-plane signer outage maps to these metrics and the structured 503 instead.

Degraded behaviour. New session issuance and new signed-event publishes fail; already-issued sessions keep working to their expiry, and the relay cursor does not advance past an unsigned row, so no event is lost.
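
The "cursor does not advance past an unsigned row" guarantee can be modelled in a few lines. This is a simplified sketch: SignerUnavailable, relay_pass, and the list-backed outbox are hypothetical, and the Counter stands in for sse_relay_skip_total{reason}.

```python
from collections import Counter


class SignerUnavailable(Exception):
    """Raised by the (hypothetical) sign callable when no signature is possible."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


def relay_pass(outbox, cursor, sign, publish, skips: Counter) -> int:
    """Publish rows starting at cursor; stop at the first unsignable row.

    Returns the new cursor. The unsigned row stays at the cursor position,
    so the next pass retries it and no event is dropped.
    """
    while cursor < len(outbox):
        row = outbox[cursor]
        try:
            signature = sign(row)
        except SignerUnavailable as exc:
            skips[exc.reason] += 1  # alerting signal, labelled by reason
            break
        publish(row, signature)
        cursor += 1  # advance only after a signed publish
    return cursor
```

Running one pass during the outage and another after recovery shows the counter incrementing once and every row eventually being published in order.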

Recovery. When the signer returns, issuance succeeds on retry and the relay drains the rows it skipped. See the SSE context for the publish/relay contract and the observability ingest context for the back-pressure path that shares the same structured-degradation philosophy.

Quick reference

| Dependency | Probe | health_probe_up{probe=…} | Problem code | Degradation metric(s) | Tuning knob(s) |
|---|---|---|---|---|---|
| SpiceDB | spicedb-readiness | spicedb-readiness | authz_unavailable | authz_stale_check_served_total, authz_stale_check_cache_size | SPICEDB_CHECK_STALE_TTL, SPICEDB_CHECK_STALE_MAX_ENTRIES |
| Postgres | db-primary | db-primary | datastore_unavailable | state_stale_snapshot_served_total, state_stale_snapshot_cache_size | PLEXSPHERE_STATE_STALE_TTL, PLEXSPHERE_STATE_STALE_MAX_ENTRIES |
| NATS JetStream | jetstream | jetstream | event_stream_unavailable | (probe gauge) | |
| OpenBao | secrets-reconcile | secrets-reconcile | openbao_unavailable | secretstore_unavailable_errors_total | |
| Signing Service | (none) | | signing_unavailable | sse_publisher_sign_failures_total, sse_relay_skip_total | |

See also

  • Disaster recovery — restoring the control plane after a full disaster (a different, larger-blast-radius event than a single dependency blip).
  • Secrets context — the node-secret read contract behind openbao_unavailable.
  • SSE context — the signed event bus, the relay loop, and the signer alerting signals.
  • Observability ingest context — the ingest back-pressure path that shares the structured-degradation contract.