Capacity and scale targets
This runbook covers the per-Domain capacity-and-scale collector: the six scale dimensions plexsphere tracks, their Phase-1 target defaults, how the collector samples usage and computes a used / target ratio, the 80% crossing audit contract, the structured refusal a hard ceiling produces, and the load-test harness operators use to validate a target against a real deployment.
The numbers below are design-phase orientation figures, not SLAs. They size a single deployment on the HA minimums, are operator-overridable defaults, and Phase-1 load-testing may move individual targets up or down. Exceeding a target is an operator-visible signal — never a silent failure — surfaced through the Dashboard capacity view, the Platform Audit Log, and Prometheus.
The six dimensions
The collector tracks six axes per Domain. Each has a unit, a Phase-1 default target, and an owning enforcement point that defines what the number means.
| Dimension | Unit | Phase-1 default target | Owning enforcement point |
|---|---|---|---|
| `nodes` | count | 10 000 | Identity & Registration: sustained enrolled-plexd ceiling per Domain (bursts to 15 000 during mass rollouts are out of scope for the sustained target). |
| `sse_fanout` | events/sec | 1 000 | Signed Event Bus (mesh SSE): sustained SSE fan-out per Domain. |
| `secret_reads` | reads/sec | 10 000 | Secret Store: cluster-wide sustained secret-read rate per Domain (NSK-rewrap dominated). |
| `mediated_sessions` | count | 500 | Access Orchestrator: concurrent mediated sessions per Domain across all kinds (ssh + k8s + tcp). |
| `observability_ingest` | bytes/sec | 5 MiB/s (5 242 880 bytes/s) | Observability Ingest: aggregated per-Domain byte budget across metrics + logs + audit. |
| `action_executions` | count | 1 000 | Action Orchestrator: concurrent action executions per Domain. |
Ingest unit note. The project README states the ingest target in prose as "5 MB/sec", but the collector tracks it as 5 MiB/s (5 × 1024 × 1024 = 5 242 880 wire bytes/sec, binary) to stay consistent with the gzip-compressed wire-byte counters the Observability Ingest quota debits. A 1000-based reading would drift roughly 4.6% below the enforced cap, so the binary figure is authoritative.
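The size of that gap can be checked with a few lines of arithmetic. This is only a worked calculation; the variable names are illustrative and do not come from the collector.

```python
# Binary (MiB) target the collector enforces vs. a decimal (MB) reading of "5 MB/sec".
MIB_TARGET = 5 * 1024 * 1024   # 5 242 880 bytes/s
MB_READING = 5 * 1000 * 1000   # 5 000 000 bytes/s

# How far the decimal reading sits below the enforced cap, as a percentage of the cap.
drift_pct = (MIB_TARGET - MB_READING) / MIB_TARGET * 100

print(f"{MIB_TARGET} vs {MB_READING}: {drift_pct:.1f}% below the cap")
# prints "5242880 vs 5000000: 4.6% below the cap"
```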
A target of 0 means "no target configured": the collector skips the ratio for that dimension rather than dividing by zero.
How the collector works
A platform background collector samples each Domain's usage on a fixed interval (default 15 seconds), computes a used / target ratio per dimension, and publishes the result as a per-Domain snapshot.
There are two source kinds:
- Level sources (`nodes`, `mediated_sessions`, `action_executions`): an absolute tally read directly each tick via per-Domain `COUNT(*) … GROUP BY domain_id` queries over the live rows (enrolled nodes / live access sessions / live executions).
- Counter sources (`sse_fanout`, `secret_reads`, `observability_ingest`): a sustained rate computed as the delta of a cumulative counter divided by the elapsed interval. The first sample reports rate `0`, because there is no prior sample to take a delta against.
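The counter-source rate and the zero-target rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the collector's code: `CounterSource`, `sample`, and `capacity_ratio` are hypothetical names, and the sketch assumes the cumulative counter only ever increases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterSource:
    """Turns a cumulative counter into a sustained rate (hypothetical helper)."""
    last_value: Optional[float] = None
    last_time: Optional[float] = None

    def sample(self, value: float, now: float) -> float:
        """Return the per-second rate since the previous sample."""
        if self.last_value is None or now <= self.last_time:
            # First sample (or no elapsed time): no delta is available, report 0.
            rate = 0.0
        else:
            rate = (value - self.last_value) / (now - self.last_time)
        self.last_value, self.last_time = value, now
        return rate


def capacity_ratio(used: float, target: float) -> Optional[float]:
    """used / target, or None when target is 0 ("no target configured")."""
    if target == 0:
        return None
    return used / target


sse = CounterSource()
first = sse.sample(value=120_000, now=0.0)    # first sample: rate 0
second = sse.sample(value=132_000, now=15.0)  # 12 000 events over 15 s
print(first, second, capacity_ratio(second, 1_000), capacity_ratio(second, 0))
# prints "0.0 800.0 0.8 None"
```

A level source needs no such state: its tally for the tick would go straight into `capacity_ratio`.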
The snapshot is per-Domain. Until the first sample for a Domain completes, that Domain's snapshot is unavailable and the read endpoint returns HTTP 503 with a Retry-After header; the wait shrinks as you tighten the sample interval (see the knob below).
The 80% crossing audit contract
Crossing detection is edge-triggered with hysteresis: a crossing fires exactly once when a dimension's ratio reaches 0.80 from below, and re-arms only after the ratio drops back below 0.75. The 0.05 deadband keeps a ratio hovering at the threshold from flapping. A restart re-fires one entry per still-crossed dimension, because crossing state is intentionally not persisted.
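The edge-trigger and deadband described above can be sketched as a small state machine. This is an illustration only: `CrossingDetector` and `observe` are hypothetical names, and the sketch assumes one detector instance per Domain and dimension.

```python
from dataclasses import dataclass

FIRE_AT = 0.80      # a crossing fires when the ratio reaches this from below
REARM_BELOW = 0.75  # the detector re-arms only once the ratio drops below this


@dataclass
class CrossingDetector:
    """Edge-triggered 80% detector with a 0.05 deadband (hypothetical helper)."""
    crossed: bool = False  # held in memory only, so a restart starts un-crossed

    def observe(self, ratio: float) -> bool:
        """Return True exactly when this sample should emit a crossing."""
        if not self.crossed and ratio >= FIRE_AT:
            self.crossed = True
            return True
        if self.crossed and ratio < REARM_BELOW:
            self.crossed = False  # re-armed; the next rise to 0.80 fires again
        return False


detector = CrossingDetector()
ratios = [0.70, 0.80, 0.82, 0.78, 0.81, 0.74, 0.80]
fired = [detector.observe(r) for r in ratios]
print(fired)
# prints "[False, True, False, False, False, False, True]"
```

The dip to 0.78 stays inside the deadband, so the return to 0.81 does not fire again. A fresh `CrossingDetector()` that first observes 0.82 fires immediately, which mirrors the one re-fired entry per still-crossed dimension after a restart.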
Each crossing writes one row to the Platform Audit Log on the addressed Domain's hash chain:
- Subject: `system:capacity-monitor`, a synthetic system principal, because the collector has no human caller.
- Object: `domain:<domain-uuid>`.
- Reason: `granted`.
- Relation: one of these six exact strings, one per dimension:
  - `capacity.nodes.threshold_crossed`
  - `capacity.sse_fanout.threshold_crossed`
  - `capacity.secret_reads.threshold_crossed`
  - `capacity.mediated_sessions.threshold_crossed`
  - `capacity.observability_ingest.threshold_crossed`
  - `capacity.action_executions.threshold_crossed`
The ratio value itself is deliberately not carried on the audit row: the audit CaveatContext is a names-only contract. The audit row records that a Domain crossed 80%; the plexsphere_capacity_ratio gauge below records by how much. Read the two together — the audit log for the discrete crossing event, the gauge for the live magnitude.
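As a rough sketch of the row shape, the four fields can be assembled like this. The dictionary keys and the `crossing_audit_row` helper are assumptions made for illustration; the actual audit row encoding is not specified here. Note that no ratio is included, in line with the names-only contract.

```python
DIMENSIONS = (
    "nodes", "sse_fanout", "secret_reads",
    "mediated_sessions", "observability_ingest", "action_executions",
)


def crossing_audit_row(domain_uuid: str, dimension: str) -> dict:
    """Build the fields of one 80% crossing entry (hypothetical representation)."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown capacity dimension: {dimension!r}")
    return {
        "subject": "system:capacity-monitor",
        "object": f"domain:{domain_uuid}",
        "reason": "granted",
        "relation": f"capacity.{dimension}.threshold_crossed",
    }


row = crossing_audit_row("<domain-uuid>", "secret_reads")
print(row["relation"])
# prints "capacity.secret_reads.threshold_crossed"
```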
The capacity_exceeded refusal
There is no silent throttling. When a per-Domain hard ceiling is hit, the offending request receives an RFC 9457 `application/problem+json` response with HTTP 429, a `code` of `capacity_exceeded`, an optional `dimension` member naming the scale axis, and a `Retry-After` header so automation backs off cleanly:

```json
{
  "type": "https://plexsphere.dev/errors/capacity-exceeded",
  "title": "Too Many Requests",
  "status": 429,
  "detail": "the per-Domain observability ingest budget is exhausted; retry after the window resets.",
  "code": "capacity_exceeded",
  "dimension": "observability_ingest"
}
```

Today the `dimension` member is set by exactly the two dimensions that enforce a hard ceiling at request time:

- `observability_ingest`: the per-Domain ingest byte budget.
- `action_executions`: the per-Domain live-execution cap.
The other four dimensions are observed-and-audited but not refused at this surface: a node, session, secret-read, or SSE-fan-out overshoot shows up in the Dashboard ratio and the 80% audit crossing, not as a capacity_exceeded response.
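A client that wants to back off cleanly could recognise the refusal along these lines. This is a sketch, not a shipped client: `backoff_seconds` is a hypothetical helper, and it assumes `Retry-After` carries delta-seconds and that the caller passes headers under their canonical names.

```python
import json
from typing import Mapping, Optional


def backoff_seconds(status: int, headers: Mapping[str, str], body: bytes,
                    default: float = 1.0) -> Optional[float]:
    """Return how long to wait for a capacity_exceeded refusal, else None."""
    if status != 429:
        return None
    try:
        problem = json.loads(body)
    except ValueError:
        return None  # not a parseable problem document
    if not isinstance(problem, dict) or problem.get("code") != "capacity_exceeded":
        return None
    # Assumption: Retry-After is in seconds; fall back to a default if it is not.
    try:
        return max(0.0, float(headers.get("Retry-After", "")))
    except ValueError:
        return default


body = json.dumps({
    "status": 429,
    "code": "capacity_exceeded",
    "dimension": "observability_ingest",
}).encode()
wait = backoff_seconds(429, {"Retry-After": "30"}, body)
print(wait)
# prints "30.0"
```

Because `dimension` is optional, the sketch keys only on `code` and treats the axis name as extra context for logging.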
Metrics
The collector exposes these Prometheus series:
- `plexsphere_capacity_used{domain_id,dimension}`: the absolute tally or sustained rate from the latest sample.
- `plexsphere_capacity_ratio{domain_id,dimension}`: `used / target`; the live magnitude behind the 80% audit crossing.
- `plexsphere_capacity_target{dimension}`: the configured target (Domain-independent label set).
- `plexsphere_capacity_crossings_total{domain_id,dimension}`: count of 80% crossings fired.
- `plexsphere_capacity_crossing_record_failures_total`: count of audit writes that failed when recording a crossing; a sustained increase means crossings are happening but not landing on the hash chain, so alert on it.
Alert on plexsphere_capacity_ratio approaching 1.0 per dimension, and join it with plexsphere_capacity_crossings_total to scope which Domains are pushing a ceiling.
Tuning the sample interval
PLEXSPHERE_CAPACITY_SAMPLE_INTERVAL is a Go-duration env var (e.g. 30s, 1m) that overrides the default 15-second sample cadence. An empty or unset value keeps the default.
- Tighten it (shorter than 15 s) to make the snapshot fresher and shorten the `Retry-After` window before a Domain's first sample lands, at the cost of more sampling load.
- Loosen it (longer than 15 s) to reduce sampling load on Postgres and the counters, at the cost of a staler ratio and a longer wait before the snapshot is first available.
Runbook: driving a load test
make load-test drives one capacity dimension via the tests/load harness against a provisioned deployment. Select the axis with DIMENSION= — one of nodes, sse-fanout, secret-reads, sessions, ingest, or actions (default ingest).
Two variables are required and have no safe default:
- `LOAD_DOMAIN_ID=<domain-uuid>`: the Domain to drive.
- `LOAD_TOKEN=<bearer-token>`: a bearer token authorised for that Domain.
The optional variables are LOAD_BASE_URL (default http://localhost:8080), LOAD_RATE (default 100), LOAD_DURATION (default 30s), and LOAD_RAMP (default 5s).
```shell
make load-test DIMENSION=ingest LOAD_DOMAIN_ID=<uuid> LOAD_TOKEN=<token>
make load-test DIMENSION=actions LOAD_DOMAIN_ID=<uuid> LOAD_TOKEN=<token>
```

Reading the result. The driver reports p50/p95/p99 latencies and buckets every response by its RFC 9457 Problem code. It exits non-zero if the target rate is not sustained, or if a refusal code outside this expected set appears:

- `capacity_exceeded`
- `per_node_rate_limited`
- `per_domain_rate_limited`
- `session_limit_exceeded`
Those four refusals are the system correctly defending its ceilings under load — not harness failures. Any other 4xx or 5xx code is a real regression and fails the run.
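The pass/fail rule can be summarised in a short sketch. It illustrates the rule as stated in this runbook rather than the harness's own code: `run_passes` is a hypothetical name, and `code_counts` is assumed to map each Problem code seen on an error response to its count, with successful responses left out.

```python
EXPECTED_REFUSALS = frozenset({
    "capacity_exceeded",
    "per_node_rate_limited",
    "per_domain_rate_limited",
    "session_limit_exceeded",
})


def run_passes(code_counts: dict, rate_sustained: bool) -> bool:
    """True when the rate held and every observed refusal code is expected."""
    unexpected = {
        code for code, count in code_counts.items()
        if count > 0 and code not in EXPECTED_REFUSALS
    }
    return rate_sustained and not unexpected


print(run_passes({"capacity_exceeded": 412}, rate_sustained=True))
# prints "True"
print(run_passes({"capacity_exceeded": 412, "internal_error": 3}, rate_sustained=True))
# prints "False"
```

The `internal_error` code in the second call is a made-up example of an unexpected code, not one the platform is known to emit.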
Not a CI gate. Full-scale load runs are deliberately not a blocking CI job: there is no in-pipeline provisioned deployment to drive, and a sustained-rate run is open-ended by design. Only the harness's own unit tests under tests/load run in make test; make load-test is operator-invoked against a real target.
Why these targets
The six numbers are design-phase orientation figures for a single deployment on the HA minimums, chosen so evaluators can judge order-of-magnitude fit for their fleet rather than as contractual SLAs. Phase-1 load-testing — driven by the harness above — may move individual targets up or down. The model is deliberately layered: exceeding a target is always an operator-visible signal (the Dashboard capacity view ratio plus the 80% audit crossing), but only the two hard-limited dimensions produce a structured capacity_exceeded refusal. Observation and refusal are separate contracts: everything is watched, only ceilings are enforced at the request edge.
See also
- Capacity HTTP API reference: the `GET /v1/domains/{domainId}/capacity` snapshot endpoint, its schema, and the 503-before-first-sample contract.
- The OpenAPI specification at `../../api/openapi/plexsphere-v1.yaml`: the wire contract for the snapshot operation and the `capacity_exceeded` Problem shape.
- `plexctl metrics query` and the Metrics and Logs query context: query the per-Domain metric series behind these capacity ratios over the read-only `GET /v1/domains/{domainId}/metrics/query` proxy, which bounds every query and injects the addressed Domain as the upstream tenant server-side.
- Failure modes and degradation: the degradation cousin of this runbook, covering how a single dependency outage surfaces a structured problem code instead of a generic 5xx.
- Multi-region runbook — region pinning and per-region ingress; capacity targets are per-Domain and therefore per-region.
- Disaster recovery — restoring the control plane after a larger-blast-radius event.
The Dashboard capacity view renders the live used / target ratio per dimension per Domain; it is the first place an operator sees a Domain approaching a target.