Cloud Credentials Custodian
This is the authoritative bounded-context reference for the Cloud Credentials Custodian that ships under internal/provisioning/cloudcredentials/. The custodian is the sub-context of plexsphere provisioning that owns the lifecycle of cloud-scoped secret material: issuance, rotation, revocation, and TTL-driven expiry. The Postgres broker row is the durable inventory; the secret bytes themselves live in OpenBao KV-v2 at a path the custodian derives deterministically from (cloud_id, cloud_credential_id). Every aggregate state change appends a typed domain event to the shared platform outbox inside the same Postgres transaction as the broker-row mutation, preserving the at-least-once delivery contract used elsewhere.
The custodian has no HTTP surface of its own. Callers reach it through the in-process Custodian facade; the assignment of a Cloud Credential to one or more Projects (the request/approve workflow) lives in the sibling credentialassignment sub-context, which owns its own HTTP surface. The closed Custodian port is the contract that lets every emitter remain unaware of which adapter is wired (internal/provisioning/cloudcredentials/ports.go).
Pages
This bounded-context reference is intentionally a single page. The custodian's surface area is narrower than identity or audit, and the ports / events / sentinels travel in lockstep — splitting them across files would force a reader to chase the same DECISION block through three pages. Cross-cutting code anchors are linked inline from each section.
Cross-references
- ../../contributing/layout.md: the bounded-context map row that locates internal/provisioning/cloudcredentials inside the codebase and enumerates the depguard rules that keep this module free of cross-context imports.
- ./credentials.md: the sibling OpenBao Credential Broker bounded context. The cloud-credentials custodian mirrors the broker's shape (Custodian / Materialiser / Repository / Sweeper) but is platform-scoped on Cloud rather than per-Project per-Domain. The DECISION block on internal/provisioning/cloudcredentials/doc.go enumerates the aggregate-keying trade-off.
- ./rebac.md: the Credential Assignment sub-context and its ReBAC contract. The request/approve/reject/revoke workflow that binds a cloud credential to a Project lives there; this custodian owns the credential aggregate itself, not its assignments.
- ../identity/tenancy.md: the Domain → Project → Resource → Node aggregate model. The cloud credentials custodian deliberately does NOT pivot on Project or Domain: it is platform-scoped on Cloud at the current milestone.
- ../../reference/provisioning/cloud-credential-pool.md: the per-port API reference, with field tables for IssueInput / RotateInput / RevokeInput / Material / CloudCredentialRow, the error-sentinel table, and the outbox-event schema table.
- ../../../internal/provisioning/cloudcredentials/doc.go: package-level pin of the ubiquitous language and the depguard rationale, including the DECISION block on why the aggregate is keyed on (cloud_id, credential_id) and platform-scoped.
- ../../../internal/platform/db/migrations/0022_cloud_credentials.sql: the persistence schema for plexsphere.cloud_credential and plexsphere.cloud_credential_outbox_token.
- ../../../tests/e2e/provisioning/credential-pool-rotation/chainsaw-test.yaml: Chainsaw e2e suite that issues + rotates a cloud credential against a real Postgres + OpenBao stack and asserts the (kv_version, version) pair advances and the outbox emits exactly one cloudcredentials.CloudCredentialRotated row.
Ubiquitous language
The terms below travel together across the Go code, the SQL migration, the outbox event payloads, the structured-log attributes, and operator-facing tooling. Names are preserved verbatim in error messages and outbox payloads so a reader chasing a string from a log line finds it in the source without translation.
| Term | Definition | Code anchor |
|---|---|---|
| CloudCredential | The aggregate root. One row in plexsphere.cloud_credential carrying (cloud_credential_id, cloud_id, display_name, kv_mount, kv_path, version, kv_version, expires_at, revoked_at, expired_at, created_at, updated_at). The shape is frozen — the custodian changes no field without a numbered migration. | internal/provisioning/cloudcredentials/ports.go, 0022_cloud_credentials.sql |
| CloudCredentialID | The 16-byte UUIDv7 primary key of the broker row. The String() projection is the canonical 8-4-4-4-12 hyphenated form so log lines from the custodian join cleanly with sibling contexts. The zero value is rejected by every aggregate invariant that requires a concrete reference. | internal/provisioning/cloudcredentials/ports.go |
| Material | The opaque secret value the caller hands the custodian on Issue and Rotate. Carries Payload []byte, a flat KeyValues map[string]string, and the custodian's TTL time.Duration from which expires_at is derived. Caller-owned bytes — the custodian MUST defensively copy both before any persistence call so subsequent caller mutation cannot reach the stored row. The type's String() method returns a redacted descriptor so accidental %v / %+v / slog formatting cannot leak secret bytes into a log surface. | internal/provisioning/cloudcredentials/ports.go |
| Custodian | The in-process facade through which application services interact with the cloud-credentials pool. Methods: Issue, Rotate, Revoke, Lookup. Implementations orchestrate the Repository, Materialiser, audit, and outbox in a single Postgres transaction; callers branch on the package's sentinel errors via errors.Is. | internal/provisioning/cloudcredentials/ports.go |
| Materialiser | The KV-v2 adapter port. Put writes data at /<mount>/data/<path> with a CAS expectation and returns the new KV-v2 version. Delete soft-deletes the secret. DerivePath returns the deterministic (mount, path) pair so the custodian, the audit log, and the Sweeper all agree on where a credential lives. The default in-package adapter fails closed (ErrMaterialiserUnavailable) on Put/Delete; the OpenBao-backed adapter ships under the cloudcredentials_openbao build tag. | internal/provisioning/cloudcredentials/materialiser/materialiser.go |
| Repository | The persistence port. Carries Create, FindByID, RotateCAS, Revoke, ListExpired, MarkExpired, AppendOutboxEvent, RecordOutboxToken, FindOutboxToken, and RunInTx. The Postgres adapter is a thin wrapper over the sqlc-generated queries; constraint-name dispatch maps SQLSTATE 23505 collisions and SQLSTATE 23503 FK violations onto the canonical sentinels. | internal/provisioning/cloudcredentials/ports.go, internal/provisioning/cloudcredentials/repo/credentials_pg.go |
| Sweeper | The steady-state TTL expiry worker. Run(ctx) walks Repository.ListExpired in pages, applies MarkExpired + AppendOutboxEvent + RecordOutboxToken per row inside one RunInTx, and flips the /readyz readiness flag on its first clean return. Boot-time Run gates /readyz; subsequent ticks honour Config.SweepInterval. | internal/provisioning/cloudcredentials/sweeper/sweeper.go |
| AuditSink | The audit-emission port the Custodian writes through (Record(ctx, entry) error). The shape mirrors internal/audit.Sink so the composition root wires a thin shim that translates AuditEntry into audit.Entry without pulling the audit package into this module. Failures are counted (AuditSinkFailuresTotal) and logged after-commit, not re-raised — the broker row + outbox event are durable by the time the audit hand-off runs. | internal/provisioning/cloudcredentials/ports.go, internal/provisioning/cloudcredentials/custodian.go |
| broker-row version | The CAS counter on the plexsphere.cloud_credential row. Incremented by every application-side mutation (issue / rotate / revoke / mark-expired). Repository writers pass an expected_version precondition; a mismatch returns ErrBrokerCASConflict rather than silently overwriting. | 0022_cloud_credentials.sql, E0_cloud_credentials.sql |
| kv_version | The OpenBao KV-v2 store's own monotonic per-path counter. Surfaces in bao kv metadata output. The custodian mirrors the post-write value on the broker row so operators can correlate the broker row against KV-v2 metadata without round-tripping to OpenBao. Distinct from the broker-row version because the remediation differs (see Error sentinels). | 0022_cloud_credentials.sql |
| lifecycle stamps | expires_at is the wall-clock instant the credential ceases to be valid. revoked_at is non-NULL once the operator has issued a soft-delete. expired_at is non-NULL once the Sweeper has observed the credential's expiry. The Sweeper's eligibility predicate is revoked_at IS NULL AND expired_at IS NULL AND expires_at <= now() and is backed by a partial index. A SQL CHECK gates the revoked_at / expired_at exclusivity. | 0022_cloud_credentials.sql |
| cloud_credential_outbox_token | The sibling table holding one row per (cloud_credential_id, event_type) pair under a PRIMARY KEY. Structural at-most-once guarantee for every custodian lifecycle event: a retried Revoke or a Sweeper-issued MarkExpired never appends a second CloudCredentialRevoked / CloudCredentialExpired row to the outbox. The event_type column carries a snake_case discriminator from the migration's CHECK allow-list (cloud_credential_issued / _rotated / _revoked / _expired); the in-Go mapping lives on events.TokenEventType. | 0022_cloud_credentials.sql, internal/provisioning/cloudcredentials/events/events.go |
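The Material row above describes two protective behaviours: a redacted String() and a defensive copy before persistence. The following is a minimal sketch of both, assuming the field names from the table; the clone helper and the exact redaction wording are hypothetical and are not taken from ports.go.

```go
package main

import (
	"fmt"
	"time"
)

// Material is an illustrative stand-in for the custodian's secret carrier.
// Field names follow the table above; the method bodies are assumptions.
type Material struct {
	Payload   []byte
	KeyValues map[string]string
	TTL       time.Duration
}

// String returns a redacted descriptor, so %v, %+v and %s never print secret bytes.
func (m Material) String() string {
	return fmt.Sprintf("Material{payload=<redacted %d bytes>, keys=%d, ttl=%s}",
		len(m.Payload), len(m.KeyValues), m.TTL)
}

// clone (hypothetical helper) deep-copies the payload and the map so that
// later caller mutation cannot reach the copy.
func (m Material) clone() Material {
	out := Material{
		Payload:   append([]byte(nil), m.Payload...),
		KeyValues: make(map[string]string, len(m.KeyValues)),
		TTL:       m.TTL,
	}
	for k, v := range m.KeyValues {
		out.KeyValues[k] = v
	}
	return out
}

func main() {
	in := Material{
		Payload:   []byte("s3cret"),
		KeyValues: map[string]string{"region": "eu"},
		TTL:       time.Hour,
	}
	stored := in.clone()

	// The caller mutates its own values after the hand-off.
	in.Payload[0] = 'X'
	in.KeyValues["region"] = "us"

	fmt.Printf("%+v\n", stored)                      // redacted descriptor only
	fmt.Println(string(stored.Payload) == "s3cret")  // true: the copy is isolated
	fmt.Println(stored.KeyValues["region"])          // eu
}
```

A value receiver on String() matters here: it makes both Material and *Material satisfy fmt.Stringer, so the redaction applies however the value is passed to a formatter.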
Aggregates
The CloudCredential aggregate is the only aggregate this context owns. The aggregate root is the plexsphere.cloud_credential row; the plexsphere.cloud_credential_outbox_token row is a structural at-most-once token, not a domain object — it carries no value the domain reasons about beyond the (cloud_credential_id, event_type) identity.
CloudCredential — invariants
The aggregate enforces six invariants that travel across the SQL schema, the Repository adapter, and the Custodian application service:
| Invariant | Layer | Failure mode |
|---|---|---|
| (kv_mount, kv_path) is unique. | SQL UNIQUE constraint cloud_credential_kv_path_unique. | ErrPathAlreadyMaterialised: a chosen-credential-id collision against an existing Cloud's path. |
| cloud_credential_id is unique. | SQL PRIMARY KEY. | ErrCloudCredentialAlreadyExists: a re-Issue with mismatched fields. |
| cloud_id references an existing row in plexsphere.clouds(id). | SQL FOREIGN KEY with ON DELETE RESTRICT. | ErrCloudNotFound on an absent FK target; a Cloud delete observes CloudNotEmptyError while at least one non-expired credential references it. |
| revoked_at IS NULL OR expired_at IS NULL. | SQL CHECK. | The CHECK fires on the corner case where both stamps are non-null; in normal operation the application transitions through one terminal stamp. |
| kv_path is the deterministic projection of (cloud_id, cloud_credential_id). | Application service (Materialiser.DerivePath) + persistence (broker row INSERT). | Drift between the two surfaces produces a chain-of-custody mismatch the Sweeper would detect on the next pass. |
| Issue / Rotate / Revoke / MarkExpired runs inside one Postgres transaction with the matching outbox append and token record. | Application service (Custodian + RunInTx). | A partial commit would either leak a KV-v2 row without a broker row (ErrIssueAtomicityViolated) or commit a broker row without an outbox event (the at-least-once contract would degrade to no-shot). |
CloudCredential — state machine
The CloudCredential is a three-stage terminal aggregate. Once the row enters either terminal stamp (revoked_at or expired_at) no further rotation is possible — operators re-issue rather than resurrect.
```text
┌──────────────────────────────────────────────────────────┐
│ Issue                                                    │
│  · Mint CloudCredentialID (UUIDv7)                       │
│  · Materialiser.DerivePath → (mount, path)               │
│  · Materialiser.Put (KV-v2 CAS=0, first write)           │
│  · INSERT plexsphere.cloud_credential                    │
│  · AppendOutboxEvent (CloudCredentialIssued)             │
│  · RecordOutboxToken (cloud_credential_issued)           │
│  · compensating Materialiser.Delete on tx rollback       │
└──────────────────┬───────────────────────────────────────┘
                   │ version = 1, kv_version = 1
                   ▼
┌──────────────────────────────────────────────────────────┐
│ Active                                                   │
│  · Rotate → version'    = version + 1                    │
│             kv_version' = kv_version + 1                 │
│  · Lookup → SELECT row                                   │
└────────┬───────────────────────────────────┬─────────────┘
         │ Revoke                            │ Sweeper.Run + expires_at ≤ now
         ▼                                   ▼
┌─────────────────────┐              ┌─────────────────────┐
│ Revoked (terminal)  │              │ Expired (terminal)  │
│   revoked_at != NULL│              │   expired_at != NULL│
│   CloudCredentialRev│              │   CloudCredentialExp│
└─────────────────────┘              └─────────────────────┘
```
Ports
The cloudcredentials package declares four ports (internal/provisioning/cloudcredentials/ports.go) plus an AuditSinkPort alias declared in the Sweeper. Each port is the narrow seam through which the bounded context reaches its collaborators; the depguard no-cross-context-imports rule keeps the cloudcredentials module free of direct imports of internal/identity, internal/audit, or internal/platform/secretstore.
| Port | Methods | Adapter | Test seam |
|---|---|---|---|
| Custodian | Issue, Rotate, Revoke, Lookup | concrete *custodian in custodian.go; wired into the production binary by cmd/plexsphere/cloudcredentials_factory_prod.go (composition root). NewCustodian refuses construction on any nil load-bearing port (Repository / Materialiser / AuditSink), so an operator-misconfigured composition fails at build time, not on the first Issue. | unit-level fakes in custodian_test.go cover Issue / Rotate / Revoke / Lookup including the compensating-delete and CAS-conflict branches; integration-level tests at tests/integration/cloudcredentials_pool_*_test.go drive the production Custodian against a real Postgres + OpenBao stack via the testcontainers fixtures. |
| Materialiser | Put, Delete, DerivePath | default fail-closed stub in materialiser/materialiser.go; real OpenBao adapter under //go:build cloudcredentials_openbao in materialiser/openbao/openbao.go. The OpenBao adapter wraps *secretstore.Client.KVPut / KVDelete and translates SDK CAS-mismatch wording onto cloudcredentials.ErrKVStoreCASConflict. | DerivePath is pure logic, so the unit tests exercise it without a fake (TestDerivePathRejectsZeroUUID, TestDerivePathFormat); the build-tagged adapter has its own unit tests at materialiser/openbao/openbao_test.go; integration coverage runs against a real OpenBao container. |
| Repository | Create, FindByID, RotateCAS, Revoke, ListExpired, MarkExpired, AppendOutboxEvent, RecordOutboxToken, FindOutboxToken, RunInTx | Postgres in repo/credentials_pg.go, wrapping the sqlc-generated Queries from E0_cloud_credentials.sql. Constraint-name dispatch maps cloud_credential_pkey to ErrCloudCredentialAlreadyExists, cloud_credential_kv_path_unique to ErrPathAlreadyMaterialised, and a 23503 FK violation on cloud_id to ErrCloudNotFound. | the unit tests inject a fake sqlc layer that asserts each constraint-name dispatch path; the build-tagged repo/credentials_pg_test.go integration tests pin Create idempotency, RotateCAS happy + ErrBrokerCASConflict, Revoke idempotency, and ListExpired pagination. |
| AuditSink | Record(ctx, AuditEntry) → error | wired via a one-method pass-through shim in the composition root that translates cloudcredentials.AuditEntry into audit.Entry. The Sweeper holds its own AuditSinkPort alias with the same shape (sweeper.AuditEntry / sweeper.AuditSinkPort) so the sub-package stays free of an internal/audit import. | the Custodian unit tests use an in-memory recording sink; the integration tests reuse the same sink shape for emission-ordering assertions. |
| Sweeper | Run | concrete *Sweeper in sweeper/sweeper.go; ProbeFunc is the /readyz mount, exported through the composition root as CloudCredentialsWiring.SweepProbe. | unit tests use a fixed Clock and an in-memory Repository; the boot-probe + ticker contracts are pinned by tests/integration/cloudcredentials_pool_sweeper_test.go and the app-level boot test at cmd/plexsphere/app_cloudcredentials_wiring_test.go. |
The Clock port is a convenience seam used by both the Custodian and the Sweeper; it lives in the same ports.go file so a reviewer auditing the bounded-context surface finds every external dependency in one place. There is no DomainResolver port — the CloudCredential aggregate is platform-scoped on (cloud_id, credential_id) (see DECISION block in doc.go).
Outbox events
The custodian emits four typed domain events to plexsphere.outbox_events inside the same Postgres transaction as the broker-row mutation. Event types are written to the event_type column verbatim — the string form is part of the wire contract once a row has been emitted. Each event row carries aggregate_type = "cloud_credential" so the relay routes cloud-credentials rows distinctly from sibling contexts.
| Event type (column value) | Trigger | Payload struct |
|---|---|---|
| cloudcredentials.CloudCredentialIssued | Custodian.Issue succeeds. | events.CloudCredentialIssued |
| cloudcredentials.CloudCredentialRotated | Custodian.Rotate succeeds (CAS-protected). | events.CloudCredentialRotated |
| cloudcredentials.CloudCredentialRevoked | Custodian.Revoke succeeds. Carries the operator-supplied Reason. | events.CloudCredentialRevoked |
| cloudcredentials.CloudCredentialExpired | Sweeper.Run observes a row with expires_at <= now() AND revoked_at IS NULL AND expired_at IS NULL. | events.CloudCredentialExpired |
Every payload carries a UUIDv7 EventID, a UTC OccurredAt timestamp, and the relevant credential identity. The event_type set is closed — adding a fifth value is a breaking schema change, not a switch-statement extension. The package-local drift gate TestEventTypesAreClosedSet enforces the four-event allow-list. Field-level shapes are pinned in the reference page.
At-most-once via cloud_credential_outbox_token
The sibling plexsphere.cloud_credential_outbox_token table holds one row per (cloud_credential_id, event_type) pair under a PRIMARY KEY. Each custodian transition (issued / rotated / revoked / expired) MUST emit the corresponding outbox event at most once per credential. The PRIMARY KEY is the smallest invariant that closes the door at the storage layer: when a retried Revoke or a Sweeper-issued MarkExpired repeats the token INSERT, it trips a unique violation; the application catches it and skips the outbox append while leaving the broker row in its terminal state. This is the structural guarantee that tightens the at-least-once outbox contract to effectively exactly-once for every custodian lifecycle event; see the file-header DECISION block on 0022_cloud_credentials.sql for the full rationale.
The event_type column on the token table is a snake_case discriminator gated by a CHECK allow-list (cloud_credential_issued / _rotated / _revoked / _expired). The in-Go mapping from the outbox event_type literal to the token-table discriminator lives on events.TokenEventType; the mapping is closed and pinned by the events package unit tests.
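The token gate can be illustrated with an in-memory stand-in. The sketch below assumes a map in place of the PRIMARY KEY and a slice in place of the outbox table; the outbox type and emitOnce are hypothetical, and the mapping table only mimics what the text says lives on events.TokenEventType.

```go
package main

import "fmt"

// tokenEventType maps the outbox event_type literal to the token-table
// discriminator. The shape of the real mapping is an assumption.
var tokenEventType = map[string]string{
	"cloudcredentials.CloudCredentialIssued":  "cloud_credential_issued",
	"cloudcredentials.CloudCredentialRotated": "cloud_credential_rotated",
	"cloudcredentials.CloudCredentialRevoked": "cloud_credential_revoked",
	"cloudcredentials.CloudCredentialExpired": "cloud_credential_expired",
}

type tokenKey struct{ credentialID, eventType string }

// outbox is an in-memory stand-in for the two tables.
type outbox struct {
	tokens map[tokenKey]struct{} // plays the role of the PRIMARY KEY
	events []string              // appended outbox rows
}

func newOutbox() *outbox { return &outbox{tokens: map[tokenKey]struct{}{}} }

// emitOnce records the token first and appends the event only when the token
// is new. It reports whether an outbox row was appended.
func (o *outbox) emitOnce(credentialID, eventType string) (bool, error) {
	disc, ok := tokenEventType[eventType]
	if !ok {
		return false, fmt.Errorf("event type %q is outside the closed set", eventType)
	}
	key := tokenKey{credentialID, disc}
	if _, dup := o.tokens[key]; dup {
		return false, nil // the "unique violation": skip the append
	}
	o.tokens[key] = struct{}{}
	o.events = append(o.events, eventType)
	return true, nil
}

func main() {
	o := newOutbox()
	first, _ := o.emitOnce("cred-1", "cloudcredentials.CloudCredentialRevoked")
	retry, _ := o.emitOnce("cred-1", "cloudcredentials.CloudCredentialRevoked")
	fmt.Println(first, retry, len(o.events)) // true false 1
}
```

In the real system both writes happen inside one Postgres transaction, which is what makes the token and the event appear or vanish together; the in-memory version cannot show that part.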
Error sentinels
Every operation funnels through one of the package-local sentinels. Callers branch on these via errors.Is — wrapping is fine, identity must remain intact. The set is closed: adding a twelfth sentinel without updating errors_test.go trips the TestErrors_AreClosedSet drift gate at build time.
| Sentinel | Layer | Trigger | Remediation |
|---|---|---|---|
| ErrCloudCredentialNotFound | Repository / Custodian | FindByID / Lookup for an absent CloudCredentialID. | Re-Lookup against the broker inventory or accept the row is gone; the custodian has no HTTP surface so there is no transport mapping today. |
| ErrCloudCredentialAlreadyExists | Repository (constraint-name dispatch on cloud_credential_pkey PRIMARY KEY) | A second Issue for the same id with mismatched fields. | Caller treats the row as established; the custodian is idempotent on the credential id. |
| ErrPathAlreadyMaterialised | Repository (constraint-name dispatch on cloud_credential_kv_path_unique UNIQUE on (kv_mount, kv_path)) | A chosen-credential-id collision against an existing Cloud's deterministic path. | The deterministic path derivation includes the Cloud UUID prefix, so a cross-Cloud collision is structurally impossible; the sentinel exists for the chosen-collision attack scenario and surfaces it loudly, distinct from ErrCloudCredentialAlreadyExists. |
| ErrCloudNotFound | Repository (constraint-name dispatch on SQLSTATE 23503 FK violation on cloud_id) | Issue against a Cloud row that does not exist (or has been deleted). | Caller surfaces the missing Cloud upstream; no broker row, no KV-v2 write, no outbox event lands. |
| ErrCloudCredentialRevoked | Custodian | Rotate on a row whose revoked_at is non-null. | Re-issue rather than resurrect; the aggregate is terminal. |
| ErrBrokerCASConflict | Repository.RotateCAS | Broker-row version has advanced past the caller's ExpectedVersion. | Re-Lookup to observe the winning rotation, then retry with the new version. |
| ErrKVStoreCASConflict | Materialiser.Put | KV-v2 store's own version has advanced past the broker's expected kv_version. | Distinct from ErrBrokerCASConflict: the KV-v2 store is out of sync with the broker row. Caller escalates the reconciliation incident; this is not a normal contention loss. |
| ErrMaterialiserUnavailable | Materialiser.Put / Delete | Non-CAS KV-v2 failure (network, transport timeout, OpenBao unsealed-but-blocked) and the in-package fail-closed stub's posture when no OpenBao adapter is wired. | Fail closed: writing the broker row without a KV-v2 record would leave the platform with a credential id resolving to nothing. |
| ErrIssueAtomicityViolated | Custodian | Compensating Materialiser.Delete fires after a Postgres rollback AND the delete itself also fails, leaving the KV-v2 row orphaned. | The custodian logs and re-raises with the original tx error AND the compensating-delete error joined via errors.Join; operators reconcile the orphaned KV-v2 row. |
| ErrAuditUnavailable | Custodian (counter only) | AuditSink.Record fails after the custodian has committed its Postgres transaction. The custodian decision is durable but the audit chain has gapped. | Operators alert on AuditSinkFailuresTotal() (the process-wide counter) and chase down the audit-side outage. |
| ErrInvalidPathInput | Repository / Materialiser | kv_path or kv_mount violates the path-format invariant before the SQL UNIQUE/NOT NULL gate runs (typically zero-UUID input to DerivePath returning the empty pair). | Programmer error; surfaces in tests, not in production. |
Unlike the OpenBao Credential Broker, the cloud-credentials custodian has no ErrDomainUnresolved sentinel. The aggregate is platform-scoped on Cloud — there is no DomainResolver port and no per-Domain residency pivot in the current shape. The DECISION block on doc.go explains the trade-off: pre-committing to a project_id / domain_id pair here would have forced a NULL-as-sentinel hazard in the schema and an unused DomainResolver port in the application layer. The request/approve workflow instead lives in the sibling credentialassignment sub-context with its own table — see ./rebac.md.
Sweeper cadence
The Sweeper runs on two cadences, both driven by the production composition root in cmd/plexsphere/cloudcredentials_factory_prod.go:
- Boot Run: synchronous, inside the shared reconcile-probe boot timeout horizon. The first successful pass flips the *Sweeper.ready atomic so ProbeFunc returns nil and /readyz turns green. Until then ProbeFunc returns the errProbePending sentinel and the orchestrator drains traffic. The probe is registered under the canonical name cloud-credentials-sweeper; dashboards and runbooks grep for that string verbatim.
- Steady-state ticker: time.NewTicker(Config.SweepInterval). The default is 30 s (defaultCloudCredentialsSweepInterval; PLEXSPHERE_CLOUD_CREDENTIALS_SWEEP_INTERVAL overrides it). Per-tick errors are logged and the next tick retries; the ticker honours ctx.Err() on every iteration. Each tick re-runs Sweeper.Run, which is idempotent and self-terminating.
The boot probe and the ticker share one *Sweeper instance. ProbeFunc does NOT re-trigger Run: the bootstrap reconcile probe contract is that the probe re-runs the closure the composition root threads in. The DECISION block on ProbeFunc (sweeper.go) explains why a self-driving probe would double-count plexsphere_cloud_credentials_sweeper_invocations_total on every /readyz scrape.
Metrics
The Sweeper exports two zero-value-tolerant counter vectors via metrics.go:
| Metric | Type | Increments when |
|---|---|---|
| plexsphere_cloud_credentials_sweeper_invocations_total | counter | Sweeper.Run is entered (per call). |
| plexsphere_cloud_credentials_sweeper_expirations_total | counter | A row is successfully MarkExpired and the matching CloudCredentialExpired outbox row is appended. |
A nil prometheus.Registerer (WithRegisterer(nil) or no option) keeps the counters in zero-value mode — Run still walks the in-memory loop, but no scrape surface is registered. This is the deliberate posture for unit tests so registry collisions across parallel test runs cannot trip a global registry shared by production code paths. The prometheus.AlreadyRegisteredError branch reuses the existing collector so multiple service instances sharing one registry (typical in tests) do not panic on the second registration.
KV-v2 path derivation
The custodian, the audit projector, and the Sweeper all agree on where a credential lives by computing its path from the same deterministic helper. The helper lives in materialiser/materialiser.go and is the only path-shaping logic in the bounded context — every caller that needs a KV-v2 location reaches DerivePath rather than re-rolling the format.
```text
mount = Config.KVMount   (verbatim, e.g. "kv")
path  = "clouds/<cloudID>/credentials/<credentialID>"
```
<cloudID> and <credentialID> are the canonical 8-4-4-4-12 hyphenated UUID textual form, matching cloudcredentials.CloudCredentialID.String(). The OpenBao adapter applies the KV-v2 layout on top, so the data row for a credential lives at /<mount>/data/<path> and the metadata row at /<mount>/metadata/<path>. That extra layer is the OpenBao adapter's concern: DerivePath returns only the logical (mount, path) pair so the audit projector can compute it without importing the OpenBao client (and the depguard rules hold).
The format is part of the wire contract: once the custodian has issued a credential, its KV-v2 location is stable for the lifetime of the credential. Changing the format is a migration, not a refactor.
Zero-UUID handling
Both cloudID and credentialID must be non-zero. The DECISION block next to (*Materialiser).DerivePath (materialiser.go) documents why a zero argument returns the empty (mount, path) pair rather than panicking: the custodian's transactional path detects the empty pair and short-circuits with ErrInvalidPathInput before opening the Postgres transaction. The empty-pair return is a value that fits the existing failure surface and the unit test TestDerivePathRejectsZeroUUID pins this behaviour.
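The derivation and its zero-UUID behaviour can be sketched as below. The uuid type, derivePath, and locate are hypothetical stand-ins written for this page; only the path format, the empty-pair return, and the ErrInvalidPathInput name come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidPathInput is named in the text; the message is illustrative.
var ErrInvalidPathInput = errors.New("cloudcredentials: invalid path input")

// uuid is a minimal 16-byte stand-in for the real ID types.
type uuid [16]byte

// String renders the canonical 8-4-4-4-12 hyphenated form.
func (u uuid) String() string {
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

// derivePath returns the logical (mount, path) pair, or the empty pair when
// either identifier is the zero UUID. It never panics.
func derivePath(mount string, cloudID, credentialID uuid) (string, string) {
	var zero uuid
	if cloudID == zero || credentialID == zero {
		return "", ""
	}
	return mount, "clouds/" + cloudID.String() + "/credentials/" + credentialID.String()
}

// locate shows the caller-side short-circuit: an empty pair becomes
// ErrInvalidPathInput before any transaction would be opened.
func locate(mount string, cloudID, credentialID uuid) (string, string, error) {
	m, p := derivePath(mount, cloudID, credentialID)
	if m == "" || p == "" {
		return "", "", ErrInvalidPathInput
	}
	return m, p, nil
}

func main() {
	cloud, cred := uuid{0x01}, uuid{0x02}
	fmt.Println(locate("kv", cloud, cred))
	fmt.Println(locate("kv", uuid{}, cred))
}
```

Returning a value instead of panicking keeps the failure on the existing error path, so the caller can treat a zero identifier like any other rejected input.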
Operational model
The custodian is opt-in at the composition root. The PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT env var is the activation gate; when empty the binary keeps the custodian stub and the Sweeper does not run. The DECISION block on productionCloudCredentialsConfigFromEnv (cloudcredentials_factory_prod.go) explains why the custodian activates on KV mount rather than DSN: the external dependency (OpenBao KV-v2) is more sensitive than Postgres, and a half-wired KV path would write secret material to a place the operator did not authorise.
| Env var | Effect | Default |
|---|---|---|
| PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT | Custodian activation gate. Empty = inert. Non-empty + empty PLEXSPHERE_DSN = build-time error. | "" (inert). |
| PLEXSPHERE_DSN | The Postgres connection string the custodian shares with the rest of the platform. Required when PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT is set. | "". |
| PLEXSPHERE_CLOUD_CREDENTIALS_OPENBAO_ADDRESS | The OpenBao cluster URL the secretstore client dials. Held on the config so a future env-driven step can build the *secretstore.Client from (KVAddress, KVAuth). | "". |
| PLEXSPHERE_CLOUD_CREDENTIALS_SWEEP_INTERVAL | Steady-state ticker period. Parsed with time.ParseDuration; non-positive is rejected at build time. | 30s. |
| PLEXSPHERE_CLOUD_CREDENTIALS_ALLOW_INSECURE_MATERIALISER | Opts the binary into the in-package Materialiser stub when no *secretstore.Client is wired. For early-boot dev clusters only. | false. |
Build-time validation refuses construction on the ErrCloudCredentialsKVMountRequired sentinel when DSN is set but KVMount is empty, surfacing a misconfigured operator before /readyz lights up green. The trade-off mirrors BuildProductionCredentialsFactory. The wiring receipt tests/workspace/cloudcredentials_factory_wiring_receipt_test.go asserts the factory is plumbed through cmd/plexsphere/main.go exactly once so a future refactor cannot silently disconnect the custodian from Run.
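The activation rules in the env-var table can be sketched as a small parser. This is a hedged illustration that follows the table rows (empty KV mount means inert; a KV mount without a DSN is an error; a non-positive sweep interval is rejected). The config struct, configFromEnv, and errDSNRequired are hypothetical names and do not reproduce productionCloudCredentialsConfigFromEnv.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const defaultSweepInterval = 30 * time.Second

// errDSNRequired is a hypothetical sentinel for this sketch only.
var errDSNRequired = errors.New("PLEXSPHERE_DSN is required when the KV mount is set")

type config struct {
	Enabled       bool
	KVMount       string
	DSN           string
	SweepInterval time.Duration
}

// configFromEnv takes the lookup function as a parameter so it can be
// exercised without touching the process environment.
func configFromEnv(getenv func(string) string) (config, error) {
	mount := getenv("PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT")
	if mount == "" {
		return config{}, nil // inert: custodian stub, no Sweeper
	}
	dsn := getenv("PLEXSPHERE_DSN")
	if dsn == "" {
		return config{}, errDSNRequired
	}
	interval := defaultSweepInterval
	if raw := getenv("PLEXSPHERE_CLOUD_CREDENTIALS_SWEEP_INTERVAL"); raw != "" {
		d, err := time.ParseDuration(raw)
		if err != nil {
			return config{}, fmt.Errorf("sweep interval: %w", err)
		}
		if d <= 0 {
			return config{}, fmt.Errorf("sweep interval must be positive, got %s", d)
		}
		interval = d
	}
	return config{Enabled: true, KVMount: mount, DSN: dsn, SweepInterval: interval}, nil
}

func main() {
	env := map[string]string{
		"PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT":       "kv",
		"PLEXSPHERE_DSN":                              "postgres://localhost/plexsphere",
		"PLEXSPHERE_CLOUD_CREDENTIALS_SWEEP_INTERVAL": "45s",
	}
	cfg, err := configFromEnv(func(k string) string { return env[k] })
	fmt.Println(cfg.Enabled, cfg.SweepInterval, err) // true 45s <nil>
}
```

In production the lookup function would be os.Getenv; the map-backed closure is used here only to keep the example deterministic.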
Threat model
The custodian shoulders the credential lifecycle for cloud-scoped secrets. Four attacker shapes drive its defences:
Attacker A — concurrent rotations / split-brain rotation
Two callers race a Rotate on the same credential. Without protection the loser would either overwrite the winner's KV-v2 write or persist a broker row whose kv_version no longer matches the live secret.
Defence. The custodian carries two distinct CAS counters:
- version on the broker row: RotateCAS runs UPDATE … WHERE cloud_credential_id = $1 AND version = $expected so the loser observes zero rows updated and surfaces ErrBrokerCASConflict.
- kv_version on the KV-v2 store: Materialiser.Put passes the expected version as the OpenBao CAS argument; a CAS mismatch surfaces ErrKVStoreCASConflict.
The two sentinels are deliberately distinct so the remediation differs: a broker CAS miss means another rotate already won and the caller should re-Lookup; a KV CAS miss means the KV-v2 store is out of sync with the broker row and the caller should escalate the reconciliation incident. The integration test cloudcredentials_pool_rotate_test.go asserts the two are distinguishable via errors.Is.
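The broker-row half of the defence can be modelled in memory. In this sketch a mutex-guarded map plays the role of the conditional UPDATE; the store type and rotateCAS are hypothetical, and the KV-v2 write is reduced to a counter bump so the example stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Sentinel names follow the document; the messages are illustrative.
var (
	ErrBrokerCASConflict       = errors.New("cloudcredentials: broker CAS conflict")
	ErrCloudCredentialNotFound = errors.New("cloudcredentials: credential not found")
)

type row struct{ Version, KVVersion int64 }

// store is an in-memory stand-in for the broker table.
type store struct {
	mu   sync.Mutex
	rows map[string]row
}

// rotateCAS mirrors "UPDATE ... WHERE id = $1 AND version = $expected":
// a stale expectation updates nothing and surfaces the CAS sentinel.
func (s *store) rotateCAS(id string, expectedVersion int64) (row, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[id]
	if !ok {
		return row{}, ErrCloudCredentialNotFound
	}
	if r.Version != expectedVersion {
		return row{}, ErrBrokerCASConflict
	}
	r.Version++
	r.KVVersion++
	s.rows[id] = r
	return r, nil
}

func main() {
	s := &store{rows: map[string]row{"cred-1": {Version: 1, KVVersion: 1}}}

	// Two callers race with the same expected version.
	results := make([]error, 2)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_, results[i] = s.rotateCAS("cred-1", 1)
		}(i)
	}
	wg.Wait()

	wins, conflicts := 0, 0
	for _, err := range results {
		switch {
		case err == nil:
			wins++
		case errors.Is(err, ErrBrokerCASConflict):
			conflicts++
		}
	}
	fmt.Println(wins, conflicts, s.rows["cred-1"].Version) // 1 1 2
}
```

Whichever goroutine acquires the lock first wins; the other sees version 2 against its expectation of 1, so the outcome is one win and one conflict regardless of scheduling.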
Attacker B — chosen-credential-id path collision
A caller crafts a credential UUID that, after deterministic path derivation, collides with an existing credential in another Cloud. Without the UNIQUE on (kv_mount, kv_path) the second Issue would either overwrite the first credential's broker row or write to the same KV-v2 path.
Defence. The deterministic path includes the Cloud UUID prefix (clouds/<cloudID>/credentials/<credentialID>), so any cross-Cloud collision is structurally impossible. The PRIMARY KEY on cloud_credential_id gates the within-Cloud case. A constraint-name dispatch in the Repository adapter maps the SQLSTATE 23505 collision on cloud_credential_kv_path_unique to ErrPathAlreadyMaterialised so a second Issue surfaces a typed error rather than corrupting the inventory.
Attacker C — orphaned Cloud reference
A caller attempts to issue a credential against a Cloud that does not exist (or has been deleted concurrently), or an operator attempts to delete a Cloud while non-expired credentials still reference it. Without the FK + ON DELETE RESTRICT pair the broker inventory would carry rows whose cloud_id resolves to nothing — a forensic gap that makes attribution unreliable.
Defence. The schema declares cloud_id REFERENCES plexsphere.clouds(id) ON DELETE RESTRICT. An Issue against an absent Cloud surfaces SQLSTATE 23503 and the Repository adapter maps it to ErrCloudNotFound — no broker row, no KV-v2 write, no outbox event. A Cloud delete that observes at least one non-expired credential surfaces CloudNotEmptyError with ChildCounts.CloudCredentials > 0 from the count_cloud_credentials_for() helper, so operators see exactly which Cloud is non-empty before draining it. The integration test cloud_delete_blocked_by_credential_test.go asserts both directions: a Cloud with a live credential cannot be deleted, and a Cloud whose credentials have all been revoked + expired can be deleted again.
Attacker D — orphaned KV-v2 row
The Postgres transaction that wraps Issue / Rotate fails to commit after the Materialiser.Put already wrote to KV-v2. Without a compensating delete the platform would carry a KV-v2 row that no broker row references — exactly the forensic gap that makes custodian operations untrustworthy.
Defence. Custodian defers a compensating Materialiser.Delete that runs on Postgres rollback only (Issue path). The compensating delete is best-effort: a transport failure during the compensation itself surfaces ErrIssueAtomicityViolated so an operator can reconcile the orphan from the structured log. The DECISION block on the Custodian explains the loud-but-non-fatal posture — the custodian prefers to surface the orphan over papering it over with silent retry logic. Rotate does NOT fire a compensating Delete on rollback because the prior KV-v2 version is still durable and is the value the rolled-back broker row points at; the same DECISION block enumerates the rotation-specific trade-off.
Out of scope
- Per-Domain audit residency. The custodian is platform-scoped on Cloud; the per-Domain audit hash chain that the OpenBao Credential Broker mirrors lives in the credentials sibling context and is not replicated here. The DECISION block on doc.go documents the trade-off.
- Plaintext secret return. The custodian stores caller-supplied Material via the KV-v2 adapter; it does not generate secret material itself. Generation lives in the caller (the cloud consumer's own provisioning logic).
- Rotation cadence. The custodian enforces CAS but does not run scheduled rotations. The Sweeper is for TTL expiry, not for rotation.
- HTTP surface. The custodian has no /v1/cloud-credentials* endpoint. A workspace-level drift gate scans the generated OpenAPI document for any /v1/cloud-credentials* path and asserts the count is zero so a future change cannot accidentally widen the custodian's blast radius onto the public HTTP surface.
- Project assignment. The request/approve workflow that binds a cloud credential to one or more Projects lives in the sibling credentialassignment sub-context, not in this custodian. It does not add a project_id column to the cloud_credential table; the binding is a row in the separate cloud_credential_assignment table plus a ReBAC cloudcredential#uses tuple. See ./rebac.md for the assignment lifecycle and the ReBAC contract.