
Cloud Credentials Custodian

This is the authoritative bounded-context reference for the Cloud Credentials Custodian that ships under internal/provisioning/cloudcredentials/. The custodian is the sub-context of plexsphere provisioning that owns the lifecycle of cloud-scoped secret material: issuance, rotation, revocation, and TTL-driven expiry. The Postgres broker row is the durable inventory; the secret bytes themselves live in OpenBao KV-v2 at a path the custodian derives deterministically from (cloud_id, cloud_credential_id). Every aggregate state change appends a typed domain event to the shared platform outbox inside the same Postgres transaction as the broker-row mutation, preserving the at-least-once delivery contract used elsewhere.

The custodian has no HTTP surface of its own. Callers reach it through the in-process Custodian facade; the assignment of a Cloud Credential to one or more Projects (the request/approve workflow) lives in the sibling credentialassignment sub-context, which owns its own HTTP surface. The closed Custodian port is the contract that lets every emitter remain unaware of which adapter is wired (internal/provisioning/cloudcredentials/ports.go).

Pages

This bounded-context reference is intentionally a single page. The custodian's surface area is narrower than identity or audit, and the ports / events / sentinels travel in lockstep — splitting them across files would force a reader to chase the same DECISION block through three pages. Cross-cutting code anchors are linked inline from each section.

Cross-references

  • ../../contributing/layout.md — the bounded-context map row that locates internal/provisioning/cloudcredentials inside the codebase and enumerates the depguard rules that keep this module free of cross-context imports.
  • ./credentials.md — the sibling OpenBao Credential Broker bounded context. The cloud-credentials custodian mirrors the broker's shape (Custodian / Materialiser / Repository / Sweeper) but is platform-scoped on Cloud rather than per-Project per-Domain. The DECISION block on internal/provisioning/cloudcredentials/doc.go enumerates the aggregate-keying trade-off.
  • ./rebac.md — the Credential Assignment sub-context and its ReBAC contract. The request/approve/reject/revoke workflow that binds a cloud credential to a Project lives there; this custodian owns the credential aggregate itself, not its assignments.
  • ../identity/tenancy.md — the Domain → Project → Resource → Node aggregate model. The cloud credentials custodian deliberately does NOT pivot on Project or Domain: it is platform-scoped on Cloud at the current milestone.
  • ../../reference/provisioning/cloud-credential-pool.md — the per-port API reference: field tables for IssueInput / RotateInput / RevokeInput / Material / CloudCredentialRow, the error-sentinel table, and the outbox-event schema table.
  • ../../../internal/provisioning/cloudcredentials/doc.go — package-level pin of the ubiquitous language and the depguard rationale, including the DECISION block on why the aggregate is keyed on (cloud_id, credential_id) and platform-scoped.
  • ../../../internal/platform/db/migrations/0022_cloud_credentials.sql — the persistence schema for plexsphere.cloud_credential and plexsphere.cloud_credential_outbox_token.
  • ../../../tests/e2e/provisioning/credential-pool-rotation/chainsaw-test.yaml — Chainsaw e2e suite that issues + rotates a cloud credential against a real Postgres + OpenBao stack and asserts the (kv_version, version) pair advances and the outbox emits exactly one cloudcredentials.CloudCredentialRotated row.

Ubiquitous language

The terms below travel together across the Go code, the SQL migration, the outbox event payloads, the structured-log attributes, and operator-facing tooling. Names are preserved verbatim in error messages and outbox payloads so a reader chasing a string from a log line finds it in the source without translation.

| Term | Definition | Code anchor |
| --- | --- | --- |
| CloudCredential | The aggregate root. One row in plexsphere.cloud_credential carrying (cloud_credential_id, cloud_id, display_name, kv_mount, kv_path, version, kv_version, expires_at, revoked_at, expired_at, created_at, updated_at). The shape is frozen: the custodian changes no field without a numbered migration. | internal/provisioning/cloudcredentials/ports.go, 0022_cloud_credentials.sql |
| CloudCredentialID | The 16-byte UUIDv7 primary key of the broker row. The String() projection is the canonical 8-4-4-4-12 hyphenated form so log lines from the custodian join cleanly with sibling contexts. The zero value is rejected by every aggregate invariant that requires a concrete reference. | internal/provisioning/cloudcredentials/ports.go |
| Material | The opaque secret value the caller hands the custodian on Issue and Rotate. Carries Payload []byte, a flat KeyValues map[string]string, and a TTL time.Duration from which expires_at is derived. The bytes are caller-owned, so the custodian MUST defensively copy both Payload and KeyValues before any persistence call so that subsequent caller mutation cannot reach the stored row. The type's String() method returns a redacted descriptor so accidental %v / %+v / slog formatting cannot leak secret bytes into a log surface. | internal/provisioning/cloudcredentials/ports.go |
| Custodian | The in-process facade through which application services interact with the cloud-credentials pool. Methods: Issue, Rotate, Revoke, Lookup. Implementations orchestrate the Repository, Materialiser, audit, and outbox in a single Postgres transaction; callers branch on the package's sentinel errors via errors.Is. | internal/provisioning/cloudcredentials/ports.go |
| Materialiser | The KV-v2 adapter port. Put writes data at /<mount>/data/<path> with a CAS expectation and returns the new KV-v2 version. Delete soft-deletes the secret. DerivePath returns the deterministic (mount, path) pair so the custodian, the audit log, and the Sweeper all agree on where a credential lives. The default in-package adapter fails closed (ErrMaterialiserUnavailable) on Put/Delete; the OpenBao-backed adapter ships under the cloudcredentials_openbao build tag. | internal/provisioning/cloudcredentials/materialiser/materialiser.go |
| Repository | The persistence port. Carries Create, FindByID, RotateCAS, Revoke, ListExpired, MarkExpired, AppendOutboxEvent, RecordOutboxToken, FindOutboxToken, and RunInTx. The Postgres adapter is a thin wrapper over the sqlc-generated queries; constraint-name dispatch maps SQLSTATE 23505 collisions and SQLSTATE 23503 FK violations onto the canonical sentinels. | internal/provisioning/cloudcredentials/ports.go, internal/provisioning/cloudcredentials/repo/credentials_pg.go |
| Sweeper | The steady-state TTL expiry worker. Run(ctx) walks Repository.ListExpired in pages, applies MarkExpired + AppendOutboxEvent + RecordOutboxToken per row inside one RunInTx, and flips the /readyz readiness flag on its first clean return. Boot-time Run gates /readyz; subsequent ticks honour Config.SweepInterval. | internal/provisioning/cloudcredentials/sweeper/sweeper.go |
| AuditSink | The audit-emission port the Custodian writes through (Record(ctx, entry) error). The shape mirrors internal/audit.Sink so the composition root wires a thin shim that translates AuditEntry into audit.Entry without pulling the audit package into this module. Failures are counted (AuditSinkFailuresTotal) and logged after-commit, not re-raised: the broker row and outbox event are durable by the time the audit hand-off runs. | internal/provisioning/cloudcredentials/ports.go, internal/provisioning/cloudcredentials/custodian.go |
| broker-row version | The CAS counter on the plexsphere.cloud_credential row. Incremented by every application-side mutation (issue / rotate / revoke / mark-expired). Repository writers pass an expected_version precondition; a mismatch returns ErrBrokerCASConflict rather than silently overwriting. | 0022_cloud_credentials.sql, E0_cloud_credentials.sql |
| kv_version | The OpenBao KV-v2 store's own monotonic per-path counter. Surfaces in bao kv metadata output. The custodian mirrors the post-write value on the broker row so operators can correlate the broker row against KV-v2 metadata without round-tripping to OpenBao. Distinct from the broker-row version because the remediation differs (see Error sentinels). | 0022_cloud_credentials.sql |
| lifecycle stamps | expires_at is the wall-clock instant the credential ceases to be valid. revoked_at is non-NULL once the operator has issued a soft-delete. expired_at is non-NULL once the Sweeper has observed the credential's expiry. The Sweeper's eligibility predicate is revoked_at IS NULL AND expired_at IS NULL AND expires_at <= now() and is backed by a partial index. A SQL CHECK gates the revoked_at / expired_at exclusivity. | 0022_cloud_credentials.sql |
| cloud_credential_outbox_token | The sibling table holding one row per (cloud_credential_id, event_type) pair under a PRIMARY KEY. Structural at-most-once guarantee for every custodian lifecycle event: a retried Revoke or a Sweeper-issued MarkExpired never appends a second CloudCredentialRevoked / CloudCredentialExpired row to the outbox. The event_type column carries a snake_case discriminator from the migration's CHECK allow-list (cloud_credential_issued / _rotated / _revoked / _expired); the in-Go mapping lives on events.TokenEventType. | 0022_cloud_credentials.sql, internal/provisioning/cloudcredentials/events/events.go |
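
The Material entry above describes two safeguards: a defensive copy of the caller-owned bytes and a redacted String() projection. The following is a minimal sketch of both, assuming a simplified Material type; the Clone method name and the exact descriptor text are illustrative assumptions, not the shapes declared in ports.go.

```go
package main

import (
	"fmt"
	"time"
)

// Material is a simplified stand-in for the custodian's secret carrier.
type Material struct {
	Payload   []byte
	KeyValues map[string]string
	TTL       time.Duration
}

// String returns a redacted descriptor: sizes and TTL only, never secret bytes.
// The descriptor format is an assumption made for this sketch.
func (m Material) String() string {
	return fmt.Sprintf("Material{payload=%dB, keys=%d, ttl=%s, REDACTED}",
		len(m.Payload), len(m.KeyValues), m.TTL)
}

// Clone (hypothetical helper) deep-copies the slice and the map so later
// caller mutation cannot reach the copy that is persisted.
func (m Material) Clone() Material {
	out := Material{TTL: m.TTL}
	if m.Payload != nil {
		out.Payload = append([]byte(nil), m.Payload...)
	}
	if m.KeyValues != nil {
		out.KeyValues = make(map[string]string, len(m.KeyValues))
		for k, v := range m.KeyValues {
			out.KeyValues[k] = v
		}
	}
	return out
}

func main() {
	in := Material{
		Payload:   []byte("s3cret"),
		KeyValues: map[string]string{"k": "v"},
		TTL:       time.Hour,
	}
	stored := in.Clone()

	// The caller mutates its own copy after handing it over.
	in.Payload[0] = 'X'
	in.KeyValues["k"] = "mutated"

	fmt.Println(string(stored.Payload), stored.KeyValues["k"])
	fmt.Printf("%v\n", in) // formatting goes through String(), so no bytes leak
}
```

Because String() is declared on the value receiver, fmt verbs such as %v and %+v pick it up for both values and pointers.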

Aggregates

The CloudCredential aggregate is the only aggregate this context owns. The aggregate root is the plexsphere.cloud_credential row; the plexsphere.cloud_credential_outbox_token row is a structural at-most-once token, not a domain object — it carries no value the domain reasons about beyond the (cloud_credential_id, event_type) identity.

CloudCredential — invariants

The aggregate enforces six invariants that travel across the SQL schema, the Repository adapter, and the Custodian application service:

| Invariant | Layer | Failure mode |
| --- | --- | --- |
| (kv_mount, kv_path) is unique. | SQL UNIQUE constraint cloud_credential_kv_path_unique. | ErrPathAlreadyMaterialised: a chosen-credential-id collision against an existing Cloud's path. |
| cloud_credential_id is unique. | SQL PRIMARY KEY. | ErrCloudCredentialAlreadyExists: a re-Issue with mismatched fields. |
| cloud_id references an existing row in plexsphere.clouds(id). | SQL FOREIGN KEY with ON DELETE RESTRICT. | ErrCloudNotFound on an absent FK target; a Cloud delete observes CloudNotEmptyError while at least one non-expired credential references it. |
| revoked_at IS NULL OR expired_at IS NULL. | SQL CHECK. | The CHECK fires on the corner case where both stamps are non-null; in normal operation the application transitions through one terminal stamp. |
| kv_path is the deterministic projection of (cloud_id, cloud_credential_id). | Application service (Materialiser.DerivePath) + persistence (broker row INSERT). | Drift between the two surfaces produces a chain-of-custody mismatch the Sweeper would detect on the next pass. |
| Issue / Rotate / Revoke / MarkExpired runs inside one Postgres transaction with the matching outbox append and token record. | Application service (Custodian + RunInTx). | A partial commit would either leak a KV-v2 row without a broker row (ErrIssueAtomicityViolated) or commit a broker row without an outbox event (the at-least-once contract would degrade to no-shot). |

CloudCredential — state machine

The CloudCredential aggregate moves through three stages: Issue, Active, and a terminal stage that is either Revoked or Expired. Once the row carries either terminal stamp (revoked_at or expired_at), no further rotation is possible; operators re-issue rather than resurrect.

text
  ┌──────────────────────────────────────────────────────────┐
  │  Issue                                                    │
  │   · Mint CloudCredentialID (UUIDv7)                       │
  │   · Materialiser.DerivePath → (mount, path)              │
  │   · Materialiser.Put (KV-v2 CAS=0 — first write)         │
  │   · INSERT plexsphere.cloud_credential                   │
  │   · AppendOutboxEvent (CloudCredentialIssued)            │
  │   · RecordOutboxToken (cloud_credential_issued)          │
  │   · compensating Materialiser.Delete on tx rollback      │
  └──────────────────┬───────────────────────────────────────┘
                     │  version = 1, kv_version = 1
                     ▼
  ┌──────────────────────────────────────────────────────────┐
  │  Active                                                   │
  │   · Rotate → version' = version + 1                      │
  │              kv_version' = kv_version + 1                │
  │   · Lookup → SELECT row                                   │
  └────────┬───────────────────────────────────┬─────────────┘
           │ Revoke                            │ Sweeper.Run + expires_at ≤ now
           ▼                                   ▼
  ┌─────────────────────┐             ┌─────────────────────┐
  │ Revoked (terminal)  │             │ Expired (terminal)  │
  │   revoked_at != NULL│             │   expired_at != NULL│
  │   CloudCredentialRev│             │   CloudCredentialExp│
  └─────────────────────┘             └─────────────────────┘
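
The transitions in the diagram can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the custodian's code: the credential type, its method names, and errExpired are hypothetical, errRevoked stands in for ErrCloudCredentialRevoked, and the sketch assumes Rotate re-derives expires_at from the new TTL.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errRevoked = errors.New("cloud credential revoked") // stands in for ErrCloudCredentialRevoked
	errExpired = errors.New("cloud credential expired") // hypothetical; the source names no such sentinel
)

// credential models only the fields the state machine reasons about.
type credential struct {
	version, kvVersion   int64
	expiresAt            time.Time
	revokedAt, expiredAt *time.Time
}

// issue mints an Active credential with version = 1 and kv_version = 1.
func issue(now time.Time, ttl time.Duration) *credential {
	return &credential{version: 1, kvVersion: 1, expiresAt: now.Add(ttl)}
}

func (c *credential) state() string {
	switch {
	case c.revokedAt != nil:
		return "revoked"
	case c.expiredAt != nil:
		return "expired"
	default:
		return "active"
	}
}

// rotate advances both counters; terminal rows are rejected.
func (c *credential) rotate(now time.Time, ttl time.Duration) error {
	switch c.state() {
	case "revoked":
		return errRevoked
	case "expired":
		return errExpired
	}
	c.version++
	c.kvVersion++
	c.expiresAt = now.Add(ttl)
	return nil
}

// revoke is idempotent: a second call on a terminal row changes nothing.
func (c *credential) revoke(now time.Time) {
	if c.state() != "active" {
		return
	}
	c.revokedAt = &now
	c.version++
}

// markExpired applies the Sweeper predicate:
// revoked_at IS NULL AND expired_at IS NULL AND expires_at <= now.
func (c *credential) markExpired(now time.Time) bool {
	if c.state() != "active" || c.expiresAt.After(now) {
		return false
	}
	c.expiredAt = &now
	c.version++
	return true
}

// demo walks Issue, Rotate, Revoke, then shows both terminal guards.
func demo() (state string, version int64, rotateRejected, sweptAfterRevoke bool) {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	c := issue(t0, time.Hour)
	_ = c.rotate(t0.Add(time.Minute), time.Hour)
	c.revoke(t0.Add(2 * time.Minute))
	err := c.rotate(t0.Add(3*time.Minute), time.Hour)
	swept := c.markExpired(t0.Add(48 * time.Hour))
	return c.state(), c.version, errors.Is(err, errRevoked), swept
}

func main() {
	fmt.Println(demo())
}
```

The two terminal stamps are never both set in this model, which mirrors the SQL CHECK on revoked_at / expired_at exclusivity.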

Ports

The cloudcredentials package declares four ports (internal/provisioning/cloudcredentials/ports.go) plus an AuditSinkPort alias declared in the Sweeper. Each port is the narrow seam through which the bounded context reaches its collaborators; the depguard no-cross-context-imports rule keeps the cloudcredentials module free of direct imports of internal/identity, internal/audit, or internal/platform/secretstore.

| Port | Methods | Adapter | Test seam |
| --- | --- | --- | --- |
| Custodian | Issue, Rotate, Revoke, Lookup | concrete *custodian in custodian.go; wired into the production binary by cmd/plexsphere/cloudcredentials_factory_prod.go (composition root). NewCustodian refuses construction on any nil load-bearing port (Repository / Materialiser / AuditSink), so an operator-misconfigured composition fails at build time, not on the first Issue. | unit-level fakes in custodian_test.go cover Issue / Rotate / Revoke / Lookup including the compensating-delete and CAS-conflict branches; integration-level tests at tests/integration/cloudcredentials_pool_*_test.go drive the production Custodian against a real Postgres + OpenBao stack via the testcontainers fixtures. |
| Materialiser | Put, Delete, DerivePath | default fail-closed stub in materialiser/materialiser.go; real OpenBao adapter under //go:build cloudcredentials_openbao in materialiser/openbao/openbao.go. The OpenBao adapter wraps *secretstore.Client.KVPut / KVDelete and translates SDK CAS-mismatch wording onto cloudcredentials.ErrKVStoreCASConflict. | DerivePath is pure logic, so the unit tests exercise it without a fake (TestDerivePathRejectsZeroUUID, TestDerivePathFormat); the build-tagged adapter has its own unit tests at materialiser/openbao/openbao_test.go; integration coverage runs against a real OpenBao container. |
| Repository | Create, FindByID, RotateCAS, Revoke, ListExpired, MarkExpired, AppendOutboxEvent, RecordOutboxToken, FindOutboxToken, RunInTx | Postgres in repo/credentials_pg.go, wrapping the sqlc-generated Queries from E0_cloud_credentials.sql. Constraint-name dispatch maps cloud_credential_pkey to ErrCloudCredentialAlreadyExists, cloud_credential_kv_path_unique to ErrPathAlreadyMaterialised, and a 23503 FK violation on cloud_id to ErrCloudNotFound. | the unit tests inject a fake sqlc layer that asserts each constraint-name dispatch path; the build-tagged repo/credentials_pg_test.go integration tests pin Create idempotency, RotateCAS happy + ErrBrokerCASConflict, Revoke idempotency, and ListExpired pagination. |
| AuditSink | Record(ctx, AuditEntry) → error | wired via a one-method pass-through shim in the composition root that translates cloudcredentials.AuditEntry into audit.Entry. The Sweeper holds its own AuditSinkPort alias with the same shape (sweeper.AuditEntry / sweeper.AuditSinkPort) so the sub-package stays free of an internal/audit import. | the Custodian unit tests use an in-memory recording sink; the integration tests reuse the same sink shape for emission-ordering assertions. |
| Sweeper | Run | concrete *Sweeper in sweeper/sweeper.go; ProbeFunc is the /readyz mount, exported through the composition root as CloudCredentialsWiring.SweepProbe. | unit tests use a fixed Clock and an in-memory Repository; the boot-probe + ticker contracts are pinned by tests/integration/cloudcredentials_pool_sweeper_test.go and the app-level boot test at cmd/plexsphere/app_cloudcredentials_wiring_test.go. |

The Clock port is a convenience seam used by both the Custodian and the Sweeper; it lives in the same ports.go file so a reviewer auditing the bounded-context surface finds every external dependency in one place. There is no DomainResolver port — the CloudCredential aggregate is platform-scoped on (cloud_id, credential_id) (see DECISION block in doc.go).

Outbox events

The custodian emits four typed domain events to plexsphere.outbox_events inside the same Postgres transaction as the broker-row mutation. Event types are written to the event_type column verbatim — the string form is part of the wire contract once a row has been emitted. Each event row carries aggregate_type = "cloud_credential" so the relay routes cloud-credentials rows distinctly from sibling contexts.

| Event type (column value) | Trigger | Payload struct |
| --- | --- | --- |
| cloudcredentials.CloudCredentialIssued | Custodian.Issue succeeds. | events.CloudCredentialIssued |
| cloudcredentials.CloudCredentialRotated | Custodian.Rotate succeeds (CAS-protected). | events.CloudCredentialRotated |
| cloudcredentials.CloudCredentialRevoked | Custodian.Revoke succeeds. Carries the operator-supplied Reason. | events.CloudCredentialRevoked |
| cloudcredentials.CloudCredentialExpired | Sweeper.Run observes a row with expires_at <= now() AND revoked_at IS NULL AND expired_at IS NULL. | events.CloudCredentialExpired |

Every payload carries a UUIDv7 EventID, a UTC OccurredAt timestamp, and the relevant credential identity. The event_type set is closed — adding a fifth value is a breaking schema change, not a switch-statement extension. The package-local drift gate TestEventTypesAreClosedSet enforces the four-event allow-list. Field-level shapes are pinned in the reference page.

At-most-once via cloud_credential_outbox_token

The sibling plexsphere.cloud_credential_outbox_token table holds one row per (cloud_credential_id, event_type) pair under a PRIMARY KEY. Each custodian transition (issued / rotated / revoked / expired) MUST emit the corresponding outbox event at most once per credential. The PRIMARY KEY is the smallest invariant that closes the door at the storage layer: when a retried Revoke or a Sweeper-issued MarkExpired repeats the token INSERT, the INSERT trips a unique violation, and the application catches it and skips the outbox append while leaving the broker row in its terminal state. This is the structural guarantee that tightens the at-least-once outbox contract to exactly-once for every custodian lifecycle event; see the file-header DECISION block on 0022_cloud_credentials.sql for the full rationale.

The event_type column on the token table is a snake_case discriminator gated by a CHECK allow-list (cloud_credential_issued / _rotated / _revoked / _expired). The in-Go mapping from the outbox event_type literal to the token-table discriminator lives on events.TokenEventType; the mapping is closed and pinned by the events package unit tests.
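
The token mechanism can be illustrated with an in-memory sketch, assuming a map keyed on (credential id, discriminator) in place of the PRIMARY KEY. The event-type literals and snake_case discriminators come from the text above; the store type, emitOnce, and the tokenEventType function signature are hypothetical and do not reproduce events.TokenEventType.

```go
package main

import "fmt"

const (
	evIssued  = "cloudcredentials.CloudCredentialIssued"
	evRotated = "cloudcredentials.CloudCredentialRotated"
	evRevoked = "cloudcredentials.CloudCredentialRevoked"
	evExpired = "cloudcredentials.CloudCredentialExpired"
)

// tokenEventType maps an outbox event_type literal onto the token-table
// discriminator. The mapping is closed: unknown literals are rejected.
func tokenEventType(eventType string) (string, bool) {
	switch eventType {
	case evIssued:
		return "cloud_credential_issued", true
	case evRotated:
		return "cloud_credential_rotated", true
	case evRevoked:
		return "cloud_credential_revoked", true
	case evExpired:
		return "cloud_credential_expired", true
	}
	return "", false
}

// tokenKey plays the role of the (cloud_credential_id, event_type) PRIMARY KEY.
type tokenKey struct{ credentialID, eventType string }

type store struct {
	tokens map[tokenKey]struct{}
	outbox []string
}

func newStore() *store {
	return &store{tokens: make(map[tokenKey]struct{})}
}

// emitOnce records the token and appends the outbox row. A duplicate token
// is treated like a caught unique violation: the append is skipped.
func (s *store) emitOnce(credentialID, eventType string) (bool, error) {
	disc, ok := tokenEventType(eventType)
	if !ok {
		return false, fmt.Errorf("unknown event type %q", eventType)
	}
	key := tokenKey{credentialID, disc}
	if _, dup := s.tokens[key]; dup {
		return false, nil
	}
	s.tokens[key] = struct{}{}
	s.outbox = append(s.outbox, eventType)
	return true, nil
}

func main() {
	s := newStore()
	first, _ := s.emitOnce("cred-1", evRevoked)
	retry, _ := s.emitOnce("cred-1", evRevoked) // retried Revoke
	fmt.Println(first, retry, len(s.outbox))
}
```

In the real schema the token INSERT and the outbox append share one Postgres transaction, so the pair commits or rolls back together; the sketch omits that transactional boundary.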

Error sentinels

Every operation funnels through one of the package-local sentinels. Callers branch on these via errors.Is; wrapping is fine as long as the sentinel's identity stays intact. The set is closed: adding a twelfth sentinel without updating errors_test.go trips the TestErrors_AreClosedSet drift gate when the package tests run.

| Sentinel | Layer | Trigger | Remediation |
| --- | --- | --- | --- |
| ErrCloudCredentialNotFound | Repository / Custodian | FindByID / Lookup for an absent CloudCredentialID. | Re-Lookup against the broker inventory or accept that the row is gone; the custodian has no HTTP surface so there is no transport mapping today. |
| ErrCloudCredentialAlreadyExists | Repository (constraint-name dispatch on cloud_credential_pkey PRIMARY KEY) | A second Issue for the same id with mismatched fields. | Caller treats the row as established; the custodian is idempotent on the credential id. |
| ErrPathAlreadyMaterialised | Repository (constraint-name dispatch on cloud_credential_kv_path_unique UNIQUE on (kv_mount, kv_path)) | A chosen-credential-id collision against an existing Cloud's deterministic path. | The deterministic path derivation includes the Cloud UUID prefix, so a cross-Cloud collision is structurally impossible; the sentinel exists for the chosen-collision attack scenario and surfaces it loudly, distinct from ErrCloudCredentialAlreadyExists. |
| ErrCloudNotFound | Repository (constraint-name dispatch on SQLSTATE 23503 FK violation on cloud_id) | Issue against a Cloud row that does not exist (or has been deleted). | Caller surfaces the missing Cloud upstream; no broker row, no KV-v2 write, no outbox event lands. |
| ErrCloudCredentialRevoked | Custodian | Rotate on a row whose revoked_at is non-null. | Re-issue rather than resurrect; the aggregate is terminal. |
| ErrBrokerCASConflict | Repository.RotateCAS | Broker-row version has advanced past the caller's ExpectedVersion. | Re-Lookup to observe the winning rotation, then retry with the new version. |
| ErrKVStoreCASConflict | Materialiser.Put | KV-v2 store's own version has advanced past the broker's expected kv_version. | Distinct from ErrBrokerCASConflict: the KV-v2 store is out of sync with the broker row. Caller escalates the reconciliation incident; this is not a normal contention loss. |
| ErrMaterialiserUnavailable | Materialiser.Put / Delete | Non-CAS KV-v2 failure (network, transport timeout, OpenBao unsealed-but-blocked) and the in-package fail-closed stub's posture when no OpenBao adapter is wired. | Fail closed: writing the broker row without a KV-v2 record would leave the platform with a credential id resolving to nothing. |
| ErrIssueAtomicityViolated | Custodian | Compensating Materialiser.Delete fires after a Postgres rollback AND the delete itself also fails, so the KV-v2 row is orphaned. | The custodian logs and re-raises with the original tx error AND the compensating-delete error joined via errors.Join; operators reconcile the orphaned KV-v2 row. |
| ErrAuditUnavailable | Custodian (counter only) | AuditSink.Record fails after the custodian has committed its Postgres transaction. The custodian decision is durable but the audit chain has gapped. | Operators alert on AuditSinkFailuresTotal() (the process-wide counter) and chase down the audit-side outage. |
| ErrInvalidPathInput | Repository / Materialiser | kv_path or kv_mount violates the path-format invariant before the SQL UNIQUE/NOT NULL gate runs (typically zero-UUID input to DerivePath returning the empty pair). | Programmer error; surfaces in tests, not in production. |
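
As a sketch of how a caller might branch on these sentinels, the example below wraps errors with %w at two layers and still matches them with errors.Is. The sentinel variables are re-declared locally so the example is self-contained; the message strings, the wrap helper, and the remediation function are illustrative assumptions rather than package API.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for three of the package sentinels; the messages are assumed.
var (
	ErrBrokerCASConflict      = errors.New("cloudcredentials: broker CAS conflict")
	ErrKVStoreCASConflict     = errors.New("cloudcredentials: KV store CAS conflict")
	ErrCloudCredentialRevoked = errors.New("cloudcredentials: credential revoked")
)

// wrap adds operation context while keeping the sentinel reachable via %w.
func wrap(op string, err error) error {
	return fmt.Errorf("%s: %w", op, err)
}

// remediation maps an error onto the action from the sentinel table.
func remediation(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrBrokerCASConflict):
		return "re-lookup and retry"
	case errors.Is(err, ErrKVStoreCASConflict):
		return "escalate reconciliation"
	case errors.Is(err, ErrCloudCredentialRevoked):
		return "re-issue"
	default:
		return "unhandled"
	}
}

func main() {
	err := wrap("custodian.Rotate", wrap("repo.RotateCAS", ErrBrokerCASConflict))
	fmt.Println(err)
	fmt.Println(remediation(err))
	fmt.Println(remediation(wrap("custodian.Rotate", ErrKVStoreCASConflict)))
}
```

Wrapping with %v instead of %w would flatten the sentinel into text and break the errors.Is match, which is why identity has to survive every layer.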

Unlike the OpenBao Credential Broker, the cloud-credentials custodian has no ErrDomainUnresolved sentinel. The aggregate is platform-scoped on Cloud — there is no DomainResolver port and no per-Domain residency pivot in the current shape. The DECISION block on doc.go explains the trade-off: pre-committing to a project_id / domain_id pair here would have forced a NULL-as-sentinel hazard in the schema and an unused DomainResolver port in the application layer. The request/approve workflow instead lives in the sibling credentialassignment sub-context with its own table — see ./rebac.md.

Sweeper cadence

The Sweeper runs on two cadences, both driven by the production composition root in cmd/plexsphere/cloudcredentials_factory_prod.go:

  1. Boot Run: synchronous, inside the shared reconcile-probe boot timeout horizon. The first successful pass flips the *Sweeper.ready atomic so ProbeFunc returns nil and /readyz turns green. Until then ProbeFunc returns the errProbePending sentinel and the orchestrator drains traffic. The probe is registered under the canonical name cloud-credentials-sweeper; dashboards and runbooks grep for that string verbatim.
  2. Steady-state ticker: time.NewTicker(Config.SweepInterval). The default is 30 s (defaultCloudCredentialsSweepInterval, PLEXSPHERE_CLOUD_CREDENTIALS_SWEEP_INTERVAL overrides it). Per-tick errors are logged and the next tick retries; the ticker honours ctx.Err() on every iteration. Each tick re-runs Sweeper.Run, which is idempotent and self-terminating.

The boot probe and the ticker share one *Sweeper instance. ProbeFunc does NOT re-trigger Run — the bootstrap reconcile probe contract is that the probe re-runs the closure the composition root threads in. The DECISION block on ProbeFunc (sweeper.go) explains why a self-driving probe would double-count cloud_credentials_sweeper_invocations_total on every /readyz scrape.
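
The two cadences and the read-only probe can be sketched as follows. This is a simplified illustration, not the code in sweeper.go: the sweeper struct, its sweep callback, the runs counter, and the Loop method are assumptions; only the ready flag, errProbePending, and the ticker-driven retry come from the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

var errProbePending = errors.New("sweeper: boot pass pending")

type sweeper struct {
	ready atomic.Bool  // flipped by the first clean Run
	runs  atomic.Int64 // stands in for the invocations counter
	sweep func(context.Context) error
}

// Run performs one pass and marks the sweeper ready on a clean return.
func (s *sweeper) Run(ctx context.Context) error {
	s.runs.Add(1)
	if err := s.sweep(ctx); err != nil {
		return err
	}
	s.ready.Store(true)
	return nil
}

// Probe only reads the flag; it never triggers Run, so readiness scrapes
// cannot inflate the invocation count.
func (s *sweeper) Probe() error {
	if s.ready.Load() {
		return nil
	}
	return errProbePending
}

// Loop is the steady-state ticker: log per-tick errors, retry on the next tick.
func (s *sweeper) Loop(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := s.Run(ctx); err != nil {
				fmt.Println("sweep tick failed:", err)
			}
		}
	}
}

// demo shows the boot sequence: pending, one Run, then green probes.
func demo() (pendingBefore, readyAfter bool, runs int64) {
	s := &sweeper{sweep: func(context.Context) error { return nil }}
	pendingBefore = errors.Is(s.Probe(), errProbePending)
	_ = s.Run(context.Background())
	readyAfter = s.Probe() == nil
	_ = s.Probe() // further scrapes leave the run count untouched
	return pendingBefore, readyAfter, s.runs.Load()
}

func main() {
	fmt.Println(demo())

	s := &sweeper{sweep: func(context.Context) error { return nil }}
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	s.Loop(ctx, 10*time.Millisecond)
	fmt.Println("ticks observed:", s.runs.Load())
}
```

Keeping Probe free of side effects is the design point: the readiness endpoint may be scraped at any rate without changing how often the sweep itself runs.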

Metrics

The Sweeper exports two zero-value-tolerant counter vectors via metrics.go:

| Metric | Type | Increments when |
| --- | --- | --- |
| plexsphere_cloud_credentials_sweeper_invocations_total | counter | Sweeper.Run is entered (per call). |
| plexsphere_cloud_credentials_sweeper_expirations_total | counter | A row is successfully MarkExpired and the matching CloudCredentialExpired outbox row is appended. |

A nil prometheus.Registerer (WithRegisterer(nil) or no option) keeps the counters in zero-value mode — Run still walks the in-memory loop, but no scrape surface is registered. This is the deliberate posture for unit tests so registry collisions across parallel test runs cannot trip a global registry shared by production code paths. The prometheus.AlreadyRegisteredError branch reuses the existing collector so multiple service instances sharing one registry (typical in tests) do not panic on the second registration.

KV-v2 path derivation

The custodian, the audit projector, and the Sweeper all agree on where a credential lives by computing its path from the same deterministic helper. The helper lives in materialiser/materialiser.go and is the only path-shaping logic in the bounded context — every caller that needs a KV-v2 location reaches DerivePath rather than re-rolling the format.

text
  mount = Config.KVMount                              (verbatim, e.g. "kv")
  path  = "clouds/<cloudID>/credentials/<credentialID>"

<cloudID> and <credentialID> are the canonical 8-4-4-4-12 hyphenated UUID textual form, matching cloudcredentials.CloudCredentialID.String(). The OpenBao adapter applies the KV-v2 layout on top, so the data row for a credential lives at /<mount>/data/<path> and the metadata row at /<mount>/metadata/<path>. That extra layer is the OpenBao adapter's concern — DerivePath returns only the logical (mount, path) pair so the audit projector can compute it without importing the OpenBao client (and the depguard rules hold).

The format is part of the wire contract: once the custodian has issued a credential, its KV-v2 location is stable for the lifetime of the credential. Changing the format is a migration, not a refactor.

Zero-UUID handling

Both cloudID and credentialID must be non-zero. The DECISION block next to (*Materialiser).DerivePath (materialiser.go) documents why a zero argument returns the empty (mount, path) pair rather than panicking: the custodian's transactional path detects the empty pair and short-circuits with ErrInvalidPathInput before opening the Postgres transaction. The empty-pair return is a value that fits the existing failure surface and the unit test TestDerivePathRejectsZeroUUID pins this behaviour.
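
The path format and the zero-UUID rule can be illustrated with a short sketch. It assumes a plain [16]byte UUID type and a free function named derivePath; the real helper is a method on the Materialiser, so the signature here is hypothetical.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// UUID is a minimal 16-byte identifier used only for this sketch.
type UUID [16]byte

// String renders the canonical 8-4-4-4-12 hyphenated form.
func (u UUID) String() string {
	var buf [36]byte
	hex.Encode(buf[0:8], u[0:4])
	buf[8] = '-'
	hex.Encode(buf[9:13], u[4:6])
	buf[13] = '-'
	hex.Encode(buf[14:18], u[6:8])
	buf[18] = '-'
	hex.Encode(buf[19:23], u[8:10])
	buf[23] = '-'
	hex.Encode(buf[24:36], u[10:16])
	return string(buf[:])
}

// IsZero reports whether every byte is zero.
func (u UUID) IsZero() bool { return u == UUID{} }

// derivePath returns the logical (mount, path) pair, or the empty pair when
// either identifier is zero. The caller is expected to treat the empty pair
// as invalid input before touching any store.
func derivePath(mount string, cloudID, credentialID UUID) (string, string) {
	if cloudID.IsZero() || credentialID.IsZero() {
		return "", ""
	}
	return mount, "clouds/" + cloudID.String() + "/credentials/" + credentialID.String()
}

func main() {
	var cloud, cred UUID
	cloud[15] = 0x01
	cred[0] = 0xff

	mount, path := derivePath("kv", cloud, cred)
	fmt.Println(mount, path)

	mount, path = derivePath("kv", UUID{}, cred)
	fmt.Printf("%q %q\n", mount, path)
}
```

Returning the empty pair rather than panicking keeps the helper pure, so a caller can map it onto a typed error such as ErrInvalidPathInput.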

Operational model

The custodian is opt-in at the composition root. The PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT env var is the activation gate; when empty the binary keeps the custodian stub and the Sweeper does not run. The DECISION block on productionCloudCredentialsConfigFromEnv (cloudcredentials_factory_prod.go) explains why the custodian activates on KV mount rather than DSN: the external dependency (OpenBao KV-v2) is more sensitive than Postgres, and a half-wired KV path would write secret material to a place the operator did not authorise.

| Env var | Effect | Default |
| --- | --- | --- |
| PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT | Custodian activation gate. Empty = inert. Non-empty + empty PLEXSPHERE_DSN = build-time error. | "" (inert). |
| PLEXSPHERE_DSN | The Postgres connection string the custodian shares with the rest of the platform. Required when PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT is set. | "". |
| PLEXSPHERE_CLOUD_CREDENTIALS_OPENBAO_ADDRESS | The OpenBao cluster URL the secretstore client dials. Held on the config so a future env-driven step can build the *secretstore.Client from (KVAddress, KVAuth). | "". |
| PLEXSPHERE_CLOUD_CREDENTIALS_SWEEP_INTERVAL | Steady-state ticker period. Parsed with time.ParseDuration; non-positive is rejected at build time. | 30s. |
| PLEXSPHERE_CLOUD_CREDENTIALS_ALLOW_INSECURE_MATERIALISER | Opts the binary into the in-package Materialiser stub when no *secretstore.Client is wired. For early-boot dev clusters only. | false. |

Build-time validation refuses construction on the ErrCloudCredentialsKVMountRequired sentinel when DSN is set but KVMount is empty, surfacing a misconfigured operator before /readyz lights up green. The trade-off mirrors BuildProductionCredentialsFactory. The wiring receipt tests/workspace/cloudcredentials_factory_wiring_receipt_test.go asserts the factory is plumbed through cmd/plexsphere/main.go exactly once so a future refactor cannot silently disconnect the custodian from Run.
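
A minimal sketch of the gating rules from the env var table (empty mount is inert, a set mount requires a DSN, and the sweep interval must parse as a positive duration) might look like the following. The config struct, the configFromEnv function, and both error values are hypothetical; the sketch does not model ErrCloudCredentialsKVMountRequired or the OpenBao address.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// Hypothetical error values for this sketch only.
var (
	errDSNRequired = errors.New("PLEXSPHERE_DSN is required when the KV mount is set")
	errBadInterval = errors.New("sweep interval must be a positive duration")
)

type config struct {
	Active        bool // false means the custodian stays inert
	KVMount       string
	DSN           string
	SweepInterval time.Duration
}

// configFromEnv applies the table's rules. getenv is injected so the
// function can be exercised without touching the process environment.
func configFromEnv(getenv func(string) string) (config, error) {
	mount := getenv("PLEXSPHERE_CLOUD_CREDENTIALS_KV_MOUNT")
	if mount == "" {
		return config{}, nil // activation gate closed: inert
	}
	dsn := getenv("PLEXSPHERE_DSN")
	if dsn == "" {
		return config{}, errDSNRequired
	}
	interval := 30 * time.Second // documented default
	if raw := getenv("PLEXSPHERE_CLOUD_CREDENTIALS_SWEEP_INTERVAL"); raw != "" {
		d, err := time.ParseDuration(raw)
		if err != nil {
			return config{}, fmt.Errorf("%w: %v", errBadInterval, err)
		}
		if d <= 0 {
			return config{}, errBadInterval
		}
		interval = d
	}
	return config{Active: true, KVMount: mount, DSN: dsn, SweepInterval: interval}, nil
}

func main() {
	cfg, err := configFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("refusing to build:", err)
		return
	}
	fmt.Printf("active=%v mount=%q interval=%s\n", cfg.Active, cfg.KVMount, cfg.SweepInterval)
}
```

Validating before any client is constructed keeps a half-wired configuration from ever reaching the point where secret material could be written.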

Threat model

The custodian shoulders the credential lifecycle for cloud-scoped secrets. Four attacker shapes drive its defences:

Attacker A — concurrent rotations / split-brain rotation

Two callers race a Rotate on the same credential. Without protection the loser would either overwrite the winner's KV-v2 write or persist a broker row whose kv_version no longer matches the live secret.

Defence. The custodian carries two distinct CAS counters:

  • version on the broker row — RotateCAS runs UPDATE … WHERE cloud_credential_id = $1 AND version = $expected so the loser observes zero rows updated and surfaces ErrBrokerCASConflict.
  • kv_version on the KV-v2 store — Materialiser.Put passes the expected version as the OpenBao CAS argument; a CAS mismatch surfaces ErrKVStoreCASConflict.

The two sentinels are deliberately distinct so the remediation differs: a broker CAS miss means another rotate already won and the caller should re-Lookup; a KV CAS miss means the KV-v2 store is out of sync with the broker row and the caller should escalate the reconciliation incident. The integration test cloudcredentials_pool_rotate_test.go asserts the two are distinguishable via errors.Is.
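
The broker-row half of this defence can be sketched in memory. The repo type below is an assumption that replaces the UPDATE … WHERE version = $expected statement with a mutex-guarded map; the RotateCAS signature and errNotFound are illustrative, not the Repository port's actual shape.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrBrokerCASConflict = errors.New("cloudcredentials: broker CAS conflict")
	errNotFound          = errors.New("cloudcredentials: credential not found") // stand-in
)

type row struct {
	Version   int64
	KVVersion int64
}

type repo struct {
	mu   sync.Mutex
	rows map[string]row
}

// RotateCAS updates the row only when the stored version equals the caller's
// expectation, mirroring an UPDATE that matches zero rows on a stale version.
func (r *repo) RotateCAS(id string, expectedVersion, newKVVersion int64) (row, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	cur, ok := r.rows[id]
	if !ok {
		return row{}, errNotFound
	}
	if cur.Version != expectedVersion {
		return row{}, ErrBrokerCASConflict
	}
	cur.Version++
	cur.KVVersion = newKVVersion
	r.rows[id] = cur
	return cur, nil
}

// race lets two callers rotate from the same observed version.
func race() (wins, conflicts int) {
	r := &repo{rows: map[string]row{"cred": {Version: 1, KVVersion: 1}}}
	results := make(chan error, 2)
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := r.RotateCAS("cred", 1, 2)
			results <- err
		}()
	}
	wg.Wait()
	close(results)
	for err := range results {
		switch {
		case err == nil:
			wins++
		case errors.Is(err, ErrBrokerCASConflict):
			conflicts++
		}
	}
	return wins, conflicts
}

func main() {
	wins, conflicts := race()
	fmt.Println("wins:", wins, "conflicts:", conflicts)
}
```

Whichever goroutine runs second sees version 2 instead of the expected 1, so the outcome is one winner and one conflict regardless of scheduling.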

Attacker B — chosen-credential-id path collision

A caller crafts a credential UUID that, after deterministic path derivation, collides with an existing credential in another Cloud. Without the UNIQUE on (kv_mount, kv_path) the second Issue would either overwrite the first credential's broker row or write to the same KV-v2 path.

Defence. The deterministic path includes the Cloud UUID prefix (clouds/<cloudID>/credentials/<credentialID>), so any cross-Cloud collision is structurally impossible. The PRIMARY KEY on cloud_credential_id gates the within-Cloud case. A constraint-name dispatch in the Repository adapter maps the SQLSTATE 23505 collision on cloud_credential_kv_path_unique to ErrPathAlreadyMaterialised so a second Issue surfaces a typed error rather than corrupting the inventory.

Attacker C — orphaned Cloud reference

A caller attempts to issue a credential against a Cloud that does not exist (or has been deleted concurrently), or an operator attempts to delete a Cloud while non-expired credentials still reference it. Without the FK + ON DELETE RESTRICT pair the broker inventory would carry rows whose cloud_id resolves to nothing — a forensic gap that makes attribution unreliable.

Defence. The schema declares cloud_id REFERENCES plexsphere.clouds(id) ON DELETE RESTRICT. An Issue against an absent Cloud surfaces SQLSTATE 23503 and the Repository adapter maps it to ErrCloudNotFound — no broker row, no KV-v2 write, no outbox event. A Cloud delete that observes at least one non-expired credential surfaces CloudNotEmptyError with ChildCounts.CloudCredentials > 0 from the count_cloud_credentials_for() helper, so operators see exactly which Cloud is non-empty before draining it. The integration test cloud_delete_blocked_by_credential_test.go asserts both directions: a Cloud with a live credential cannot be deleted, and a Cloud whose credentials have all been revoked + expired can be deleted again.

Attacker D — orphaned KV-v2 row

The Postgres transaction that wraps Issue / Rotate fails to commit after the Materialiser.Put already wrote to KV-v2. Without a compensating delete the platform would carry a KV-v2 row that no broker row references — exactly the forensic gap that makes custodian operations untrustworthy.

Defence. Custodian defers a compensating Materialiser.Delete that runs on Postgres rollback only (Issue path). The compensating delete is best-effort: a transport failure during the compensation itself surfaces ErrIssueAtomicityViolated so an operator can reconcile the orphan from the structured log. The DECISION block on the Custodian explains the loud-but-non-fatal posture — the custodian prefers to surface the orphan over papering it over with silent retry logic. Rotate does NOT fire a compensating Delete on rollback because the prior KV-v2 version is still durable and is the value the rolled-back broker row points at; the same DECISION block enumerates the rotation-specific trade-off.
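
The Issue-path compensation can be sketched as below. The issue function takes three callbacks in place of Materialiser.Put, the Postgres transaction, and Materialiser.Delete; those callbacks, the two hypothetical failure values, and the outcome helper are assumptions made so the example runs on its own. Only the use of errors.Join to carry both errors alongside ErrIssueAtomicityViolated follows the description above.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrIssueAtomicityViolated = errors.New("cloudcredentials: issue atomicity violated")
	errTxRollback             = errors.New("postgres: transaction rolled back") // hypothetical
	errDeleteFailed           = errors.New("kv: compensating delete failed")    // hypothetical
)

// issue writes to KV-v2 first, then commits the broker row. On rollback it
// attempts a compensating delete; if that also fails, both errors are joined
// under the atomicity sentinel so the orphan is surfaced rather than hidden.
func issue(put, commit, compensate func() error) error {
	if err := put(); err != nil {
		return err // nothing written, nothing to compensate
	}
	txErr := commit()
	if txErr == nil {
		return nil
	}
	if delErr := compensate(); delErr != nil {
		return errors.Join(ErrIssueAtomicityViolated, txErr, delErr)
	}
	return txErr
}

// outcome drives issue with injected failures and classifies the result.
func outcome(txFails, deleteFails bool) string {
	ok := func() error { return nil }
	commit, compensate := ok, ok
	if txFails {
		commit = func() error { return errTxRollback }
	}
	if deleteFails {
		compensate = func() error { return errDeleteFailed }
	}
	err := issue(ok, commit, compensate)
	switch {
	case err == nil:
		return "committed"
	case errors.Is(err, ErrIssueAtomicityViolated):
		return "orphaned KV-v2 row"
	default:
		return "rolled back, KV-v2 write compensated"
	}
}

func main() {
	fmt.Println(outcome(false, false))
	fmt.Println(outcome(true, false))
	fmt.Println(outcome(true, true))
}
```

An error built with errors.Join matches errors.Is for each of its members, so a caller can test for the sentinel and still inspect the original transaction error.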

Out of scope

  • Per-Domain audit residency. The custodian is platform-scoped on Cloud; the per-Domain audit hash chain that the OpenBao Credential Broker mirrors lives in the credentials sibling context and is not replicated here. The DECISION block on doc.go documents the trade-off.
  • Plaintext secret return. The custodian stores caller-supplied Material via the KV-v2 adapter; it does not generate secret material itself. Generation lives in the caller (the cloud consumer's own provisioning logic).
  • Rotation cadence. The custodian enforces CAS but does not run scheduled rotations. The Sweeper is for TTL expiry, not for rotation.
  • HTTP surface. The custodian has no /v1/cloud-credentials* endpoint. A workspace-level drift gate scans the generated OpenAPI document for any /v1/cloud-credentials* path and asserts the count is zero so a future change cannot accidentally widen the custodian's blast radius onto the public HTTP surface.
  • Project assignment. The request/approve workflow that binds a cloud credential to one or more Projects lives in the sibling credentialassignment sub-context, not in this custodian. It does not add a project_id column to the cloud_credential table — the binding is a row in the separate cloud_credential_assignment table plus a ReBAC cloudcredential#uses tuple. See ./rebac.md for the assignment lifecycle and the ReBAC contract.