Signing-key rotation workflow
This page explains the end-to-end signing-key rotation workflow that ties the Signing Service to the signed event bus. It is the companion to the narrower deployment-scoping and rotation page, which focuses on the lifecycle state machine and the signing_key.valid_until retention semantics; this page covers the control-plane RPCs, the platform-wide fan-out contract, and the operator-facing CLI.
The workflow opens a finite overlap window during which both the retiring and the incoming signing key verify, fans the resulting signing_key_rotated event out to every plexd Node in the scope so their per-Domain public-key cache invalidates, and then closes the window so the old key is retired. The Signing Service is the only process that ever touches Ed25519 private key material; every other participant in the workflow holds public halves only.
Ubiquitous language
The terms below extend the Signing Service ubiquitous-language table. Names are preserved verbatim across the Go code, the proto contract, the audit log, and the operator-facing CLI.
| Term | Definition | Code anchor |
|---|---|---|
| OpenRotation | The control-plane RPC that begins a rotation for a given scope. Allocates the incoming key via the KeyProvider, persists it together with the open transition row in one transaction, publishes one signing_key_rotated event per Node in the scope, and returns the resulting RotationTransition. | internal/signing/grpc_server.go, internal/signing/rotation_service.go |
| CloseRotation | The control-plane RPC that completes an open rotation. Flips the key pair (old → retired, new → active), stamps the transition row closed, and best-effort-retires the old key handle through the provider. | internal/signing/grpc_server.go, internal/signing/rotation_service.go |
| RotationTransition | The aggregate that binds an outgoing OldKey and incoming NewKey to a Window. Cross-scope transitions (platform → domain, domain A → domain B) are rejected at construction time. | internal/signing/rotation/overlap.go |
| Window (OpenedAt, ClosesAt) | The (OpenedAt, ClosesAt) instant pair during which both halves of a rotation are advertised. Strict OpenedAt < ClosesAt invariant; a zero-duration overlap is rejected by NewWindow. | internal/signing/rotation/overlap.go |
| signing_key_rotated event | The Domain Event the rotation publishes. Carries the (scope, old_key_id, new_key_id, new_public_key, valid_from, opened_at, closes_at) tuple and reaches every plexd Node in the scope via the signed event bus. | internal/identity/tenancy/events/events.go |
| OutboxEventPublisher | The production EventPublisher implementation in the plexsphere-signer binary. Resolves the fan-out target set (the Nodes in the scope) and writes one outbox row per Node inside a dedicated publisher-owned transaction (NOT the rotation transaction). The repo-side CreateKeyAndOpenRotation commits first; the publisher then opens a second pgx.Tx to fan rows out. The two-phase posture is documented in Fan-out narrative and the per-Node rows are all-or-nothing within the publisher tx (mid-fan-out crash leaves either every row or none). | cmd/plexsphere-signer/event_publisher_outbox.go |
| --overlap-window flag | Operator-facing CLI flag on plexsphere-signer (alias env PLEXSPHERE_SIGNER_OVERLAP_WINDOW) that overrides the default rotation overlap duration. Zero or negative values are rejected at startup. | cmd/plexsphere-signer/main.go, internal/signing/rotation_service.go |
| Rollout | The platform-side orchestration wrapper that invokes OpenRotation over gRPC and writes one platform-level audit entry per call. Lives outside the signer so non-signer callers (the plexctl CLI, scheduled jobs, disaster-recovery scripts) reuse the same control-flow seam. | internal/signing/orchestration/rollout.go |
Lifecycle diagram
A rotation drives the SigningKey state machine from active through rotating and back to either active (on the incoming key) or retired (on the outgoing key). The transitions are pure functions in internal/signing/rotation/state.go; no code outside that package may produce a new state from an old one. The Window invariant lives in internal/signing/rotation/overlap.go and is rechecked at every construction site.
```mermaid
stateDiagram-v2
    [*] --> active: Activate (zero -> active)
    active --> rotating: OpenRotation\n(pairs with a new key, opens Window)
    rotating --> active: CloseRotation\n(new key promoted at ClosesAt)
    rotating --> retired: CloseRotation\n(old key retired at ClosesAt)
    active --> retired: Retire\n(emergency retirement)
    retired --> [*]
    note right of rotating
        Window invariant:
        OpenedAt < ClosesAt (strict)
        See rotation/overlap.go
    end note
```

For the deeper state-machine treatment, including the boot-time Resume reconciliation that recovers an overlap window after a signer restart and the field-ownership split between signing_key_transition.(opened_at, closes_at) and signing_key.valid_until, see ./signing/deployment.md.
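The transitions in the diagram and the Window invariant can be sketched as pure functions. This is a minimal illustration, not the contents of internal/signing/rotation: the function names Open, Close, and Retire, the two sentinel errors, and the assumption that the incoming key also sits in rotating during the overlap are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// State names the three lifecycle states shown in the diagram.
type State string

const (
	Active   State = "active"
	Rotating State = "rotating"
	Retired  State = "retired"
)

// Hypothetical sentinels, used only by this sketch.
var (
	ErrBadTransition = errors.New("rotation: transition not allowed")
	ErrBadWindow     = errors.New("rotation: OpenedAt must be strictly before ClosesAt")
)

// Window is the overlap interval during which both keys verify.
type Window struct{ OpenedAt, ClosesAt time.Time }

// NewWindow rejects zero-duration and inverted windows.
func NewWindow(openedAt, closesAt time.Time) (Window, error) {
	if !openedAt.Before(closesAt) {
		return Window{}, ErrBadWindow
	}
	return Window{OpenedAt: openedAt, ClosesAt: closesAt}, nil
}

// Open moves a key from active to rotating.
func Open(s State) (State, error) {
	if s != Active {
		return s, ErrBadTransition
	}
	return Rotating, nil
}

// Close resolves a rotating key: the incoming key is promoted to active,
// the outgoing key is retired.
func Close(s State, incoming bool) (State, error) {
	if s != Rotating {
		return s, ErrBadTransition
	}
	if incoming {
		return Active, nil
	}
	return Retired, nil
}

// Retire is the emergency path from active straight to retired.
func Retire(s State) (State, error) {
	if s != Active {
		return s, ErrBadTransition
	}
	return Retired, nil
}

func main() {
	now := time.Now()
	_, err := NewWindow(now, now) // zero-duration overlap is rejected
	fmt.Println(errors.Is(err, ErrBadWindow))

	s, _ := Open(Active)
	old, _ := Close(s, false)
	fmt.Println(s, old)
}
```

Because every function returns a new State and never mutates shared data, the full transition table can be unit-tested without a database.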
Fan-out narrative
RotationService.Open walks four steps in order. Steps 1 and 2 share the rotation transaction; steps 3 and 4 run AFTER that transaction has already committed, each in its own boundary. The atomicity contract is therefore scoped to the repo write: a crash, partition, or publisher error after step 2 leaves the rotation fully persisted but does NOT roll the repo write back. See the two-phase trade-off for the rationale.
1. Provider mint. KeyProvider.NewSigningKey is invoked to materialise the incoming key. The provider returns the public half ([32]byte) plus an opaque provider_handle_id; the private bytes never enter the core binary's address space. A compensating provider.Retire is queued for every later failure path through compensateRetire.
2. Atomic persist. KeyRepo.CreateKeyAndOpenRotation (internal/signing/migrations/repository_pg.go) runs the new-key INSERT, the old-key UPDATE to rotating, and the open transition-row INSERT inside a single pgx.Tx. This collapses the previously two-step NewSigningKey + OpenRotation flow into one transaction, which is the atomicity property the atomic-persist narrative below pins. The transaction commits before step 3 begins.
3. Publish. EventPublisher.PublishKeyRotated broadcasts the signing_key_rotated Domain Event. In production the publisher is OutboxEventPublisher (cmd/plexsphere-signer/event_publisher_outbox.go), which resolves the fan-out target set (the Nodes in the scope) and writes one outbox row per Node inside its OWN pgx.Tx (opened via pgxOutboxTxRunner.InTx in cmd/plexsphere-signer/outbox_publisher_wiring.go). That publisher tx is independent of the step-2 rotation tx, which has already committed. An empty target set is a clean no-op; a single row-write failure rolls the publisher tx back (so the fan-out is all-or-nothing within step 3) and surfaces the error to RotationService.Open, which logs at WARN level without unwinding the already-committed rotation state.
4. Audit. A single audit.Entry is appended with Relation = "rotate-open" by AuditMiddleware (internal/signing/audit_middleware.go), honouring the exactly-one-entry-per-RPC contract that the Sign and PublicKey RPCs already obey.
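The ordering of the steps above can be sketched against simplified ports. The interface shapes, the fake type, and the runOpen helper are hypothetical and exist only for illustration. The sketch also assumes that the minted handle is retired when persistence fails and that a publish failure is logged while Open still returns success; the audit step is assumed to live in wrapping middleware and is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// Simplified, hypothetical port shapes; the real interfaces may differ.
type KeyProvider interface {
	NewSigningKey(ctx context.Context, keyID string) (pub [32]byte, handleID string, err error)
	Retire(ctx context.Context, handleID string) error
}

type KeyRepo interface {
	CreateKeyAndOpenRotation(ctx context.Context, scope, keyID, handleID string, pub [32]byte) error
}

type EventPublisher interface {
	PublishKeyRotated(ctx context.Context, scope, keyID string, pub [32]byte) error
}

// Open runs mint, persist, and publish in order.
func Open(ctx context.Context, p KeyProvider, r KeyRepo, e EventPublisher, scope, keyID string) error {
	// Step 1: mint. Only the public half and an opaque handle come back.
	pub, handle, err := p.NewSigningKey(ctx, keyID)
	if err != nil {
		return fmt.Errorf("mint: %w", err)
	}
	// Step 2: atomic persist. On failure the transaction has rolled back,
	// so only the provider handle needs compensating.
	if err := r.CreateKeyAndOpenRotation(ctx, scope, keyID, handle, pub); err != nil {
		if rerr := p.Retire(ctx, handle); rerr != nil {
			log.Printf("WARN compensating retire failed: %v", rerr)
		}
		return fmt.Errorf("persist: %w", err)
	}
	// Step 3: publish, after the rotation has committed. A failure is
	// logged and the committed rotation is left in place.
	if err := e.PublishKeyRotated(ctx, scope, keyID, pub); err != nil {
		log.Printf("WARN fan-out failed, rotation stays persisted: %v", err)
	}
	return nil
}

// fake records the call order and can be told to fail a step.
type fake struct {
	calls                    []string
	failPersist, failPublish bool
}

func (f *fake) NewSigningKey(context.Context, string) ([32]byte, string, error) {
	f.calls = append(f.calls, "mint")
	return [32]byte{}, "handle-1", nil
}

func (f *fake) Retire(context.Context, string) error {
	f.calls = append(f.calls, "retire")
	return nil
}

func (f *fake) CreateKeyAndOpenRotation(context.Context, string, string, string, [32]byte) error {
	f.calls = append(f.calls, "persist")
	if f.failPersist {
		return errors.New("persist failed")
	}
	return nil
}

func (f *fake) PublishKeyRotated(context.Context, string, string, [32]byte) error {
	f.calls = append(f.calls, "publish")
	if f.failPublish {
		return errors.New("publish failed")
	}
	return nil
}

// runOpen wires one fake into all three ports.
func runOpen(f *fake) error {
	return Open(context.Background(), f, f, f, "platform", "key-2")
}

func main() {
	f := &fake{failPublish: true}
	fmt.Println(runOpen(f), f.calls)
}
```

Running it with a failing publisher shows the call order mint, persist, publish and a nil result, which mirrors the rule that a fan-out failure does not unwind the committed rotation.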
Two-phase fan-out trade-off
A literal single-transaction shape — repo write + outbox fan-out sharing one pgx.Tx — would require the EventPublisher port to take a caller-supplied pgx.Tx, which would (a) leak pgx into the internal/signing package boundary (today the package is pgx-free — the persistence adapter lives behind a port) and (b) couple the RotationService's transactional lifecycle to the publisher's implementation. The current shape keeps the boundary clean at the cost of a narrow crash window between step 2 and step 3 in which the rotation is persisted but no outbox rows have been written.
The crash window is small but not zero. RotationService.Resume at signer restart re-reads the open transition row and auto-closes it when the overlap window has already passed, but it deliberately does NOT republish the signing_key_rotated event — the contract is exactly-one-rotation = exactly-one-fan-out per logical rotation. Operators recovering from a crash in the step-2 / step-3 window therefore observe a persisted rotation with no consumer-side cache invalidation; the per-Node public-key caches drain naturally on the configured TTL, and a fresh OpenRotation on the same scope is refused with ErrRotationInProgress until the operator closes the half-fanned-out rotation. The runbook in ../how-to/platform/rotate-the-signing-key.md documents the recovery procedure.
On the consumer side, every plexd Node subscribes to the signed event bus and routes signing_key_rotated payloads into DomainSigningKeyResolver.Subscribe (internal/mesh/sse/signing_key_resolver.go). The resolver calls Reset(DomainID) per event, which evicts the per-Domain entry from the public-key cache so the next PublicKey call re-fetches the live (KeyID, [32]byte) pair through the SigningKeyResolver seam. Channel-close is a clean shutdown signal; context cancellation aborts cleanly without leaking the goroutine.
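The consumer-side behaviour can be illustrated with a small cache that is evicted per event. This is a sketch under assumptions: the Resolver type, the Fetcher function type, and the use of a plain string channel carrying the Domain ID are hypothetical simplifications of DomainSigningKeyResolver and the signed event bus payload.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// KeyEntry is the cached (KeyID, public key) pair for one Domain.
type KeyEntry struct {
	KeyID     string
	PublicKey [32]byte
}

// Fetcher is a hypothetical stand-in for the SigningKeyResolver seam.
type Fetcher func(ctx context.Context, domainID string) (KeyEntry, error)

// Resolver caches one entry per Domain until Reset evicts it.
type Resolver struct {
	mu    sync.Mutex
	cache map[string]KeyEntry
	fetch Fetcher
}

func NewResolver(fetch Fetcher) *Resolver {
	return &Resolver{cache: make(map[string]KeyEntry), fetch: fetch}
}

// PublicKey serves from the cache and falls back to the fetcher on a miss.
func (r *Resolver) PublicKey(ctx context.Context, domainID string) (KeyEntry, error) {
	r.mu.Lock()
	e, ok := r.cache[domainID]
	r.mu.Unlock()
	if ok {
		return e, nil
	}
	e, err := r.fetch(ctx, domainID)
	if err != nil {
		return KeyEntry{}, err
	}
	r.mu.Lock()
	r.cache[domainID] = e
	r.mu.Unlock()
	return e, nil
}

// Reset evicts the per-Domain entry so the next PublicKey call re-fetches.
func (r *Resolver) Reset(domainID string) {
	r.mu.Lock()
	delete(r.cache, domainID)
	r.mu.Unlock()
}

// Subscribe resets the cache per event. A closed channel is a clean
// shutdown; a cancelled context returns ctx.Err().
func (r *Resolver) Subscribe(ctx context.Context, events <-chan string) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case domainID, ok := <-events:
			if !ok {
				return nil
			}
			r.Reset(domainID)
		}
	}
}

// demo returns the key seen before and after one rotation event.
func demo() (stale, fresh string) {
	current := "key-old"
	r := NewResolver(func(context.Context, string) (KeyEntry, error) {
		return KeyEntry{KeyID: current}, nil
	})
	ctx := context.Background()
	_, _ = r.PublicKey(ctx, "d1") // warm the cache with key-old
	current = "key-new"           // the rotation happens upstream
	before, _ := r.PublicKey(ctx, "d1")

	events := make(chan string, 1)
	events <- "d1"
	close(events)
	_ = r.Subscribe(ctx, events)

	after, _ := r.PublicKey(ctx, "d1")
	return before.KeyID, after.KeyID
}

func main() {
	fmt.Println(demo()) // prints "key-old key-new"
}
```

The cached entry keeps serving the old key until the event arrives, which is why a Node that misses the event only converges once its cache TTL expires.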
CloseRotation runs the symmetric tail at end of overlap: it flips the persisted state (old → retired, new → active), stamps the transition row closed, best-effort-retires the old provider handle, and writes the matching rotate-close audit entry. The close path does not republish a signing_key_rotated event — the cache invalidation that landed at OpenRotation time already pinned the new public half on every Node.
OpenRotation / CloseRotation contract
The two control-plane RPCs are declared in api/proto/signing/v1/signing.proto and implemented by GRPCServer in internal/signing/grpc_server.go. Both delegate through the SignerPort interface so the production composition root wraps them with AuditMiddleware to emit exactly one audit.Entry per RPC.
OpenRotation
OpenRotation(scope, new_key_id) → (old_key_id, new_key_id, opened_at, closes_at)
| Field | Direction | Semantics |
|---|---|---|
| scope | request | Wire form platform or domain:<uuid>; parsed by ParseScope. Malformed input maps to InvalidArgument. |
| new_key_id | request | The caller-chosen identifier of the incoming key (URL-safe, ≤ 128 bytes). Validated by NewKeyID; malformed input maps to InvalidArgument. |
| old_key_id | response | The retiring key's ID, resolved server-side from the current active key. |
| opened_at, closes_at | response | The persisted overlap window. closes_at = opened_at + overlapWindow. |
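The request-side validation described in the table can be sketched as follows. This is an illustration, not the real ParseScope or NewKeyID: the Scope struct, the two sentinel errors, the hand-rolled canonical-UUID check, and the choice of the RFC 3986 unreserved characters as the URL-safe alphabet are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical sentinels; a server would map both to InvalidArgument.
var (
	ErrMalformedScope = errors.New("malformed scope")
	ErrMalformedKeyID = errors.New("malformed key id")
)

// Scope is either the platform scope or one Domain.
type Scope struct {
	Kind     string // "platform" or "domain"
	DomainID string // set only when Kind is "domain"
}

// ParseScope accepts the wire forms "platform" and "domain:<uuid>".
func ParseScope(s string) (Scope, error) {
	if s == "platform" {
		return Scope{Kind: "platform"}, nil
	}
	const prefix = "domain:"
	if !strings.HasPrefix(s, prefix) || !isCanonicalUUID(s[len(prefix):]) {
		return Scope{}, fmt.Errorf("%w: %q", ErrMalformedScope, s)
	}
	return Scope{Kind: "domain", DomainID: strings.ToLower(s[len(prefix):])}, nil
}

// isCanonicalUUID checks the 8-4-4-4-12 hexadecimal layout.
func isCanonicalUUID(s string) bool {
	if len(s) != 36 {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch i {
		case 8, 13, 18, 23:
			if c != '-' {
				return false
			}
		default:
			isHex := (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
			if !isHex {
				return false
			}
		}
	}
	return true
}

// NewKeyID accepts 1 to 128 bytes from the assumed URL-safe alphabet.
func NewKeyID(s string) (string, error) {
	if len(s) == 0 || len(s) > 128 {
		return "", ErrMalformedKeyID
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		ok := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
			(c >= '0' && c <= '9') || strings.IndexByte("-_.~", c) >= 0
		if !ok {
			return "", ErrMalformedKeyID
		}
	}
	return s, nil
}

func main() {
	sc, err := ParseScope("domain:123e4567-e89b-12d3-a456-426614174000")
	fmt.Println(sc.Kind, err)
	_, err = NewKeyID("key/with/slash")
	fmt.Println(errors.Is(err, ErrMalformedKeyID))
}
```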
CloseRotation
CloseRotation(scope, old_key_id, new_key_id) → ()
| Field | Direction | Semantics |
|---|---|---|
| scope | request | Wire form platform or domain:<uuid>; parsed by ParseScope. Malformed input maps to InvalidArgument. |
| old_key_id, new_key_id | request | The matching pair from the open transition row. Validated by NewKeyID; malformed input maps to InvalidArgument. |
Stable status-detail strings
Every error returned by the two RPCs is mapped to a gRPC status with a byte-for-byte stable detail string by mapSignErrorToStatus. Callers branch on the detail; the prose is the contract. The strings live as package-level constants at the top of internal/signing/grpc_server.go.
| Domain sentinel | gRPC code | Detail string |
|---|---|---|
| ErrRotationInProgress | FailedPrecondition | signing: rotation in progress |
| ErrKeyRetired | FailedPrecondition | signing: key retired |
| ErrScopeMismatch | NotFound | signing: scope mismatch |
| ErrKeyNotFound | NotFound | signing: key not found |
| ErrInvariant (and every error wrapping it, including ErrScopeNotPermittedForProfile) | InvalidArgument | signing: invariant violation |
| ErrProviderUnavailable | Unavailable | signing: key provider unavailable |
| ErrClientIdentityDenied | PermissionDenied | signing: client identity denied |
| any other unmapped error | Internal | signing: internal error |
Paraphrasing these strings between releases is a documented contract break: the operator runbook and the plexctl signing exit-code catalogue grep for them verbatim.
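The table above can be read as an ordered lookup that relies on errors.Is, so that wrapped errors inherit the mapping of the sentinel they wrap. The sketch below is not mapSignErrorToStatus itself: the mapToStatus name, the Status struct, the placeholder sentinel messages, and the use of plain strings in place of real gRPC code values are assumptions made to keep the example free of external dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel names come from the table; their messages are placeholders.
var (
	ErrRotationInProgress   = errors.New("rotation in progress")
	ErrKeyRetired           = errors.New("key retired")
	ErrScopeMismatch        = errors.New("scope mismatch")
	ErrKeyNotFound          = errors.New("key not found")
	ErrInvariant            = errors.New("invariant violation")
	ErrProviderUnavailable  = errors.New("provider unavailable")
	ErrClientIdentityDenied = errors.New("client identity denied")

	// Wraps ErrInvariant, so it inherits the InvalidArgument mapping.
	ErrScopeNotPermittedForProfile = fmt.Errorf("scope not permitted for profile: %w", ErrInvariant)
)

// Status pairs a gRPC code name with its stable detail string.
type Status struct{ Code, Detail string }

var statusTable = []struct {
	sentinel error
	status   Status
}{
	{ErrRotationInProgress, Status{"FailedPrecondition", "signing: rotation in progress"}},
	{ErrKeyRetired, Status{"FailedPrecondition", "signing: key retired"}},
	{ErrScopeMismatch, Status{"NotFound", "signing: scope mismatch"}},
	{ErrKeyNotFound, Status{"NotFound", "signing: key not found"}},
	{ErrInvariant, Status{"InvalidArgument", "signing: invariant violation"}},
	{ErrProviderUnavailable, Status{"Unavailable", "signing: key provider unavailable"}},
	{ErrClientIdentityDenied, Status{"PermissionDenied", "signing: client identity denied"}},
}

// mapToStatus returns the first matching row, or Internal when none match.
func mapToStatus(err error) Status {
	for _, row := range statusTable {
		if errors.Is(err, row.sentinel) {
			return row.status
		}
	}
	return Status{"Internal", "signing: internal error"}
}

func main() {
	fmt.Println(mapToStatus(ErrScopeNotPermittedForProfile))
	fmt.Println(mapToStatus(errors.New("disk full")))
}
```

Keeping the detail strings in a single table makes a byte-for-byte regression test straightforward: a test can iterate the rows and compare each string against a golden file.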
--overlap-window flag
The default rotation overlap duration is 24 hours, pinned by defaultOverlapWindow in internal/signing/rotation_service.go. Operators override it at signer bring-up through a single CLI flag on the plexsphere-signer binary:
```shell
plexsphere-signer --overlap-window 12h
```

or via the environment alias:

```shell
PLEXSPHERE_SIGNER_OVERLAP_WINDOW=12h plexsphere-signer
```

Both surfaces are parsed in cmd/plexsphere-signer/main.go. A zero or negative value is rejected at startup with --overlap-window must be positive; the signer never boots advertising a degenerate overlap. When the flag is omitted the signing.defaultOverlapWindow constant is used directly: the bring-up path threads WithOverlapWindow only when the operator set the flag explicitly, so the package-level default remains the single source of truth.
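A minimal sketch of this parsing, using only the standard flag and time packages, is shown below. The parseOverlapWindow helper is hypothetical, and two behaviours are assumptions rather than documented facts: the flag takes precedence over the environment alias, and a value supplied through the alias counts as explicitly set.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
	"time"
)

// defaultOverlapWindow mirrors the documented 24-hour default.
const defaultOverlapWindow = 24 * time.Hour

// parseOverlapWindow returns the window and whether the operator set it.
// The caller would apply an override only when explicit is true.
func parseOverlapWindow(args []string, getenv func(string) string) (window time.Duration, explicit bool, err error) {
	fs := flag.NewFlagSet("plexsphere-signer", flag.ContinueOnError)
	raw := fs.String("overlap-window", "", "rotation overlap duration, for example 12h")
	if err := fs.Parse(args); err != nil {
		return 0, false, err
	}
	value := *raw
	if value == "" {
		// Assumed precedence: the flag wins, the env alias is the fallback.
		value = getenv("PLEXSPHERE_SIGNER_OVERLAP_WINDOW")
	}
	if value == "" {
		return defaultOverlapWindow, false, nil
	}
	d, err := time.ParseDuration(value)
	if err != nil {
		return 0, false, fmt.Errorf("--overlap-window: %w", err)
	}
	if d <= 0 {
		return 0, false, errors.New("--overlap-window must be positive")
	}
	return d, true, nil
}

func main() {
	window, explicit, err := parseOverlapWindow(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("overlap window %s (explicit=%t)\n", window, explicit)
}
```

Passing getenv as a parameter keeps the function testable without mutating the process environment.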
Atomic-persist narrative
Earlier iterations of RotationService.Open ran the new-key INSERT and the open transition-row INSERT as two separate repository calls. A crash, network partition, or compensating-retire failure between the two left the database in an inconsistent state: the new signing_key row existed with state active while no transition row pointed at it. Subsequent OpenRotation attempts then collided with the orphan row at the unique-state predicate, masking the partial failure as ErrRotationInProgress and forcing a manual operator clean-up.
KeyRepo.CreateKeyAndOpenRotation (declared in internal/signing/ports.go and implemented in internal/signing/migrations/repository_pg.go) collapses the two INSERTs and the old-key UPDATE into a single pgx.Tx via the existing runInTx helper. The repository classifies errors through the existing per-statement classifier so the caller still sees ErrRotationInProgress on a concurrent open, but the failure now rolls every row back together — there is no observable half-open state.
The compensating-retire flow on the service side was simplified in lockstep: RotationService.Open calls compensateRetire only on the provider-side failures (the NewSigningKey mint and the post-persist provider.Retire path); a CreateKeyAndOpenRotation rollback is self-contained because the transaction either lands every row or none. The end-to-end pin lives in tests/integration/signer_rotation_atomicity_test.go, which exercises the atomic compound against the real testcontainers Postgres + sqlc adapter: a happy-path Open asserts exactly one transition row; a concurrent-Open race triggers ErrRotationInProgress with no orphan rows; and a direct repo.CreateKeyAndOpenRotation call against a pre-seeded colliding key trips the (scope, key_id) unique index and asserts every row rolled back (the colliding row's state, the old key's state, and the absence of any transition row).
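The all-or-nothing property can be illustrated with an in-memory stand-in for the transaction. This is not the pgx-backed repository: the store type and its inTx helper are hypothetical, the store models a single scope, and the sketch assumes the incoming key is inserted in the rotating state. The point is only that a late failure, such as the unique-index collision, also undoes the earlier old-key UPDATE.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrRotationInProgress = errors.New("rotation in progress")
	ErrKeyExists          = errors.New("key id already exists") // hypothetical sentinel
)

// store is an in-memory stand-in for the signing_key and transition tables.
type store struct {
	keys        map[string]string // key_id -> state
	transitions []string          // "old->new" per open transition
}

func (s *store) clone() *store {
	c := &store{keys: make(map[string]string, len(s.keys))}
	for id, state := range s.keys {
		c.keys[id] = state
	}
	c.transitions = append([]string(nil), s.transitions...)
	return c
}

// inTx applies fn to a copy and swaps it in only on success, which mimics
// the commit-or-rollback behaviour of a database transaction.
func (s *store) inTx(fn func(tx *store) error) error {
	tx := s.clone()
	if err := fn(tx); err != nil {
		return err // the copy is discarded: nothing is observable
	}
	*s = *tx
	return nil
}

// CreateKeyAndOpenRotation performs all three writes as one unit.
func (s *store) CreateKeyAndOpenRotation(oldID, newID string) error {
	return s.inTx(func(tx *store) error {
		// Old-key UPDATE to rotating; only an active key may start one.
		if tx.keys[oldID] != "active" {
			return ErrRotationInProgress
		}
		tx.keys[oldID] = "rotating"
		// New-key INSERT, guarded by a unique index on the key ID.
		if _, dup := tx.keys[newID]; dup {
			return fmt.Errorf("%w: %s", ErrKeyExists, newID)
		}
		tx.keys[newID] = "rotating"
		// Open transition-row INSERT.
		tx.transitions = append(tx.transitions, oldID+"->"+newID)
		return nil
	})
}

func main() {
	s := &store{keys: map[string]string{"k1": "active", "k2": "retired"}}
	err := s.CreateKeyAndOpenRotation("k1", "k2") // collides with k2
	fmt.Println(errors.Is(err, ErrKeyExists), s.keys["k1"], len(s.transitions))
}
```

After the colliding call, the old key is still active and no transition row exists, which is the same shape the integration test asserts against the real database.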
See also
- ./signing/index.md: Signing Service entry point and the broader ubiquitous-language table.
- ./signing/deployment.md: deployment-scoping matrix, the three-state rotation lifecycle, the boot-time Resume reconciliation, and the valid_until retention split.
- ./signing/operations.md: gRPC surface, the per-RPC audit contract, and the operator debugging workflow.
- ./mesh/sse.md: the signed event bus the signing_key_rotated event rides on; see the event-taxonomy table for the wire-level row.
- ../how-to/platform/rotate-the-signing-key.md: operator runbook for opening and closing a rotation through plexctl signing rotate and plexctl signing close.
- ../../tests/integration/signer_rotation_atomicity_test.go: fault-injection pin for the atomic CreateKeyAndOpenRotation repository call.
- ../../tests/integration/signing_rotation_sse_fanout_test.go: end-to-end pin for the per-Node outbox fan-out and the RotationService.Resume no-republish path.
- ../../tests/e2e/security/signer-rotation-sse-fanout/chainsaw-test.yaml: Chainsaw E2E pinning the fan-out across a stub Node fleet under NATS + JetStream.
- ../../tests/e2e/cli/signing-rotate/chainsaw-test.yaml: Chainsaw E2E driving the plexctl signing rotate / plexctl signing close exit-0 path against the kind-loaded signer.