Key Rotation Workflow

This document is the authoritative bounded-context reference for the key-rotation workflow that ships alongside the peers sub-context under ../../../internal/mesh/peers/. It covers the ubiquitous language, the plexsphere.peer_key_rotation schema, the pending-to-completed-or-superseded lifecycle, the two-channel dispatch model (the SSE rotate_keys command plus the heartbeat hint), the two new Manager commands RequestRotation and CompleteRotation, the rotation-ordering invariants CompleteRotation upholds, the rotate_keys / peer_key_rotated event payloads, the refusal-code table for the two HTTP surfaces, and the rationale for the deliberate absence of a background sweeper.

Key rotation is the control-plane workflow that lets an operator replace a Node's WireGuard mesh key — because the key is aged or possibly compromised — without re-enrolling the Node from scratch. The operator triggers a rotation; the control plane records a pending rotation and dispatches an imperative rotate_keys command to the Node; the Node generates a fresh Curve25519 keypair and submits the new public key; the control plane records the new key, re-issues the Node's pairwise PSK, and notifies every mesh peer via a peer_key_rotated event. The rotation is one aggregate with a short lifecycle, two Manager commands, two HTTP surfaces, and two new outbox event types; no existing boundary is rewritten.

Ubiquitous language

The terms below travel verbatim across the Go code, the SQL schema, the outbox event payloads, the structured-log attributes, and any operator-facing tooling. Keep the names stable.

| Term | Definition |
| --- | --- |
| Rotation request | The operator-initiated start of a rotation. Recorded by Manager.RequestRotation: it resolves the live Peer for a (Domain, Node) pair, inserts a pending plexsphere.peer_key_rotation row, and appends exactly one rotate_keys outbox event, the imperative command dispatched to the Node. Idempotent: a re-request against a Node that already has a pending rotation returns the existing row and emits no second command. |
| Rotation completion | The Node-initiated finish of a rotation. Recorded by Manager.CompleteRotation after the Node has generated a fresh Curve25519 keypair and submitted the new public key via POST /v1/keys/rotate. In one transaction it overwrites plexsphere.nodes.public_key, retires the prior PSK, re-issues a fresh wrapped PSK, appends one peer_key_rotated event, and flips the rotation row to completed. |
| KeyRotation aggregate | The aggregate root for one key-rotation attempt a Node makes. One row in plexsphere.peer_key_rotation, addressed by a surrogate id (UUIDv7). At most one row per Node may be in pending state at a time; the partial unique index peer_key_rotation_pending_uq is the durable name of that invariant. |
| state | The rotation lifecycle discriminator. pending marks a rotation requested but not yet completed; completed marks one whose new key the Node generated and submitted; superseded marks a pending rotation a later rotation request pre-empted before it completed. The closed set {pending, completed, superseded} is pinned by a SQL CHECK and mirrored verbatim by the peers.RotationState value object. |
| previous_public_key / new_public_key | The Node's WireGuard public keys before and after the rotation, each exactly 32 bytes (the Curve25519 public-key size per RFC 7748 §5). Both columns are NULLable: new_public_key is unknown until completion, and previous_public_key is absent for a rotation requested before the Node ever presented a key. |
| rotate_keys | The imperative outbox event the control plane dispatches to the rotating Node's per-node SSE subject leaf. Unlike the five past-tense peer notifications, rotate_keys is a command: it tells the Node to generate a fresh Curve25519 keypair and call POST /v1/keys/rotate. |
| peer_key_rotated | The past-tense outbox event every peer of the rotated Node observes once a rotation completes, so each peer can update its WireGuard configuration with the rotated Node's new public key and PSK reference. |
| Heartbeat hint | The rotate_keys boolean on the heartbeat response. It is true exactly while a pending peer_key_rotation row exists for the heartbeating Node. It is a second, poll-based dispatch channel, so a Node that reconnected rather than holding a live SSE stream still learns within one heartbeat interval that it must rotate. It is a coordination signal, never a security control. |
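
The closed state set can be mirrored in Go roughly as follows. This is a minimal sketch, not the actual peers.RotationState definition: the constant names, the ParseRotationState helper, and the ErrUnknownRotationState sentinel are assumptions made for illustration; only the three string literals come from the schema.

```go
package main

import (
	"errors"
	"fmt"
)

// RotationState mirrors the SQL CHECK's closed set. The type name follows the
// document; the constants and helpers below are illustrative assumptions.
type RotationState string

const (
	RotationPending    RotationState = "pending"
	RotationCompleted  RotationState = "completed"
	RotationSuperseded RotationState = "superseded"
)

// ErrUnknownRotationState is a hypothetical sentinel for values outside the set.
var ErrUnknownRotationState = errors.New("unknown rotation state")

// ParseRotationState admits only the three literals the schema allows.
func ParseRotationState(s string) (RotationState, error) {
	switch RotationState(s) {
	case RotationPending, RotationCompleted, RotationSuperseded:
		return RotationState(s), nil
	}
	return "", fmt.Errorf("%w: %q", ErrUnknownRotationState, s)
}

// Terminal reports whether the state can never return to pending.
func (s RotationState) Terminal() bool {
	return s == RotationCompleted || s == RotationSuperseded
}

func main() {
	for _, raw := range []string{"pending", "completed", "expired"} {
		st, err := ParseRotationState(raw)
		fmt.Println(raw, st, err)
	}
}
```

Parsing at the boundary keeps an unknown value from entering the domain, which is the same job the SQL CHECK does at the database.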

Schema

The key-rotation workflow owns one new table in the plexsphere schema. The authoritative source is the migration at ../../../internal/platform/db/migrations/0032_peer_key_rotation.sql; the column meanings below paraphrase that file's DECISION blocks. The migration's Down block deliberately refuses the downgrade with SQLSTATE 0A000 because the rotation history is the forensic evidence chain that links every Node's WireGuard public-key transition to the wall-clock window the rotation spanned.

plexsphere.peer_key_rotation — KeyRotation aggregate

| Column | Type | Meaning |
| --- | --- | --- |
| id | uuid PRIMARY KEY | Surrogate aggregate id (UUIDv7) the application layer mints per rotation attempt. Stamped onto the outbox row's aggregate_id. |
| domain_id | uuid NOT NULL | FK to plexsphere.domains(id) ON DELETE RESTRICT. Carries the Domain isolation boundary. |
| node_id | uuid NOT NULL | FK to plexsphere.nodes(id) ON DELETE RESTRICT. The Node whose mesh key is rotating; also the column the partial unique index keys on. |
| peer_id | uuid NOT NULL | FK to plexsphere.peer(id) ON DELETE RESTRICT. The live Peer aggregate the rotation is bound to. |
| state | text NOT NULL CHECK (state IN ('pending', 'completed', 'superseded')) | The lifecycle discriminator. The SQL CHECK pins the closed set so a mis-wired INSERT fails at the database rather than promoting an unknown value into the domain. |
| requested_at | timestamptz NOT NULL DEFAULT now() | Wall-clock instant the rotation was requested. |
| previous_public_key | bytea NULL CHECK (previous_public_key IS NULL OR octet_length(previous_public_key) = 32) | The Node's Curve25519 public key before the rotation. NULL when the caller did not supply it. |
| new_public_key | bytea NULL CHECK (new_public_key IS NULL OR octet_length(new_public_key) = 32) | The Node's Curve25519 public key after the rotation. NULL while the row is pending; the completion flow stamps it. |
| completed_at | timestamptz NULL | Wall-clock instant the rotation completed. NULL marks the row in flight; the completion flow stamps it alongside flipping state to completed. |

Beyond the column-level CHECKs listed above, the table carries one named CHECK constraint and one partial unique index:

  • peer_key_rotation_completed_after_requested CHECK (completed_at IS NULL OR completed_at >= requested_at) — time-monotonic completion, mirroring the psk_retired_after_issued and peer_relay_assignment_retired_after_assigned shapes.
  • peer_key_rotation_pending_uq UNIQUE (node_id) WHERE state = 'pending' — the partial unique index that is the durable name of the one-pending-rotation-per-Node invariant. It mirrors psk_live_peer_uq; completed and superseded rows pile up alongside the pending row in the same table as the rotation history.

requested_by is deliberately omitted from the table. The acting operator subject is recorded by the Platform Audit Log on the trigger; duplicating it onto the aggregate row would create a second source of truth for the same fact.
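
The constraints above can also be restated as an application-side check, which is handy in unit tests that never touch the database. The sketch below is an assumption-laden illustration: KeyRotationRow and its Validate method are hypothetical names, not types from the peers package, and the database CHECKs remain the authoritative enforcement.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// KeyRotationRow is a hypothetical in-memory mirror of one
// plexsphere.peer_key_rotation row; the field names are assumptions.
type KeyRotationRow struct {
	State             string
	RequestedAt       time.Time
	PreviousPublicKey []byte     // nil models SQL NULL
	NewPublicKey      []byte     // nil models SQL NULL
	CompletedAt       *time.Time // nil models SQL NULL
}

// Validate re-states the table's CHECK constraints in application code.
func (r KeyRotationRow) Validate() error {
	switch r.State {
	case "pending", "completed", "superseded":
	default:
		return fmt.Errorf("state %q is outside the closed set", r.State)
	}
	for name, key := range map[string][]byte{
		"previous_public_key": r.PreviousPublicKey,
		"new_public_key":      r.NewPublicKey,
	} {
		if key != nil && len(key) != 32 {
			return fmt.Errorf("%s must be NULL or exactly 32 bytes, got %d", name, len(key))
		}
	}
	if r.CompletedAt != nil && r.CompletedAt.Before(r.RequestedAt) {
		return errors.New("completed_at must not precede requested_at")
	}
	return nil
}

func main() {
	now := time.Now().UTC()
	earlier := now.Add(-time.Minute)
	row := KeyRotationRow{
		State:        "completed",
		RequestedAt:  now,
		NewPublicKey: make([]byte, 32),
		CompletedAt:  &earlier, // violates the time-monotonic completion rule
	}
	fmt.Println(row.Validate())
}
```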

Rotation lifecycle

A peer_key_rotation row walks a short three-state lifecycle. The state column is the discriminator; pending is the only state the partial unique index constrains.

```text
   (no row)
        │
        │  Manager.RequestRotation
        ▼
   pending  (completed_at IS NULL, new_public_key IS NULL)
        │
        ├─ Manager.CompleteRotation ─────────────► completed
        │     (stamps new_public_key + completed_at,
        │      flips state in the same transaction)
        │
        └─ Manager.RequestRotation (a fresh request
           pre-empts the in-flight one) ─────────► superseded
```

  • pending → completed is the happy path: the Node received the rotate_keys command, generated its keypair, and called back. CompleteRotation flips the row.
  • pending → superseded is the pre-emption path: a later rotation request arrives while a rotation is still in flight. The prior pending row is flipped to superseded so the partial unique index admits the fresh pending row.
  • A pending row that is never completed and never superseded simply stays pending. There is no timer that retires it — see No-sweeper rationale below.

completed and superseded are terminal: once a row leaves pending it never returns. A Node that needs another rotation gets a fresh row with a fresh id.
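
The lifecycle is small enough to express as a transition table. The following is a hedged sketch only: CanTransition and the allowed map are hypothetical helpers, not part of the peers package, and they encode nothing beyond the two edges described above.

```go
package main

import "fmt"

// allowed lists the only transitions the lifecycle admits. completed and
// superseded have no outgoing edges, which is what makes them terminal.
var allowed = map[string][]string{
	"pending": {"completed", "superseded"},
}

// CanTransition is a hypothetical guard over the transition table.
func CanTransition(from, to string) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition("pending", "completed"))    // true
	fmt.Println(CanTransition("pending", "superseded"))   // true
	fmt.Println(CanTransition("completed", "pending"))    // false
	fmt.Println(CanTransition("superseded", "completed")) // false
}
```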

Two-channel dispatch model

The control plane tells a Node to rotate over two independent channels. Either one alone is sufficient; together they cover the reconnect gap.

  1. The SSE rotate_keys event. RequestRotation appends a rotate_keys outbox event in the same transaction that inserts the pending row. The event rides the existing outbox → SSE relay → publisher path and lands on the rotating Node's per-node subject leaf. A Node holding a live SSE stream learns of the rotation immediately. This is the primary channel.
  2. The heartbeat hint. POST /v1/nodes/{id}/heartbeat sets HeartbeatResponse.rotate_keys to true whenever a pending peer_key_rotation row exists for the heartbeating Node. A Node that reconnected — rather than holding a continuous SSE stream — and therefore missed the live rotate_keys frame still re-learns the rotation intent within one heartbeat interval.

The heartbeat hint is a coordination signal, not a security control. Its read port, PendingRotationLookup, is deliberately kept out of the fail-closed 501 gate: when the port is unwired the heartbeat handler returns rotate_keys: false and neither errors nor returns 501. Suppressing the hint degrades a convenience, not a guarantee — the durable pending row plus the SSE Last-Event-ID replay path still carry the rotation intent.

The SSE Last-Event-ID replay path is the third leg: a Node that reconnects its SSE stream and replays from its last sequence observes the rotate_keys frame again. The durable pending row, the heartbeat hint, and SSE replay together make a background sweeper redundant.
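
The degrade-to-false behaviour of the heartbeat hint might look like the sketch below. Everything here is illustrative: the PendingRotationLookup method set, the rotateKeysHint helper, and the staticLookup test double are assumptions, the context parameter a real port would likely take is omitted for brevity, and treating a lookup error as false is an additional assumption (the text only specifies the unwired-port case).

```go
package main

import "fmt"

// PendingRotationLookup is named in the text; this method set is an assumption.
type PendingRotationLookup interface {
	HasPendingRotation(nodeID string) (bool, error)
}

// HeartbeatResponse shows only the hint field discussed here.
type HeartbeatResponse struct {
	RotateKeys bool `json:"rotate_keys"`
}

// rotateKeysHint never fails the heartbeat. An unwired port yields false
// rather than an error or a 501; a lookup error is also mapped to false,
// which is an assumption of this sketch.
func rotateKeysHint(lookup PendingRotationLookup, nodeID string) bool {
	if lookup == nil {
		return false
	}
	pending, err := lookup.HasPendingRotation(nodeID)
	if err != nil {
		return false
	}
	return pending
}

// staticLookup is a test double keyed by node id.
type staticLookup map[string]bool

func (s staticLookup) HasPendingRotation(nodeID string) (bool, error) {
	return s[nodeID], nil
}

func main() {
	wired := staticLookup{"node-a": true}
	fmt.Println(HeartbeatResponse{RotateKeys: rotateKeysHint(nil, "node-a")})
	fmt.Println(HeartbeatResponse{RotateKeys: rotateKeysHint(wired, "node-a")})
	fmt.Println(HeartbeatResponse{RotateKeys: rotateKeysHint(wired, "node-b")})
}
```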

Manager commands

The workflow adds two commands to the existing peers.Manager, both in ../../../internal/mesh/peers/rotation.go. Each follows the established Cmd / Result struct-pair shape and runs its whole body inside one Store.InTx callback so every write lands atomically.

RequestRotation

Manager.RequestRotation(ctx, RequestRotationCmd) → (RequestRotationResult, error). Inside one Store.InTx it:

  1. Resolves the live Peer for the (DomainID, NodeID) pair.
  2. Probes for an existing pending rotation. If one exists, it returns that row with AlreadyPending true and emits no second event — the command is idempotent, so an operator (or a retrying caller) cannot stack duplicate pending rotations or re-dispatch the command.
  3. Otherwise inserts a fresh pending peer_key_rotation row and appends exactly one rotate_keys outbox event.

An unknown (Domain, Node) pair — no live Peer — returns ErrPeerNotFound and writes nothing.
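
The three steps and the not-found case can be sketched against an in-memory stand-in. This is not the real Manager: memStore, nodeKey, and the counter-based rotation ids are hypothetical simplifications (the real command mints a UUIDv7 and runs inside Store.InTx); only the AlreadyPending flag, ErrPeerNotFound, and the rotate_keys event type come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrPeerNotFound = errors.New("peer not found")

type nodeKey struct{ DomainID, NodeID string }

// memStore is a hypothetical in-memory stand-in for the transactional Store.
type memStore struct {
	livePeers map[nodeKey]string // (domain, node) -> peer id
	pending   map[string]string  // node id -> pending rotation id
	outbox    []string           // appended event types
	nextID    int
}

type RequestRotationResult struct {
	RotationID     string
	AlreadyPending bool
}

// RequestRotation follows the steps above: resolve, probe, then insert and append.
func (s *memStore) RequestRotation(domainID, nodeID string) (RequestRotationResult, error) {
	if _, ok := s.livePeers[nodeKey{domainID, nodeID}]; !ok {
		return RequestRotationResult{}, ErrPeerNotFound // nothing is written
	}
	if id, ok := s.pending[nodeID]; ok {
		// Idempotent path: return the existing row, append no second event.
		return RequestRotationResult{RotationID: id, AlreadyPending: true}, nil
	}
	s.nextID++
	id := fmt.Sprintf("rotation-%d", s.nextID)
	s.pending[nodeID] = id
	s.outbox = append(s.outbox, "rotate_keys")
	return RequestRotationResult{RotationID: id}, nil
}

func main() {
	s := &memStore{
		livePeers: map[nodeKey]string{{"dom-1", "node-a"}: "peer-1"},
		pending:   map[string]string{},
	}
	first, _ := s.RequestRotation("dom-1", "node-a")
	second, _ := s.RequestRotation("dom-1", "node-a")
	_, err := s.RequestRotation("dom-1", "node-x")
	fmt.Println(first, second, len(s.outbox), err)
}
```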

CompleteRotation

Manager.CompleteRotation(ctx, CompleteRotationCmd) → (CompleteRotationResult, error). The Node has, by this point, received the rotate_keys command, generated a fresh Curve25519 keypair, and submitted the new public key. CompleteRotation validates that key — see Refusal codes — then in one Store.InTx:

  1. Overwrites plexsphere.nodes.public_key with the new key.
  2. Retires the Peer's prior live PSK.
  3. Runs the AssignPSK wrap cycle verbatim — resolves the active per-Domain wrap key, wraps the fresh PSK plaintext, fails closed on a wrapper-version / active-row divergence (the same ErrNoActiveWrapKey posture as AssignPSK), and inserts the re-issued plexsphere.psk row.
  4. Appends exactly one peer_key_rotated outbox event — the last write.
  5. Flips the peer_key_rotation row to completed, stamping new_public_key and completed_at.

The caller-supplied new-PSK plaintext slice is zeroed by a deferred destroyPlaintext scrub on every return path — success, every validation rejection, and the wrap-key fail-closed path — so the secret never lingers in process memory after the command returns.
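
The deferred-scrub pattern can be illustrated in a few lines. The destroyPlaintext name comes from the text, but its body here is an assumption about what such a scrub does, and completeWithScrub is a hypothetical stand-in for the command body, not the real CompleteRotation.

```go
package main

import (
	"errors"
	"fmt"
)

// destroyPlaintext overwrites the secret in place. Because a slice shares its
// backing array with the caller, the caller's copy is zeroed too.
func destroyPlaintext(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

// completeWithScrub is a hypothetical stand-in for the command body: the
// deferred scrub runs on success and on every early return alike.
func completeWithScrub(pskPlaintext []byte, fail bool) error {
	defer destroyPlaintext(pskPlaintext)
	if fail {
		return errors.New("validation rejected")
	}
	// ... wrap the plaintext and persist the wrapped row here ...
	return nil
}

func main() {
	ok := []byte("0123456789abcdef0123456789abcdef")
	bad := []byte("0123456789abcdef0123456789abcdef")
	fmt.Println(completeWithScrub(ok, false), ok[0], ok[31])
	fmt.Println(completeWithScrub(bad, true), bad[0], bad[31])
}
```

Registering the defer as the first statement is what makes the guarantee hold for rejections that happen before any database work starts.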

Rotation-ordering invariants

CompleteRotation upholds a set of ordering invariants that together guarantee a downstream consumer never observes a partial rotation. They are exercised by the unit tests in ../../../internal/mesh/peers/rotation_test.go and the integration tests under ../../../tests/integration/.

  • Atomicity. All five writes — nodes.public_key, PSK retire, PSK insert, peer_key_rotated append, rotation-row flip — land in a single Store.InTx callback. A failure on any step rolls back every prior step; nodes.public_key is never left updated without the matching PSK and event.
  • Event-last. The peer_key_rotated event is appended after every aggregate state change. A consumer observing the event is therefore guaranteed every state change it describes has already landed — the event is never a promise of a write still to come.
  • Fail-closed on wrap-key drift. A divergence between the wrapper's bound version and the active wrap-key row rolls the transaction back and leaves nodes.public_key unchanged, wrapping ErrNoActiveWrapKey. A rotation never persists a PSK row whose wrap_key_version FK references a version the bytes were not sealed under.
  • Plaintext zeroed on every path. The deferred destroyPlaintext scrub zeroes the caller's new-PSK plaintext slice on the success path and on every error path.
  • Pre-write key validation. A submitted key that is the wrong length, the all-zero degenerate value, or byte-identical to the Node's current key is rejected before any database write — see the refusal table below.
  • Relay assignment untouched. CompleteRotation issues no peer_relay_assignment query. A rotation against a Peer holding a live relay assignment leaves that row byte-for-byte unchanged — bridge_node_id, fallback_endpoint, fallback_endpoint_port, assigned_at, and retired_at all survive. Relay-fallback assignment is orthogonal to key rotation; see ./relay-fallback.md.
  • Idempotent completion. A CompleteRotation retry with the same key against an already-completed rotation returns the prior result and appends no second peer_key_rotated event.
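
The pre-write key validation invariant lends itself to a small pure function. The sketch below is an illustration under stated assumptions: validateNewPublicKey and ErrPublicKeyInvalid are hypothetical names, while ErrRotationPublicKeyUnchanged is the sentinel the refusal table attributes to the peers package. The three checks are the ones listed above: wrong length, all-zero, and byte-identical to the current key.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

const curve25519KeySize = 32

var (
	// ErrRotationPublicKeyUnchanged is named in the text.
	ErrRotationPublicKeyUnchanged = errors.New("rotation public key unchanged")
	// ErrPublicKeyInvalid is a hypothetical sentinel for this sketch.
	ErrPublicKeyInvalid = errors.New("rotation public key invalid")
)

// validateNewPublicKey runs before any database write.
func validateNewPublicKey(current, submitted []byte) error {
	if len(submitted) != curve25519KeySize {
		return fmt.Errorf("%w: want %d bytes, got %d",
			ErrPublicKeyInvalid, curve25519KeySize, len(submitted))
	}
	if bytes.Equal(submitted, make([]byte, curve25519KeySize)) {
		return fmt.Errorf("%w: all-zero key", ErrPublicKeyInvalid)
	}
	if bytes.Equal(submitted, current) {
		return ErrRotationPublicKeyUnchanged
	}
	return nil
}

func main() {
	current := bytes.Repeat([]byte{0x11}, curve25519KeySize)
	fresh := bytes.Repeat([]byte{0x22}, curve25519KeySize)
	fmt.Println(validateNewPublicKey(current, fresh))
	fmt.Println(validateNewPublicKey(current, current))
	fmt.Println(validateNewPublicKey(current, make([]byte, curve25519KeySize)))
	fmt.Println(validateNewPublicKey(current, fresh[:16]))
}
```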

Event types

The workflow adds two outbox event types to the peers sub-context's closed set, taking it from four to six. Both are constructed in ../../../internal/identity/tenancy/events/events.go and admitted by the AST gate at ../../../tests/workspace/peers_event_type_set_test.go. The bare snake_case literals are the wire contract; downstream relay and JetStream consumers switch on them verbatim.

rotate_keys

Constructed by events.NewRotateKeys; constant TypeRotateKeys = "rotate_keys". Appended by Manager.RequestRotation. Unlike the five past-tense peer notifications, rotate_keys is an imperative command dispatched to the addressed Node's per-node subject leaf — it instructs the Node to generate a fresh Curve25519 keypair and call POST /v1/keys/rotate. The payload carries only the (Peer, Domain, Node) ID triple; the command needs no further parameters because the Node owns its own keypair generation.

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>"
}
```

peer_key_rotated

Constructed by events.NewPeerKeyRotated; constant TypePeerKeyRotated = "peer_key_rotated". Appended by Manager.CompleteRotation as the last write. Every peer of the rotated Node observes it so each peer can update its WireGuard configuration with the rotated Node's new public key and the re-issued PSK reference.

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>",
  "new_public_key": "<base64 of the 32-byte Curve25519 public key>",
  "kid": "<wrap-key identifier of the re-issued PSK>",
  "wrap_key_version": 1
}
```

The payload posture is load-bearing. peer_key_rotated carries the rotated Node's new public key — public key material is safe to broadcast — together with the (kid, wrap_key_version) reference to the re-issued pairwise PSK row. It deliberately carries NO PSK plaintext and NO ciphertext. A peer that needs the PSK reads the canonical wrapped row out of plexsphere.psk and threads (kid, wrap_key_version) into the unwrap call — exactly as it does for peer_psk_assigned. Keeping only the reference on the bus keeps the "no secret material on the wire" invariant load-bearing at the event boundary, not just at the storage boundary, and avoids duplicating the wrapped ciphertext onto every peer's event copy.
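
A consumer-side view of the peer_key_rotated payload could be modelled as below. The Go type and field names are assumptions for illustration (the real constructor is events.NewPeerKeyRotated, whose signature is not shown here); the JSON keys are the ones in the payload above. The sketch relies on encoding/json serialising a []byte field as a base64 string, and it has no field for PSK plaintext or ciphertext, in line with the payload posture.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// PeerKeyRotated mirrors the JSON payload; the Go names are assumptions.
type PeerKeyRotated struct {
	EventID        string `json:"event_id"`
	OccurredAt     string `json:"occurred_at"`
	PeerID         string `json:"peer_id"`
	DomainID       string `json:"domain_id"`
	NodeID         string `json:"node_id"`
	NewPublicKey   []byte `json:"new_public_key"` // emitted as base64 by encoding/json
	KID            string `json:"kid"`
	WrapKeyVersion int    `json:"wrap_key_version"`
}

// roundTrip encodes and decodes an event, as a peer would after the relay.
func roundTrip(ev PeerKeyRotated) (PeerKeyRotated, error) {
	raw, err := json.Marshal(ev)
	if err != nil {
		return PeerKeyRotated{}, err
	}
	var out PeerKeyRotated
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	key := make([]byte, 32)
	key[0] = 1 // placeholder bytes, not a real key
	ev := PeerKeyRotated{
		EventID:        "<uuidv7>",
		OccurredAt:     time.Now().UTC().Format(time.RFC3339Nano),
		PeerID:         "<uuidv7>",
		DomainID:       "<uuidv7>",
		NodeID:         "<uuidv7>",
		NewPublicKey:   key,
		KID:            "<wrap-key identifier>",
		WrapKeyVersion: 1,
	}
	raw, _ := json.MarshalIndent(ev, "", "  ")
	fmt.Println(string(raw))

	decoded, err := roundTrip(ev)
	fmt.Println(len(decoded.NewPublicKey), err)
}
```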

Refusal codes

The two HTTP surfaces — the node-facing POST /v1/keys/rotate and the operator-facing POST /v1/nodes/{id}/keys/rotate — translate the Manager and repository sentinels into Problem-Details responses. The code values below are the stable machine-readable discriminators; the string detail text is diagnostic only.

POST /v1/keys/rotate — node-facing rotation completion

| HTTP | Problem code | When |
| --- | --- | --- |
| 401 | unauthorized | The Node Secret Key in the Authorization: Bearer header is missing, malformed, or revoked, so the rotating Node could not be resolved. |
| 400 | malformed_keys_rotate_request | The request body did not decode as a KeysRotateRequest envelope (invalid JSON or an unknown field). |
| 413 | keys_rotate_body_too_large | The request body exceeded the body cap. |
| 422 | keys_rotate_public_key_invalid | new_public_key is not a 32-byte Curve25519 key, or is the all-zero degenerate value. Rejected before the Peer is resolved. |
| 404 | keys_rotate_peer_not_found | No live Peer row resolves for the authenticated Node. |
| 409 | keys_rotate_no_pending_rotation | The Node has no pending rotation row and the submitted key matches no already-completed rotation: there is nothing to complete and the request is not an idempotent retry. Surfaces peers.ErrNoPendingRotation. A retry of a key whose rotation already completed instead returns 200 with the prior receipt; see Idempotent completion. |
| 422 | keys_rotate_public_key_unchanged | new_public_key is byte-identical to the Node's current public key. Surfaces peers.ErrRotationPublicKeyUnchanged; carries a field-level detail. |
| 501 | keys_rotate_not_provisioned | A load-bearing port (the rotation completer, the NSK resolver, the Node repository, or the Peer lookup) is unwired. The fail-closed shim. |

POST /v1/nodes/{id}/keys/rotate — operator-facing rotation trigger

| HTTP | Problem code | When |
| --- | --- | --- |
| 401 | unauthorized | No authenticated principal. |
| 403 | insufficient_relation | The caller lacks the act permission on the target Node's Resource. The ReBAC check runs before any existence check, and an audit row with outcome insufficient_relation is recorded on the deny path. |
| 404 | node_not_found | The target Node id resolves to no Node. |
| 501 | node_keys_rotate_not_provisioned | The rotation requester or the ReBAC / authn ports are unwired. The fail-closed shim. |

The audit-sink port is excluded from both fail-closed gates: a nil audit sink degrades observability, not the security decision, so it must not over-fire the 501.
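
The sentinel-to-problem translation for the node-facing surface can be sketched with errors.Is. This is a hedged illustration, not the handler's code: the problem struct, problemFor, ErrPublicKeyInvalid, and the 500 internal_error fallback are all assumptions introduced here; the status and code pairs for the three named sentinels are taken from the table above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrPeerNotFound               = errors.New("peer not found")
	ErrNoPendingRotation          = errors.New("no pending rotation")
	ErrRotationPublicKeyUnchanged = errors.New("rotation public key unchanged")
	// ErrPublicKeyInvalid is a hypothetical sentinel for this sketch.
	ErrPublicKeyInvalid = errors.New("rotation public key invalid")
)

// problem holds the stable discriminators of a Problem-Details response.
type problem struct {
	Status int
	Code   string
}

// problemFor maps a (possibly wrapped) sentinel to its status and code.
func problemFor(err error) problem {
	switch {
	case errors.Is(err, ErrPublicKeyInvalid):
		return problem{http.StatusUnprocessableEntity, "keys_rotate_public_key_invalid"}
	case errors.Is(err, ErrPeerNotFound):
		return problem{http.StatusNotFound, "keys_rotate_peer_not_found"}
	case errors.Is(err, ErrNoPendingRotation):
		return problem{http.StatusConflict, "keys_rotate_no_pending_rotation"}
	case errors.Is(err, ErrRotationPublicKeyUnchanged):
		return problem{http.StatusUnprocessableEntity, "keys_rotate_public_key_unchanged"}
	default:
		// Fallback chosen for this sketch; the table above does not define it.
		return problem{http.StatusInternalServerError, "internal_error"}
	}
}

func main() {
	wrapped := fmt.Errorf("complete rotation: %w", ErrNoPendingRotation)
	fmt.Println(problemFor(wrapped))
	fmt.Println(problemFor(ErrRotationPublicKeyUnchanged))
}
```

Matching with errors.Is rather than equality keeps the mapping stable when a sentinel is wrapped with extra context on its way up from the repository.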

No-sweeper rationale

The workflow introduces no background sweeper goroutine and no reconcile probe. A peer_key_rotation row that stays pending — because the Node never called back — is resolved by an operator re-request, not by a timer. This is a deliberate KISS / YAGNI choice:

  • The dispatch is already redundant. The durable pending row, the heartbeat hint, and SSE Last-Event-ID replay together ensure that a live Node re-learns a pending rotation. A sweeper would add one more channel without adding coverage.
  • A sweeper cannot tell "stuck" from "slow". It would have to guess a Node's liveness, and could retire a key a slow-but-healthy Node is still about to rotate — a race a timer cannot win.
  • The operator is the right retry authority. RequestRotation is idempotent: re-issuing the trigger against a Node with a pending rotation returns the existing row without stacking a duplicate row or a second rotate_keys event, as the command contract above states. Making the operator the retry authority keeps the workflow observable: a stuck rotation surfaces as a visible pending row an operator acts on, not as silent timer churn.

Cross-references

  • ./peers.md — the Key and Peer Manager sub-context: the Peer aggregate, the wrapped pairwise PSK lifecycle the AssignPSK wrap cycle is reused from, and the closed event-type set this workflow extends from four to six.
  • ./sse.md — the Signed SSE Event Bus that carries the rotate_keys command and the peer_key_rotated notification verbatim across the relay boundary.
  • ./endpoints.md — the NAT endpoint intake surface and the peer_endpoint_changed event; rotation leaves the endpoint snapshot untouched.
  • ./relay-fallback.md — the per-peer-pair relay-fallback assignment surface that CompleteRotation leaves byte-for-byte unchanged.
  • ./reachability.md — the per-Node reachability projection; the heartbeat that carries the rotation hint also drives the reachability state machine.
  • ../../reference/cli/plexctl/key.md — the matching plexctl reference for plexctl key rotate (including the --dry-run preview branch).