Key and Peer Manager

This document is the authoritative bounded-context reference for the peers sub-context that ships under ../../../internal/mesh/peers/. It covers the ubiquitous language, the schema overview for the plexsphere.peer and plexsphere.psk tables, the closed set of six outbox event types the Manager emits, the AES-256-GCM encrypt-at-rest envelope the wrapped pairwise PSK travels in, the threat model that keeps plaintext PSK material off both the wire and the database, and the rationale behind the internal/mesh/peers depguard carve-out.

The peers sub-context is the Key and Peer Manager: it owns the per-Domain Peer aggregate (one row per (Domain, Node) pair) and the wrapped pairwise PSK lifecycle that travels alongside it. The aggregate is addressed by surrogate id and kept in plexsphere.peer; the live PSK is kept in plexsphere.psk, wrapped under a per-Domain AES-256 wrap key resolved from plexsphere.domain_nsk_wrap_key. The wrap-key ledger itself is owned by the Node-registration context: peers consumes wrap-key rows but does not mint or rotate them.

Ubiquitous language

The terms below travel verbatim across the Go code, the SQL schema, the outbox event payloads, and the per-Domain audit trail. Keep the names stable in error messages, structured-log attributes, and any operator-facing tooling.

| Term | Definition |
| --- | --- |
| Peer | The aggregate root for one (Domain, Node) registration the mesh owns. Surrogate id (peer.id, UUIDv7) plus a natural-key UNIQUE on (domain_id, node_id). Lifecycle: registered (deregistered_at IS NULL) -> deregistered (deregistered_at IS NOT NULL); soft-delete keeps the row available for forensic replay. |
| PSK (pairwise pre-shared key) | The symmetric secret a plexd agent and its peer share for WireGuard handshake authentication. Wrapped at rest under the per-Domain wrap key; plaintext PSK is never persisted. |
| Wrap Key | A per-Domain AES-256 key whose lifecycle is owned by the Node-registration context. The peers package consumes wrap-key rows from plexsphere.domain_nsk_wrap_key via Repository.GetActiveWrapKey but does not mint or rotate them. |
| kid | Opaque human-readable wrap-key identifier the Manager stamps onto every freshly-issued PSK row. Mirrors the JOSE kid convention. SQL CHECK on plexsphere.psk.kid pins the regex ^[A-Za-z0-9._:-]+$. The software wrapper defaults to psk-software-v1. |
| wrap_key_version | Per-Domain monotonic version of the wrap key the wrap was produced under. Composite FK from plexsphere.psk(domain_id, wrap_key_version) to plexsphere.domain_nsk_wrap_key(domain_id, version) guarantees a PSK row cannot reference a wrap-key version that does not exist for its Domain. |
| Wrapped PSK | The AES-256-GCM ciphertext of the PSK plaintext, formatted as <12-byte nonce> \|\| <ciphertext+16-byte tag>. SQL CHECK pins the length to BETWEEN 28 AND 4096 bytes. |
| Manager | The sole command-side entry point for the sub-context (Register, AssignPSK, Deregister). Defined in ../../../internal/mesh/peers/manager.go; emits exactly six outbox event types from the closed wire-contract set. |
| Wrapper | The AES-256-GCM seam (Wrap / Unwrap) the Manager uses to encrypt PSK plaintext under the active wrap key. Production wires *softwareWrapper; a future HSM-backed adapter can override the wrap path without touching the Manager. |
| Repository | The per-transaction port the Manager calls inside the Store callback. Production wires a sqlc-generated adapter; tests inject a peers-package value-type fake without importing pgx. |
| Store | The transaction-callback seam (InTx(ctx, fn) error) the Manager runs every command inside. Mirrors the relay convention in ../../../internal/mesh/sse/relay.go so the repo-wide convention stays uniform. |
| Provider | The read-side projection (SnapshotForDomain) the registration service consumes. Lands in the follow-on peer-snapshot integration; owns the read port and keeps the Manager / Provider split between command and query surfaces. |
| Aggregate id | peer.id (UUIDv7). The id the Manager stamps onto every outbox row's aggregate_id column so consumers can route per-peer events without joining back to plexsphere.peer. |
| Endpoint observation | The publicly-observed (host, port) tuple a plexd agent reports through the NAT-discovery intake surface, captured on the Peer aggregate as the (last_endpoint, last_endpoint_port, last_endpoint_reported_at) triple. The control plane gates reported_at against the server's wall clock before persisting; the wire layer renders the typed columns as a host:port string. See ./endpoints.md for the full intake surface. |
| Endpoint sweeper | The periodic background sweep that stamps last_endpoint_stale_at on every Peer whose last_endpoint_reported_at is older than the per-Domain endpoint_ttl. Each transition emits a peer_endpoint_changed event with an empty endpoint so downstream consumers know the direct path is no longer trustworthy. Implemented in ../../../internal/mesh/peers/endpoint_sweeper.go; see ./endpoints.md for the operational contract. |
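
The Store and Repository seams in the table can be pictured with a small in-memory model. This is a minimal sketch, not the code in internal/mesh/peers: only the InTx(ctx, fn) error shape comes from the table, while the callback signature, the InsertPeer method, and the fakeStore / fakeRepo types are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Repository is a hypothetical, trimmed-down per-transaction port.
type Repository interface {
	InsertPeer(ctx context.Context, domainID, nodeID string) (string, error)
}

// Store is the transaction-callback seam. The callback signature is an
// assumption; the table only fixes the InTx(ctx, fn) error shape.
type Store interface {
	InTx(ctx context.Context, fn func(context.Context, Repository) error) error
}

// fakeRepo stages writes in a private copy of the peer map.
type fakeRepo struct{ staged map[string]string }

func (r *fakeRepo) InsertPeer(_ context.Context, domainID, nodeID string) (string, error) {
	key := domainID + "/" + nodeID
	if _, ok := r.staged[key]; ok {
		return "", errors.New("peer already registered")
	}
	id := fmt.Sprintf("peer-%d", len(r.staged)+1)
	r.staged[key] = id
	return id, nil
}

// fakeStore commits the staged copy only when the callback returns nil.
type fakeStore struct{ peers map[string]string }

func (s *fakeStore) InTx(ctx context.Context, fn func(context.Context, Repository) error) error {
	staged := make(map[string]string, len(s.peers))
	for k, v := range s.peers {
		staged[k] = v
	}
	if err := fn(ctx, &fakeRepo{staged: staged}); err != nil {
		return err // rollback: the staged copy is discarded
	}
	s.peers = staged // commit
	return nil
}

// register runs one command inside the Store callback.
func register(ctx context.Context, st Store, domainID, nodeID string) (string, error) {
	var id string
	err := st.InTx(ctx, func(ctx context.Context, repo Repository) error {
		var err error
		id, err = repo.InsertPeer(ctx, domainID, nodeID)
		return err
	})
	return id, err
}

// registerTwice registers the same (Domain, Node) pair twice against one store.
func registerTwice() (firstID string, secondErr error) {
	st := &fakeStore{peers: map[string]string{}}
	firstID, _ = register(context.Background(), st, "domain-a", "node-1")
	_, secondErr = register(context.Background(), st, "domain-a", "node-1")
	return firstID, secondErr
}

func main() {
	id, err := registerTwice()
	fmt.Println(id, err)
}
```

Because the fake depends only on the interfaces, a test can exercise command logic without a database driver, which is the property the Repository row describes.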

Schema

The peers sub-context owns four tables in the plexsphere schema — plexsphere.peer, plexsphere.psk, plexsphere.peer_relay_assignment, and plexsphere.peer_key_rotation. The authoritative sources are the migrations at ../../../internal/platform/db/migrations/0012_peers.sql, ../../../internal/platform/db/migrations/0030_peer_relay_assignment.sql, and ../../../internal/platform/db/migrations/0032_peer_key_rotation.sql; the column meanings below paraphrase those files' DECISION blocks. All three migrations' Down blocks deliberately refuse the downgrade with SQLSTATE 0A000 because the wrapped-PSK ciphertext chain, the per-peer relay-assignment history, and the per-Node key-rotation history are all security-critical audit evidence.

plexsphere.peer — Peer aggregate row

| Column | Type | Meaning |
| --- | --- | --- |
| id | uuid PRIMARY KEY | Surrogate aggregate id (UUIDv7). Stamped onto every outbox row's aggregate_id so consumers route per-peer without joining back to this table. |
| domain_id | uuid NOT NULL | FK to plexsphere.domains(id) ON DELETE RESTRICT. Half of the natural key. |
| node_id | uuid NOT NULL | FK to plexsphere.nodes(id) ON DELETE RESTRICT. The other half of the natural key. RESTRICT prevents an in-flight Node deletion from orphaning a live peer row; the Deregister flow runs first. |
| created_at | timestamptz NOT NULL DEFAULT now() | Insertion timestamp. |
| updated_at | timestamptz NOT NULL DEFAULT now() | Maintained by the BEFORE UPDATE trigger plexsphere.peer_set_updated_at so the application layer cannot accidentally drift the column off the wall-clock. |
| deregistered_at | timestamptz NULL | Soft-delete tombstone. NULL means "live"; non-NULL means "no longer live" but kept for forensic replay. |
| last_endpoint | inet NULL | Host portion of the most recent NAT-observed endpoint the agent reported, stored as a Postgres inet so address-family is validated at the storage boundary and containment queries (<<) work for triage. NULL means "no observation has ever landed". Added by migration 0029_peer_endpoint.sql. |
| last_endpoint_port | integer NULL CHECK (last_endpoint_port IS NULL OR last_endpoint_port BETWEEN 1 AND 65535) | Port portion of the most recent NAT-observed endpoint, paired with last_endpoint. The CHECK pins the valid non-zero port range (1 to 65535) so a malformed handler push cannot smuggle a zero or out-of-range port into the column. |
| last_endpoint_reported_at | timestamptz NULL | Wall-clock instant the agent attached to its endpoint PUT body, coerced to UTC. Backs the staleness range scan and the freshness gate; NULL means "no observation has ever landed". |
| last_endpoint_stale_at | timestamptz NULL | Stale-tombstone timestamp the endpoint sweeper stamps when an observation passes its per-Domain endpoint_ttl without refresh. NULL means "the live observation is still within TTL"; non-NULL means "the sweeper has tombstoned this row and emitted a stale peer_endpoint_changed event". |

Constraints and indexes:

  • peer_domain_node_uq UNIQUE (domain_id, node_id) — the natural-key invariant Manager.Register relies on. A duplicate INSERT trips pgerrcode 23505 and the repository classifies it as peers.ErrPeerAlreadyRegistered.
  • peer_deregistered_after_created CHECK (deregistered_at IS NULL OR deregistered_at >= created_at) — pins time monotonicity for the soft-delete.
  • peer_domain_idx ON plexsphere.peer (domain_id) WHERE deregistered_at IS NULL — backs the Provider's SnapshotForDomain read so the snapshot stays a cheap b-tree range scan even after long-running Domains accumulate soft-deleted rows.
  • peer_endpoint_stale_idx ON plexsphere.peer (last_endpoint_reported_at) WHERE deregistered_at IS NULL AND last_endpoint_stale_at IS NULL — backs the endpoint sweeper's range scan "Peers whose last_endpoint_reported_at is older than the per-Domain endpoint_ttl and which are still in a watchable state". The partial predicate keeps deregistered and already-stale rows out of the index so each sweeper tick stays a cheap b-tree range scan rather than a sequential scan across the whole tenancy. Added by migration 0029_peer_endpoint.sql.
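
The 23505 classification described for peer_domain_node_uq can be sketched as follows. This is an illustration under assumptions, not the repository adapter itself: it assumes the driver error exposes an SQLState() string method, and fakePgError, classifyInsertPeer, and the sentinel's message text are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPeerAlreadyRegistered is a local stand-in for the sentinel named above.
var ErrPeerAlreadyRegistered = errors.New("peers: peer already registered")

// sqlStateError is the minimal shape this sketch assumes of a driver error.
type sqlStateError interface {
	error
	SQLState() string
}

// uniqueViolation is the SQLSTATE Postgres reports for a UNIQUE conflict.
const uniqueViolation = "23505"

// classifyInsertPeer maps a unique violation onto the domain sentinel and
// passes every other error (including nil) through unchanged.
func classifyInsertPeer(err error) error {
	var se sqlStateError
	if errors.As(err, &se) && se.SQLState() == uniqueViolation {
		return fmt.Errorf("%w: %v", ErrPeerAlreadyRegistered, err)
	}
	return err
}

// fakePgError stands in for a real driver error in this sketch.
type fakePgError struct{ code string }

func (e *fakePgError) Error() string    { return "SQLSTATE " + e.code }
func (e *fakePgError) SQLState() string { return e.code }

// isAlreadyRegistered reports how an INSERT failing with the given SQLSTATE
// would be classified.
func isAlreadyRegistered(code string) bool {
	raw := fmt.Errorf("insert peer: %w", &fakePgError{code: code})
	return errors.Is(classifyInsertPeer(raw), ErrPeerAlreadyRegistered)
}

func main() {
	fmt.Println(isAlreadyRegistered("23505")) // unique violation
	fmt.Println(isAlreadyRegistered("23503")) // foreign-key violation
}
```

Wrapping with %w keeps the sentinel matchable through errors.Is while the original driver text stays in the message for triage.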

plexsphere.psk — wrapped pairwise PSK

| Column | Type | Meaning |
| --- | --- | --- |
| peer_id | uuid PRIMARY KEY | 1:1 with plexsphere.peer(id), ON DELETE RESTRICT. The 1:1 design lets the partial unique index psk_live_peer_uq enforce "at most one live PSK per peer" while retired rows sit alongside. |
| domain_id | uuid NOT NULL | Half of the composite FK to the wrap-key ledger. |
| kid | text NOT NULL CHECK (kid ~ '^[A-Za-z0-9._:-]+$') | Opaque wrap-key identifier. |
| wrapped_psk | bytea NOT NULL CHECK (length(wrapped_psk) BETWEEN 28 AND 4096) | The AES-256-GCM ciphertext (<nonce> \|\| <ct+tag>). The lower bound is the GCM floor (12 nonce + 16 tag); the upper bound caps any future HSM/PKCS#11 wrap form without inviting unbounded blob writes. |
| wrap_key_version | integer NOT NULL CHECK (wrap_key_version >= 1) | Other half of the composite FK to the wrap-key ledger. |
| issued_at | timestamptz NOT NULL DEFAULT now() | Insertion timestamp. |
| retired_at | timestamptz NULL | Soft-retirement tombstone. NULL means "live"; non-NULL means "retired" but kept for forensic replay of the wrap-key chain. |

Constraints and indexes:

  • psk_retired_after_issued CHECK (retired_at IS NULL OR retired_at >= issued_at): time-monotonic retirement.
  • psk_wrap_key_fk FOREIGN KEY (domain_id, wrap_key_version) REFERENCES plexsphere.domain_nsk_wrap_key (domain_id, version) ON DELETE RESTRICT — the composite FK that pins the wrap-key chain at the SQL layer. A single-column FK on wrap_key_version alone would let a PSK row reference a wrap-key version belonging to a different Domain, silently breaking the wrap chain on forensic replay; the composite FK forecloses that drift.
  • psk_live_peer_uq UNIQUE INDEX ON plexsphere.psk (peer_id) WHERE retired_at IS NULL — the partial unique index that surfaces duplicate live-PSK INSERTs as pgerrcode 23505. The repository classifies it onto peers.ErrPSKAlreadyAssigned.
  • psk_wrap_key_idx ON plexsphere.psk (domain_id, wrap_key_version) WHERE retired_at IS NULL — backs the rotation sweep that lists every PSK still wrapped under a given (domain_id, wrap_key_version) so a future rotation pass can re-wrap in batches without a sequential scan.

plexsphere.peer_relay_assignment — relay-fallback assignment row

The relay-fallback aggregate the Mesh Fabric sub-story attaches to: one live row per Peer aggregate carrying the bridge Node that serves as the peer's relay-fallback, the (inet, port) pair the plexd agent dials when the direct WireGuard handshake times out, and the wall-clock window (assigned_at, retired_at) the assignment was active in. Soft-retired rows accumulate alongside live rows so a forensic replay can reconstruct "which bridge served this peer at 03:14 UTC". At most one live row per Peer is enforced by the partial unique index peer_relay_assignment_live_uq, mirroring the psk_live_peer_uq pattern. Added by migration 0030_peer_relay_assignment.sql.

| Column | Type | Meaning |
| --- | --- | --- |
| id | uuid PRIMARY KEY | Surrogate aggregate id (UUIDv7). |
| domain_id | uuid NOT NULL | FK to plexsphere.domains(id) ON DELETE RESTRICT. Anchors the assignment to its tenant. |
| peer_id | uuid NOT NULL | FK to plexsphere.peer(id) ON DELETE RESTRICT. Names the Peer aggregate the assignment is bound to. RESTRICT prevents an in-flight Peer deletion from orphaning a live relay row; the Deregister flow runs first. |
| bridge_node_id | uuid NOT NULL | FK to plexsphere.nodes(id) ON DELETE RESTRICT. Names the bridge Node that serves as the peer's relay-fallback. RESTRICT prevents a bridge teardown from dropping live assignments out from under their peers; the relay-assigner sweep must retire every dependent assignment first. |
| fallback_endpoint | inet NOT NULL | Host portion of the dial address plexd uses when the direct handshake times out. Sourced from the selected bridge Node's last_endpoint (or its mesh_ip when no NAT observation has landed yet). The (inet, integer) split lets an operator triage a relay outage with a real containment query (fallback_endpoint << inet '10.0.0.0/8') rather than a regex hack on a text column. |
| fallback_endpoint_port | integer NOT NULL CHECK (fallback_endpoint_port BETWEEN 1 AND 65535) | Port portion paired with fallback_endpoint. Defaults to peers.DefaultBridgeRelayPort (51820) when no per-bridge override has been observed; a future story will surface a per-bridge configured listen port. |
| assigned_at | timestamptz NOT NULL DEFAULT now() | Wall-clock instant the assignment became live. |
| retired_at | timestamptz NULL | Soft-retirement tombstone. NULL means "live"; non-NULL means "retired" but kept for forensic replay of the per-peer relay chain. |

Constraints and indexes:

  • peer_relay_assignment_retired_after_assigned CHECK (retired_at IS NULL OR retired_at >= assigned_at) — time-monotonic retirement, mirroring the psk_retired_after_issued shape.
  • peer_relay_assignment_live_uq UNIQUE INDEX ON plexsphere.peer_relay_assignment (peer_id) WHERE retired_at IS NULL — partial unique index that surfaces duplicate live-INSERTs as pgerrcode 23505. The Manager always soft-retires the prior row inside the same transaction before upserting a fresh assignment; a 23505 here is a programmer error, not a steady-state outcome.
  • peer_relay_assignment_bridge_idx ON plexsphere.peer_relay_assignment (bridge_node_id) WHERE retired_at IS NULL: partial helper index backing the relay assigner's bridge-churn sweep ("every peer whose currently-assigned bridge just transitioned to unreachable"). The partial predicate keeps soft-retired tombstones out of the index so each sweep tick stays a cheap b-tree range scan rather than a sequential scan across every historical assignment.

The migration's Down block refuses the downgrade with SQLSTATE 0A000. The assignment history is the forensic evidence chain that links every peer-to-bridge relay decision to the wall-clock window the assignment was active in; dropping the table is a regulatory-retention regression mirroring the 0012_peers and 0008_node_secret_keys stance. Operators performing a legitimate wipe-and-reinstall must drop the Postgres database itself; migrations.Down is not a destructive teardown tool.

plexsphere.peer_key_rotation — KeyRotation aggregate row

The key-rotation aggregate the Key Rotation Workflow sub-story attaches to: one row per key-rotation attempt a Node makes, carrying the lifecycle the rotation walks (pending -> completed, or pending -> superseded when a fresh rotation request pre-empts an in-flight one), the wall-clock window (requested_at, completed_at) the attempt spanned, and the before/after WireGuard public keys so a forensic replay can reconstruct which public key a Node presented at any past instant. At most one pending row per Node is enforced by the partial unique index peer_key_rotation_pending_uq, mirroring the psk_live_peer_uq pattern. Added by migration 0032_peer_key_rotation.sql. See ./key-rotation.md for the full lifecycle, the two-channel dispatch model, and the rotation-ordering invariants.

| Column | Type | Meaning |
| --- | --- | --- |
| id | uuid PRIMARY KEY | Surrogate aggregate id (UUIDv7). Stamped onto the outbox row's aggregate_id. |
| domain_id | uuid NOT NULL | FK to plexsphere.domains(id) ON DELETE RESTRICT. Carries the Domain isolation boundary. |
| node_id | uuid NOT NULL | FK to plexsphere.nodes(id) ON DELETE RESTRICT. The Node whose mesh key is rotating; the column the partial unique index keys on. RESTRICT keeps the rotation evidence chain intact across a Node teardown. |
| peer_id | uuid NOT NULL | FK to plexsphere.peer(id) ON DELETE RESTRICT. The live Peer aggregate the rotation is bound to. |
| state | text NOT NULL CHECK (state IN ('pending', 'completed', 'superseded')) | The lifecycle discriminator. The SQL CHECK pins the closed set so a mis-wired INSERT fails at the database rather than promoting an unknown value into the domain. |
| requested_at | timestamptz NOT NULL DEFAULT now() | Wall-clock instant the rotation was requested. |
| previous_public_key | bytea NULL CHECK (previous_public_key IS NULL OR octet_length(previous_public_key) = 32) | The Node's Curve25519 public key before the rotation. NULL when the caller did not supply it. |
| new_public_key | bytea NULL CHECK (new_public_key IS NULL OR octet_length(new_public_key) = 32) | The Node's Curve25519 public key after the rotation. NULL while the row is pending; the completion flow stamps it. |
| completed_at | timestamptz NULL | Wall-clock instant the rotation completed. NULL marks the row in flight. |

Constraints and indexes:

  • peer_key_rotation_completed_after_requested CHECK (completed_at IS NULL OR completed_at >= requested_at) — time-monotonic completion, mirroring the psk_retired_after_issued and peer_relay_assignment_retired_after_assigned shapes.
  • peer_key_rotation_pending_uq UNIQUE INDEX ON plexsphere.peer_key_rotation (node_id) WHERE state = 'pending' — the partial unique index that is the durable name of the one-pending-rotation-per-Node invariant. A second rotation request against a Node with an in-flight rotation first flips the prior row to superseded inside the same transaction so the index admits the fresh pending row.
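
The one-pending-rotation-per-Node invariant can be modelled in memory to show why the supersede step must come first. This is a sketch only: the ledger and rotation types are hypothetical, and a slice stands in for the table and its partial unique index.

```go
package main

import "fmt"

// rotation is a hypothetical in-memory stand-in for one
// plexsphere.peer_key_rotation row.
type rotation struct {
	NodeID string
	State  string // "pending", "completed", or "superseded"
}

// ledger models the table; it is not the real repository.
type ledger struct{ rows []rotation }

// requestRotation flips any in-flight row for the Node to superseded
// before appending the fresh pending row, so at most one pending row
// per Node ever exists.
func (l *ledger) requestRotation(nodeID string) {
	for i := range l.rows {
		if l.rows[i].NodeID == nodeID && l.rows[i].State == "pending" {
			l.rows[i].State = "superseded"
		}
	}
	l.rows = append(l.rows, rotation{NodeID: nodeID, State: "pending"})
}

// count returns how many rows a Node has in the given state.
func (l *ledger) count(nodeID, state string) int {
	n := 0
	for _, r := range l.rows {
		if r.NodeID == nodeID && r.State == state {
			n++
		}
	}
	return n
}

// countsAfter issues the given number of rotation requests for one Node
// and reports the resulting pending and superseded row counts.
func countsAfter(requests int) (pending, superseded int) {
	l := &ledger{}
	for i := 0; i < requests; i++ {
		l.requestRotation("node-1")
	}
	return l.count("node-1", "pending"), l.count("node-1", "superseded")
}

func main() {
	pending, superseded := countsAfter(3)
	fmt.Println(pending, superseded) // 1 2
}
```

In the real schema both writes share one transaction, so the partial unique index never observes two pending rows for the same Node.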

The migration's Down block refuses the downgrade with SQLSTATE 0A000. The rotation history is the forensic evidence chain that links every Node WireGuard public-key transition to the wall-clock window the rotation spanned; dropping it is a security-audit regression mirroring the 0012_peers, 0029_peer_endpoint, and 0030_peer_relay_assignment stance.

Event types

The Manager emits exactly six outbox event types — no more, no less. The literals below are the bare snake_case strings the relay routes peer events on; downstream consumers switch on these exact literals. The closed set is enforced by the AST gate at ../../../tests/workspace/peers_event_type_set_test.go; new peer-side event types must land in ../../../internal/identity/tenancy/events/events.go first and be added to the gate's allow-list.

The exact literals the wire contract pins are:

  • peer_registered
  • peer_psk_assigned
  • peer_deregistered
  • peer_endpoint_changed
  • rotate_keys
  • peer_key_rotated

Five of the six — every literal except rotate_keys — are past-tense state-change notifications: something has happened to a Peer aggregate and the bus carries the fact. rotate_keys is the lone exception: it is an imperative command addressed to a single Node, telling it to generate a fresh Curve25519 keypair and call POST /v1/keys/rotate. The wire shape is identical (an outbox row routed on its bare snake_case event_type); only the semantics differ. See ./key-rotation.md for the command's role in the rotation workflow.

The bare snake_case form is deliberate. The earlier tenancy events use the aggregate-prefixed CamelCase form (tenancy.NodeRegistered etc.); the wire-contract spec explicitly fixes the bare snake_case strings as the closed set the relay routes peer events on, so the new constants land in this form to satisfy the spec without churning the prior wire contract.
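
A consumer switching on the closed set might look like the sketch below. The constant names and literals are the ones this section pins; the classify function and its "command" / "notification" / "unknown" labels are hypothetical and exist only to illustrate the semantic split.

```go
package main

import "fmt"

// The six literals of the closed wire-contract set.
const (
	TypePeerRegistered      = "peer_registered"
	TypePeerPSKAssigned     = "peer_psk_assigned"
	TypePeerDeregistered    = "peer_deregistered"
	TypePeerEndpointChanged = "peer_endpoint_changed"
	TypeRotateKeys          = "rotate_keys"
	TypePeerKeyRotated      = "peer_key_rotated"
)

// classify is a hypothetical consumer-side helper: it separates the lone
// imperative command from the five past-tense notifications and rejects
// anything outside the closed set.
func classify(eventType string) string {
	switch eventType {
	case TypeRotateKeys:
		return "command"
	case TypePeerRegistered, TypePeerPSKAssigned, TypePeerDeregistered,
		TypePeerEndpointChanged, TypePeerKeyRotated:
		return "notification"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(classify(TypeRotateKeys))
	fmt.Println(classify(TypePeerRegistered))
	fmt.Println(classify("tenancy.NodeRegistered"))
}
```

Matching on the exact literals means an aggregate-prefixed CamelCase type from the earlier tenancy events falls through to the default branch rather than being mistaken for a peer event.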

peer_registered

Constructed by events.NewPeerRegistered; constant TypePeerRegistered = "peer_registered". Emitted by Manager.Register when a fresh Peer aggregate lands for a (DomainID, NodeID) pair. The payload deliberately omits any key material: the wire-shape spec is "ID triple plus event metadata, plus the additive relay-fallback hint when a bridge candidate is available".

JSON payload shape:

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>",
  "fallback_endpoint": "<host:port, omitted when no bridge candidate is available>"
}
```

Validation invariants on the constructor: peer_id, domain_id, and node_id must all be non-zero. A zero now defaults to time.Now().UTC(); a non-zero now is coerced to UTC.

The fallback_endpoint field is additive and optional: a non-empty value carries the dial address the plexd agent uses when the direct WireGuard handshake times out, sourced from the bridge Node the relay-assigner chose for this peer; an empty value is dropped from the JSON wire-form by the omitempty tag so a legacy consumer that pre-dates the relay-fallback surface observes the original 5-field payload shape byte-for-byte. The empty-string case is the canonical no-bridge-candidate signal; see ./relay-fallback.md for the chooser's decision tree and the composition of the empty fallback_endpoint with the (separate) endpoint-snapshot field on the silent-unreachability path.
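
The effect of the omitempty tag can be checked with a small sketch. The struct below is a hypothetical stand-in for the real payload type: only the JSON field names come from the shape above, and the single-letter values are placeholders rather than real identifiers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// peerRegistered is a hypothetical payload struct for this sketch.
type peerRegistered struct {
	EventID          string `json:"event_id"`
	OccurredAt       string `json:"occurred_at"`
	PeerID           string `json:"peer_id"`
	DomainID         string `json:"domain_id"`
	NodeID           string `json:"node_id"`
	FallbackEndpoint string `json:"fallback_endpoint,omitempty"`
}

// encode marshals a payload with placeholder IDs and the given fallback.
func encode(fallback string) string {
	b, err := json.Marshal(peerRegistered{
		EventID:          "e",
		OccurredAt:       "t",
		PeerID:           "p",
		DomainID:         "d",
		NodeID:           "n",
		FallbackEndpoint: fallback,
	})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Empty fallback: the key is absent, leaving the 5-field shape.
	fmt.Println(encode(""))
	// Non-empty fallback: the sixth field appears.
	fmt.Println(encode("203.0.113.7:51820"))
}
```

With an empty fallback the encoder emits only the five original keys, which is what lets a consumer written before the relay-fallback surface decode the payload unchanged.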

peer_psk_assigned

Constructed by events.NewPeerPSKAssigned; constant TypePeerPSKAssigned = "peer_psk_assigned". Emitted by Manager.AssignPSK after the Manager has wrapped the PSK plaintext and persisted the row. The payload deliberately carries NO ciphertext and NO plaintext — only kid and wrap_key_version travel on the bus. A bus consumer that needs to unwrap the PSK reads the canonical wrapped row out of plexsphere.psk and threads (kid, wrap_key_version) from the event into the unwrap call. This keeps the "no plaintext on the wire" invariant load-bearing at the event boundary, not just at the storage boundary.

JSON payload shape:

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>",
  "kid": "<wrap-key identifier, e.g. psk-software-v1>",
  "wrap_key_version": 1
}
```

Validation invariants on the constructor: peer_id, domain_id, and node_id must all be non-zero; kid must be non-empty (after trim); wrap_key_version must be >= 1 because the SQL CHECK on plexsphere.psk.wrap_key_version requires a positive integer.
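
The invariants above translate into a short constructor. This is a sketch under assumptions rather than the real events.NewPeerPSKAssigned: the id type is a 16-byte stand-in for a UUID, and the struct, error value, and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// id stands in for a UUID; the zero value models an unset ID.
type id [16]byte

// peerPSKAssigned is a hypothetical payload carrying no key material.
type peerPSKAssigned struct {
	PeerID, DomainID, NodeID id
	Kid                      string
	WrapKeyVersion           int
}

var errInvalidPayload = errors.New("invalid peer_psk_assigned payload")

// newPeerPSKAssigned enforces the invariants listed above.
func newPeerPSKAssigned(peerID, domainID, nodeID id, kid string, version int) (peerPSKAssigned, error) {
	var zero id
	switch {
	case peerID == zero, domainID == zero, nodeID == zero:
		return peerPSKAssigned{}, fmt.Errorf("%w: IDs must be non-zero", errInvalidPayload)
	case strings.TrimSpace(kid) == "":
		return peerPSKAssigned{}, fmt.Errorf("%w: kid must be non-empty", errInvalidPayload)
	case version < 1:
		return peerPSKAssigned{}, fmt.Errorf("%w: wrap_key_version must be >= 1, got %d", errInvalidPayload, version)
	}
	return peerPSKAssigned{
		PeerID: peerID, DomainID: domainID, NodeID: nodeID,
		Kid: kid, WrapKeyVersion: version,
	}, nil
}

// accepts reports whether the constructor admits the given kid and version
// when all three IDs are set.
func accepts(kid string, version int) bool {
	_, err := newPeerPSKAssigned(id{1}, id{2}, id{3}, kid, version)
	return err == nil
}

func main() {
	fmt.Println(accepts("psk-software-v1", 1))
	fmt.Println(accepts("   ", 1))
	fmt.Println(accepts("psk-software-v1", 0))
}
```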

peer_deregistered

Constructed by events.NewPeerDeregistered; constant TypePeerDeregistered = "peer_deregistered". Emitted by Manager.Deregister after the soft-delete UPDATE has stamped deregistered_at. The payload carries the timestamp the SQL UPDATE stamped onto the row so consumers can distinguish "was deregistered now" from "was already deregistered before this replay" without joining back to plexsphere.peer.

JSON payload shape:

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>",
  "deregistered_at": "<RFC3339Nano UTC>"
}
```

Validation invariants on the constructor: peer_id, domain_id, and node_id must all be non-zero; deregistered_at must be set — a soft-delete with a zero timestamp would silently desync the bus from the SQL row.

peer_endpoint_changed

Constructed by events.NewPeerEndpointChanged; constant TypePeerEndpointChanged = "peer_endpoint_changed". Emitted by the endpoint-intake handler on the first observation, on every subsequent observation whose (host, port) differs from the persisted value, on an un-stale transition where the prior row was tombstoned, and by the endpoint sweeper as a stale-tombstone signal (endpoint == "" in that case). The payload carries the new host:port string and the prior host:port so a downstream consumer can distinguish "first observation" (previous_endpoint is empty), "endpoint changed" (both are non-empty and differ), and "observation went stale" (endpoint is empty) without joining back to plexsphere.peer. See ./endpoints.md for the intake surface and the sweeper's tick contract.

JSON payload shape:

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>",
  "endpoint": "<host:port, or empty on the stale-tombstone signal>",
  "endpoint_reported_at": "<RFC3339Nano UTC>",
  "previous_endpoint": "<host:port, or empty on the first observation>",
  "fallback_endpoint": "<host:port, omitted when no bridge candidate is available>"
}
```

Validation invariants on the constructor: peer_id, domain_id, and node_id must all be non-zero; endpoint_reported_at must be set — a zero observation timestamp would silently desync the bus from the SQL row. The endpoint, previous_endpoint, and fallback_endpoint strings MAY be empty: an empty endpoint is the stale-tombstone signal emitted by the sweeper when an observation passes TTL without refresh, an empty previous_endpoint is the first-observation signal where no prior endpoint exists, and an empty fallback_endpoint is the no-bridge-candidate signal where the relay-assigner could not find a healthy bridge to compose with the observation. Handler-side parsing of host:port lives at the intake boundary; the constructor deliberately does not re-parse so the same payload shape carries both a live observation and a stale tombstone.

The fallback_endpoint field is additive and optional in the same sense as on peer_registered above: a non-empty value carries the dial address the plexd agent uses when the direct WireGuard handshake times out; an empty value is dropped from the JSON wire-form by the omitempty tag so a legacy consumer that pre-dates the relay-fallback surface observes the original 8-field payload shape byte-for-byte. The two empty-string fields compose: endpoint == "" AND fallback_endpoint == "" is the silent-unreachability signal. The direct path has gone stale AND no relay-fallback is available, so the plexd agent has nowhere left to dial. See ./relay-fallback.md for the relay-assigner's decision tree and the composition rules.
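
A consumer can read the signals described above straight off the three strings. The classifier below is a hypothetical sketch of that decision order, not code from the repository; the label strings are invented for illustration, and the "other" branch covers cases this section does not pin down, such as how previous_endpoint is populated on an un-stale transition.

```go
package main

import "fmt"

// classifySignal is a hypothetical consumer-side reading of a
// peer_endpoint_changed payload.
func classifySignal(endpoint, previousEndpoint, fallbackEndpoint string) string {
	switch {
	case endpoint == "" && fallbackEndpoint == "":
		// Direct path stale and no relay fallback: nowhere left to dial.
		return "silent-unreachability"
	case endpoint == "":
		// Stale tombstone from the sweeper, but a relay fallback exists.
		return "stale"
	case previousEndpoint == "":
		return "first-observation"
	case endpoint != previousEndpoint:
		return "changed"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classifySignal("198.51.100.4:51820", "", ""))
	fmt.Println(classifySignal("198.51.100.9:51820", "198.51.100.4:51820", ""))
	fmt.Println(classifySignal("", "198.51.100.4:51820", "203.0.113.7:51820"))
	fmt.Println(classifySignal("", "198.51.100.4:51820", ""))
}
```

The silent-unreachability case is tested first because it is a refinement of the stale case; checking endpoint == "" alone would mask it.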

rotate_keys

Constructed by events.NewRotateKeys; constant TypeRotateKeys = "rotate_keys". Appended by Manager.RequestRotation when an operator requests a mesh-key rotation for a Node. Unlike the five past-tense peer notifications, rotate_keys is an imperative command dispatched to the addressed Node's per-node SSE subject leaf — it instructs the Node to generate a fresh Curve25519 keypair and call POST /v1/keys/rotate. The payload carries only the (Peer, Domain, Node) ID triple; the command needs no further parameters because the Node owns its own keypair generation.

JSON payload shape:

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>"
}
```

Validation invariants on the constructor: peer_id, domain_id, and node_id must all be non-zero — rotate_keys is addressed at a single Node, so a zero ID would dispatch the command at no Node at all. A zero now defaults to time.Now().UTC(); a non-zero now is coerced to UTC.

peer_key_rotated

Constructed by events.NewPeerKeyRotated; constant TypePeerKeyRotated = "peer_key_rotated". Appended by Manager.CompleteRotation as the last write of the rotation transaction, so a consumer observing the event is guaranteed every aggregate state change it describes has already landed. Every peer of the rotated Node observes it so each peer can update its WireGuard configuration with the rotated Node's new public key and the re-issued PSK reference. The payload carries the new public key plus the (kid, wrap_key_version) reference to the re-issued PSK row, but deliberately carries NO PSK plaintext and NO ciphertext — exactly like peer_psk_assigned. A consumer that needs the PSK reads the canonical wrapped row out of plexsphere.psk and threads (kid, wrap_key_version) into the unwrap call.

JSON payload shape:

```json
{
  "event_id": "<uuidv7>",
  "occurred_at": "<RFC3339Nano UTC>",
  "peer_id": "<uuidv7>",
  "domain_id": "<uuidv7>",
  "node_id": "<uuidv7>",
  "new_public_key": "<base64 of the 32-byte Curve25519 public key>",
  "kid": "<wrap-key identifier, e.g. psk-software-v1>",
  "wrap_key_version": 1
}
```

Validation invariants on the constructor: peer_id, domain_id, and node_id must all be non-zero; new_public_key must be exactly 32 bytes (a wrong-length value is a Curve25519 key the Node never generated); kid must be non-empty (after trim); wrap_key_version must be >= 1. The constructor stores a defensive copy of new_public_key so a caller mutation cannot reach into the event after construction. See ./key-rotation.md for the rotation lifecycle this event terminates.
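
The length check and the defensive copy can be illustrated as follows. This is a minimal sketch assuming a simplified event type: peerKeyRotated, its accessor, and the helper functions are hypothetical, and the ID, kid, and version fields are left out to keep the focus on the key bytes.

```go
package main

import "fmt"

// curve25519KeySize is the public-key length the constructor insists on.
const curve25519KeySize = 32

// peerKeyRotated is a hypothetical, trimmed-down event type.
type peerKeyRotated struct {
	newPublicKey []byte
}

// newPeerKeyRotated validates the key length and stores a private copy so
// later writes to the caller's slice cannot alter the event.
func newPeerKeyRotated(newPublicKey []byte) (*peerKeyRotated, error) {
	if len(newPublicKey) != curve25519KeySize {
		return nil, fmt.Errorf("new_public_key must be %d bytes, got %d",
			curve25519KeySize, len(newPublicKey))
	}
	cp := make([]byte, curve25519KeySize)
	copy(cp, newPublicKey)
	return &peerKeyRotated{newPublicKey: cp}, nil
}

// NewPublicKey returns a copy as well, so readers cannot mutate the event.
func (e *peerKeyRotated) NewPublicKey() []byte {
	out := make([]byte, len(e.newPublicKey))
	copy(out, e.newPublicKey)
	return out
}

// mutationReachesEvent reports whether changing the caller's slice after
// construction is visible through the event.
func mutationReachesEvent() bool {
	key := make([]byte, curve25519KeySize)
	ev, err := newPeerKeyRotated(key)
	if err != nil {
		panic(err)
	}
	key[0] = 0xFF
	return ev.NewPublicKey()[0] == 0xFF
}

// rejectsLength reports whether a key of n bytes is refused.
func rejectsLength(n int) bool {
	_, err := newPeerKeyRotated(make([]byte, n))
	return err != nil
}

func main() {
	fmt.Println(mutationReachesEvent()) // false
	fmt.Println(rejectsLength(31))      // true
}
```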

Encrypt-at-rest envelope

The *softwareWrapper in ../../../internal/mesh/peers/wrapper.go is the dev-only AES-256-GCM PSK wrapper. The wire layout the wrapper emits and the unwrap path parses is:

```text
<12-byte nonce> || <ciphertext + 16-byte GCM tag>
```

The constants pinned at the package boundary:

  • wrapKeySize = 32: the AES-256 key length. A misconfigured Secret mount surfaces as ErrWrapKeyMisconfigured from NewSoftwareWrapper, not as a confused mid-Wrap panic.
  • gcmNonceSize = 12 — the standard AES-256-GCM nonce length per NIST SP 800-38D §5.2.1.1 (96 bits). Pinning it as a package-level constant keeps the layout self-documenting.
  • GCM tag length is the cipher-suite-fixed 16 bytes; the floor ciphertext (empty plaintext) is therefore 12 + 0 + 16 = 28 bytes, which is exactly the lower bound the SQL CHECK on plexsphere.psk.wrapped_psk accepts (length BETWEEN 28 AND 4096).

Wrap(ctx, plaintext):

  1. Reads a fresh 12-byte nonce from the wrapper's random source (crypto/rand.Reader in production; tests inject deterministic readers for stable per-call assertions).
  2. Calls gcm.Seal with no additional-authenticated-data (AAD) — see the DECISION block on AAD below.
  3. Returns (<nonce>||<ct+tag>, kid, wrapKeyVersion, nil).

Unwrap(ctx, wrapped, kid, wrapKeyVersion):

  1. Rejects buffers shorter than gcmNonceSize + gcm.Overhead() (28 bytes) by wrapping ErrPSKNotFound so the caller can branch on "row present but unwrappable" the same way it branches on "row missing".
  2. Logs a slog.Warn if the supplied (kid, wrapKeyVersion) diverges from the wrapper's bound values but does NOT short-circuit the GCM-Open — the "retired wrap key tolerated for read" path leans on this tolerance during a rotation.
  3. Calls gcm.Open. A tag-check failure surfaces as an error wrapping ErrPSKNotFound so the caller distinguishes "missing row" from "row present but the bytes don't authenticate".

DECISION (carried over from wrapper.go): kid and wrap_key_version are NOT bound into the AAD. Binding them would force a rotation flow to re-wrap every PSK in place rather than carry the (kid, version) tuple in the row, and the "retired wrap key tolerated for read" rule implies the row's persisted version may not match the wrapper's currently-active version. The tolerant unwrap matches that requirement; the warn-log surfaces drift to operators without blocking the read.
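
The envelope layout and the Wrap / Unwrap steps above can be reproduced with the standard library alone. This is a minimal sketch, not the contents of wrapper.go: it omits the kid and wrapKeyVersion bookkeeping, the context parameter, and the warn-log, the function names are hypothetical, ErrPSKNotFound is redeclared locally as a stand-in, and the all-zero key is for demonstration only.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
	"io"
)

const (
	wrapKeySize  = 32 // AES-256 key length
	gcmNonceSize = 12 // 96-bit GCM nonce
)

// ErrPSKNotFound is a local stand-in for the sentinel named above.
var ErrPSKNotFound = errors.New("peers: psk not found")

// newGCM builds the AEAD from a 32-byte wrap key.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != wrapKeySize {
		return nil, fmt.Errorf("wrap key must be %d bytes, got %d", wrapKeySize, len(key))
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// wrap returns nonce || ciphertext+tag, sealed with no AAD.
func wrap(gcm cipher.AEAD, random io.Reader, plaintext []byte) ([]byte, error) {
	nonce := make([]byte, gcmNonceSize)
	if _, err := io.ReadFull(random, nonce); err != nil {
		return nil, err
	}
	// Seal appends ciphertext+tag to its first argument, here the nonce.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// unwrap splits the envelope and authenticates it.
func unwrap(gcm cipher.AEAD, wrapped []byte) ([]byte, error) {
	if len(wrapped) < gcmNonceSize+gcm.Overhead() {
		return nil, fmt.Errorf("%w: envelope shorter than nonce+tag", ErrPSKNotFound)
	}
	nonce, ct := wrapped[:gcmNonceSize], wrapped[gcmNonceSize:]
	plaintext, err := gcm.Open(nil, nonce, ct, nil)
	if err != nil {
		return nil, fmt.Errorf("%w: %v", ErrPSKNotFound, err)
	}
	return plaintext, nil
}

// roundTrip wraps and unwraps a PSK of pskLen bytes under a throwaway key.
func roundTrip(pskLen int) (wrappedLen int, ok bool) {
	gcm, err := newGCM(make([]byte, wrapKeySize)) // all-zero demo key only
	if err != nil {
		return 0, false
	}
	psk := make([]byte, pskLen)
	wrapped, err := wrap(gcm, rand.Reader, psk)
	if err != nil {
		return 0, false
	}
	got, err := unwrap(gcm, wrapped)
	return len(wrapped), err == nil && string(got) == string(psk)
}

// tamperRejected flips one bit of the tag and reports whether unwrap refuses.
func tamperRejected() bool {
	gcm, err := newGCM(make([]byte, wrapKeySize))
	if err != nil {
		return false
	}
	wrapped, err := wrap(gcm, rand.Reader, make([]byte, 32))
	if err != nil {
		return false
	}
	wrapped[len(wrapped)-1] ^= 0x01
	_, err = unwrap(gcm, wrapped)
	return errors.Is(err, ErrPSKNotFound)
}

func main() {
	n, ok := roundTrip(32)
	fmt.Println(n, ok) // 60 true: 12-byte nonce + 32-byte PSK + 16-byte tag
	fmt.Println(tamperRejected())
}
```

An empty plaintext yields exactly 28 bytes, matching the lower bound of the SQL CHECK, while a 32-byte PSK yields 60.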

The Manager additionally fails closed on a wrapper-version / active-row divergence at write time: Manager.AssignPSK resolves the active wrap-key row first, calls Wrapper.Wrap, and then asserts version == active.Version. A drift here means the boot wiring is misconfigured (or a rotation is mid-flight). Persisting a row whose wrap_key_version FK references a different version than the bytes were sealed under would silently corrupt the unwrap path, so the Manager refuses with ErrNoActiveWrapKey rather than write the divergence.
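
The write-time guard can be sketched in isolation. The names below are assumptions for illustration: wrapKeyRow, pskRow, wrapFunc, sealForActive, and fixedVersionWrapper are hypothetical, ErrNoActiveWrapKey is redeclared locally, and the fake wrapper returns placeholder bytes instead of real ciphertext.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoActiveWrapKey is a local stand-in for the sentinel named above.
var ErrNoActiveWrapKey = errors.New("peers: no active wrap key")

// wrapKeyRow is a hypothetical view of the active wrap-key ledger row.
type wrapKeyRow struct{ Version int }

// pskRow is a hypothetical view of what would be persisted.
type pskRow struct {
	Wrapped        []byte
	Kid            string
	WrapKeyVersion int
}

// wrapFunc mirrors the (wrapped, kid, version, err) return shape of Wrap.
type wrapFunc func(plaintext []byte) ([]byte, string, int, error)

// sealForActive wraps the plaintext and refuses to build a row when the
// wrapper's version diverges from the active ledger row.
func sealForActive(active wrapKeyRow, wrap wrapFunc, plaintext []byte) (pskRow, error) {
	wrapped, kid, version, err := wrap(plaintext)
	if err != nil {
		return pskRow{}, err
	}
	if version != active.Version {
		return pskRow{}, fmt.Errorf("%w: wrapper sealed under version %d, active row is version %d",
			ErrNoActiveWrapKey, version, active.Version)
	}
	return pskRow{Wrapped: wrapped, Kid: kid, WrapKeyVersion: version}, nil
}

// fixedVersionWrapper returns a fake wrapper bound to one version.
func fixedVersionWrapper(version int) wrapFunc {
	return func(plaintext []byte) ([]byte, string, int, error) {
		// Placeholder bytes only; no real encryption happens in this sketch.
		return make([]byte, 28+len(plaintext)), "psk-software-v1", version, nil
	}
}

// driftRefused reports whether a wrapper/active version pair is rejected.
func driftRefused(wrapperVersion, activeVersion int) bool {
	_, err := sealForActive(
		wrapKeyRow{Version: activeVersion},
		fixedVersionWrapper(wrapperVersion),
		make([]byte, 32),
	)
	return errors.Is(err, ErrNoActiveWrapKey)
}

func main() {
	fmt.Println(driftRefused(1, 1)) // false: versions agree
	fmt.Println(driftRefused(1, 2)) // true: wrapper lags the active row
}
```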

The software wrapper emits exactly one slog.Warn at construction time announcing its dev-only posture. The exact line the constructor emits (pinned in bootWarningMessage so tests can assert on the substring without re-specifying the text) is:

peers: software PSK wrapper is dev-only and MUST NOT be used in production

A future HSM-backed Wrapper adapter slots in here without touching Manager.

Threat model

The peers sub-context defends six classes of attack along the register-assign-deregister path. Each mitigation is implemented in one named place so a reader chasing a security claim does not have to assemble it from multiple files.

  1. Plaintext PSK exfiltration via the database. Plaintext PSK bytes never reach plexsphere.psk. The Manager calls Wrapper.Wrap exactly once per AssignPSK; only the <nonce>||<ct+tag> envelope is persisted, alongside the (kid, wrap_key_version) tuple needed to identify the wrap key. The SQL CHECK on plexsphere.psk.wrapped_psk (length BETWEEN 28 AND 4096) catches the "stored a 32-byte plaintext PSK by mistake" branch coincidentally; the broader gate is the Wrapper interface itself, which refuses to return less than nonce+tag overhead.
  2. Plaintext PSK exfiltration via the bus. The peer_psk_assigned event payload deliberately carries NO ciphertext and NO plaintext; only kid and wrap_key_version travel on the wire. A bus consumer that needs the wrapped bytes reads plexsphere.psk directly. The constant TypePeerPSKAssigned = "peer_psk_assigned" and the PeerPSKAssigned payload struct in ../../../internal/identity/tenancy/events/events.go pin the wire shape so a future contributor cannot accidentally widen it.
  3. Plaintext PSK exfiltration via in-process state. Manager.AssignPSK defers destroyPlaintext(cmd.Plaintext) so the caller's plaintext slice is wiped on every return path, whether error or success. The wrapper takes its own copy during Wrap, so zeroing the original is the last opportunity for the Manager to scrub the secret. The Manager's structured-log line for AssignPSK deliberately logs only (peer_id, domain_id, kid, wrap_key_version): the plaintext bytes never reach slog, and the wrapped ciphertext is not interesting to operators tailing the log.
  4. Wrap-key drift / rotation footgun. Manager.AssignPSK fail-closes on a wrapper-version vs active-row divergence (see the encrypt-at-rest section above) so a misconfigured boot wiring or a mid-flight rotation cannot persist a PSK row whose wrap_key_version FK references a different version than the bytes were sealed under. The composite FK psk_wrap_key_fk (domain_id, wrap_key_version) REFERENCES plexsphere.domain_nsk_wrap_key (domain_id, version) adds a second line of defence at the SQL layer: a PSK row can never reference a wrap-key version that does not exist for its Domain.
  5. Retired-key intolerance. Wrapper.Unwrap tolerates a (kid, wrap_key_version) that diverges from the wrapper's currently-bound values — it emits a slog.Warn but proceeds with gcm.Open. The "retired wrap key tolerated for read" path leans on this so that a rotation in flight does not immediately invalidate every plexd-side WireGuard handshake the relay is trying to deliver. AAD is deliberately NOT bound to (kid, wrap_key_version) for the same reason (see the encrypt-at-rest DECISION above).
  6. Audit-trail loss. The migration's Down block REFUSES the downgrade with SQLSTATE 0A000 (feature_not_supported). The wrapped-PSK ciphertext chain links every peer-to-peer WireGuard handshake to the wrap-key version that produced the ciphertext; dropping the tables is a security regression that mirrors the NSK invariant that 0008 protects. Soft-delete on both peer and psk keeps the rows available for forensic replay; a hard DELETE would orphan the outbox events that consumers replay.

Sentinel errors callers branch on (defined in ../../../internal/mesh/peers/errors.go):

| Sentinel | Path | Caller branch |
| --- | --- | --- |
| ErrPeerAlreadyRegistered | Pre-flight read or 23505 race in Register | Idempotent re-registration |
| ErrPeerAlreadyDeregistered | No-op SoftDeletePeer UPDATE in Deregister | Idempotent re-deregistration |
| ErrPSKAlreadyAssigned | 23505 against psk_live_peer_uq in AssignPSK | Refuse double-assign |
| ErrNoActiveWrapKey | No state = 'active' row, 23503 against psk_wrap_key_fk, or wrapper-version drift | Fail-closed; surface misconfiguration |
| ErrNilManagerCollaborator | NewManager with nil Store or Wrapper | Boot-time misconfig alert |
| ErrPeerNotFound | pgx.ErrNoRows on GetPeerByID / GetPeerByDomainAndNode | Translated to ErrPeerAlreadyRegistered / ErrPeerAlreadyDeregistered by Manager paths |
| ErrPSKNotFound | pgx.ErrNoRows on GetActivePSKByPeer or short/tag-failed buffer in Unwrap | Distinguish "no live PSK" from a hard repository failure |
| ErrWrapKeyMisconfigured | NewSoftwareWrapper with non-32-byte WrapKey | Boot-time misconfig alert |
| ErrEndpointObservationStale | Manager.UpdateEndpoint when reported_at falls outside the per-Domain endpoint_ttl window | Surface 400 with a dedicated problem-code so observation-replay attempts dashboard separately from clock-skew rejections |
| ErrEndpointUnparseable | Manager.UpdateEndpoint when the supplied endpoint string is not a parseable host:port | Refuse the malformed PUT at the boundary; intake handler surfaces 400 endpoint_unparseable |
| ErrEndpointClockSkew | Manager.UpdateEndpoint when reported_at drifts past MaxEndpointSkew of the server's wall clock | Refuse before any aggregate write so a forged or drifted client clock cannot smuggle a misdated observation past the freshness invariant |

Depguard rationale

The peers sub-context lives at internal/mesh/peers and is governed by a dedicated depguard rule (no-cross-context-imports-mesh-peers) in ../../../.golangci.yml. The pattern mirrors the existing no-cross-context-imports-mesh-sse and no-cross-context-imports-mesh-reachability carve-outs — one rule per allowing-context, single allow list overriding the broader internal/identity deny entry. The generic no-cross-context-imports rule excludes internal/mesh/peers/** from its files list so the two rules do not collide.

The peers publish pipeline legitimately needs a narrow set of cross-context imports the rule permits:

  • internal/access: SignerClient (Ed25519 envelope signing) for the follow-on integration that extends the publish path with operator audit signatures.
  • internal/audit — append-only audit row for peer registration, PSK assignment, and deregistration decisions.
  • internal/signing (and subtree like internal/signing/envelope) — the canonical Envelope value type and its hash primitives.
  • internal/identity/tenancy — the strongly typed tenancy.ID Domain/Node/Peer identifiers the Manager threads through every command. Allow-listed explicitly to override the broader internal/identity deny entry.
  • internal/identity/tenancy/events — the peer_registered / peer_psk_assigned / peer_deregistered event constructors and payload structs the Manager emits. Allow-listed explicitly.
  • internal/identity/nodes/nsk — wrap-key envelope precedent the PSK custody path mirrors. Allow-listed explicitly.
  • internal/mesh/sse — the Signed Event Bus the peers Manager hands envelopes to.

internal/access, internal/audit, internal/signing, and internal/mesh/sse are intentionally absent from the deny list, so they need no allow entry and listing them in the allow: block would be redundant. The three internal/identity/... allow entries are the load-bearing ones: without them the broader internal/identity deny entry would block the Manager from importing the strongly typed IDs and the event constructors it cannot function without.

The workspace alignment test ../../../tests/workspace/mesh_peers_depguard_test.go parses .golangci.yml and asserts that the carve-out's allow-list matches the documented seven import roots above. A drift between this document, the code's actual imports, and the depguard rule fails CI before review.

Cross-references

  • ./relay-fallback.md — the relay-fallback assignment surface that owns the plexsphere.peer_relay_assignment aggregate and the additive fallback_endpoint field on the peer_registered and peer_endpoint_changed payloads. The chooser's decision tree, the bridge-candidate ranking heuristic, the silent-unreachability signal (empty endpoint AND empty fallback_endpoint), and the relay-assigner sweep that recomputes assignments on a bridge reachability transition all live there.
  • ./key-rotation.md — the key-rotation workflow that owns the plexsphere.peer_key_rotation aggregate and the rotate_keys / peer_key_rotated event types. The RequestRotation / CompleteRotation Manager commands, the pending-to-completed-or-superseded lifecycle, the two-channel SSE-plus-heartbeat dispatch model, the rotation-ordering invariants, and the no-sweeper rationale all live there.
  • ./endpoints.md — the NAT endpoint intake surface that drives Manager.UpdateEndpoint and the per-Peer endpoint snapshot the relay-assigner reads when composing a peer_endpoint_changed payload.
  • ../../contexts/mesh/sse.md — the Signed SSE Event Bus the peers Manager hands envelopes to. Same audience and quadrant; the wire-level Type discriminator the bus pins (node_state_updated) is the envelope-shape consumer of the six peer event types this document defines.
  • ../../how-to/mesh/operate-peers.md — operator how-to for inspecting the live plexsphere.peer and plexsphere.psk rows, the wrap-key ledger, and the publisher metrics. Lands with the follow-on operator integration; the link is forward-referenced so this document stays the entry point once the how-to ships.
  • ../../../internal/mesh/peers/doc.go — package-level overview and the load-bearing DECISION blocks (sub-context boundary, Manager / Wrapper / Repository triple, closed event-type set).
  • ../../../.golangci.yml — the no-cross-context-imports-mesh-peers depguard rule pinning the allow-list of seven cross-context import roots.
  • ../../../tests/workspace/mesh_peers_depguard_test.go — workspace alignment test that asserts the depguard rule's allow-list matches the documented set above.
  • ../../reference/cli/plexctl/peer.md, ../../reference/cli/plexctl/mesh.md — the matching plexctl per-command references.