Appearance
Per-peer NAT endpoint intake
This document is the authoritative bounded-context reference for the NAT endpoint intake surface that ships under ../../../internal/mesh/peers/ and is exposed through PUT /v1/nodes/{id}/endpoint. It covers the ubiquitous language, the wire envelopes, the per-Peer endpoint snapshot schema, the per-Domain endpoint TTL policy, the stale-endpoint sweeper, the peer_endpoint_changed outbox event, the clock-skew admission rule, the six-gate handler ordering, the audit contract, the OpenAPI cross-reference, and the threat model.
The endpoint intake surface is the plexd → plexsphere NAT observation channel: a per-Node, NSK-authenticated PUT endpoint that stamps the agent's observed (host, port) tuple onto the addressed Peer aggregate and emits a peer_endpoint_changed event whenever the tuple differs from the prior observation or transitions out of a stale window. Anything outside that surface — the Signed SSE Event Bus that fans out the event payload, the reachability evaluator that owns the per-Node liveness verdict, the Key & Peer Manager's Register / AssignPSK / Deregister commands, the relay-fallback assignment surface that consumes the endpoint snapshot — is a collaborator the endpoint sub-context emits to or projects from, not a concern of this document. The three surfaces this document pins are the in-process command pipeline (Manager.UpdateEndpoint), the read-only side (Provider.SnapshotEndpointsForDomain), and the periodic stale-endpoint sweeper (EndpointSweeper.Sweep).
Status — partial delivery
The writer-side closure for the endpoint surface is load-bearing in the production binary:
- Per-Peer aggregate write —
Manager.UpdateEndpointresolves the prior endpoint snapshot, gates the observation through the clock-skew and per-Domain TTL windows, runs theUpdatePeerEndpointUPDATE inside a singleStore.InTx, and conditionally appends apeer_endpoint_changedoutbox event in the same transaction. - Stale-endpoint sweeper —
EndpointSweeper.Sweeppaginates thepeer_endpoint_stale_idxpartial index by per-Domain TTL, stampslast_endpoint_stale_aton each tombstoned row, and appends apeer_endpoint_changedevent with an emptyendpointstring per row. The sameStore.InTxboundary guards the batch. - HTTP intake —
PUT /v1/nodes/{id}/endpointruns the six-gate ordering (NSK auth, path-id, body cap, body decode, clock-skew, AddrPort parse) before invoking the recorder. Theendpoint_dispatch.gofail-closed gate returns501 endpoint_not_provisioneduntil every load-bearing port is wired.
One downstream surface remains deferred: the SSE relay drain for peer_endpoint_changed that surfaces the outbox row on GET /v1/nodes/{id}/events, tracked in ../../architecture/mesh-event-bus-roadmap.md. The outbox row is durable on every transition today; the wire envelope reaches a subscriber once the relay drain lands.
Load-bearing today (each detailed in its own section below): the Manager.UpdateEndpoint command path, the EndpointSweeper goroutine, the Provider.SnapshotEndpointsForDomain read path, the handler with its six refusal codes, the 0029_peer_endpoint.sql migration (four Peer columns, one Domain column, the peer_endpoint_stale_idx partial index), the MaxEndpointSkew golden constant, the per-Domain EndpointPolicy value object with its 30s floor and 1h ceiling, and the OpenAPI surface. The exact file references are in Cross-references.
Downstream consumers (relay-fallback assignment, audit log, dashboard endpoint badges) MUST treat the deferred SSE drain as scaffolding: every wire shape is stable from day one, and the peer_endpoint_changed outbox row is the single dependency they consume the observation stream from rather than the read-side projection.
Cross-references
./peers.md— the Key and Peer Manager that owns the Peer aggregate the endpoint snapshot rides on. The endpoint surface extends the same aggregate with the(last_endpoint, last_endpoint_port, last_endpoint_reported_at, last_endpoint_stale_at)quadruple; thepeer_endpoint_changedevent sits inside the closed six-event peer-side set (peer_registered,peer_psk_assigned,peer_deregistered,peer_endpoint_changed,rotate_keys,peer_key_rotated) documented there../relay-fallback.md— the relay-fallback assignment surface that composes the endpoint snapshot with the per-peer bridge-node assignment. Thefallback_endpointfield onpeer_endpoint_changed(and onpeer_registered) is the additive optional dial address plexd uses when the direct WireGuard handshake times out; the silent-unreachability signal — the composite whereendpoint == ""ANDfallback_endpoint == ""— is documented there. The endpoint-stale-tombstone path emitted by theEndpointSweeperand the relay-assigner's bridge-churn sweep are independent observers of the same per-peer state, so a silent-unreachability transition can be reconstructed from either surface's outbox stream../reachability.md— the per-Node liveness sibling. The endpoint intake and the heartbeat intake share the NSK middleware, theMaxEndpointSkew/MaxHeartbeatSkew60-second bound, and the path-id defense-in-depth gate; the two surfaces are independent observation channels on the same per-Node admission envelope../sse.md— the Signed SSE Event Bus thepeer_endpoint_changedoutbox row travels onto once the relay drain lands. The wire-levelTypediscriminator the bus pins (node_state_updated) is the envelope-shape consumer of the endpoint event.../../architecture/mesh-event-bus-roadmap.md— the deferred-wiring tracker that records the SSE relay's production-wiring gap the endpoint outbox-emitter currently rides on.../../how-to/mesh/operate-peers.md— operator how-to for inspecting the liveplexsphere.peerrows, the per-Domainendpoint_ttlknob, and the sweeper metrics. The same operator guide that covers Peer registration also covers the endpoint observation snapshot.../../../api/openapi/plexsphere-v1.yaml— OpenAPI spec forPUT /v1/nodes/{id}/endpoint, including theEndpointRequest,EndpointResponse,Problem,PermissionDeniedschemas and the six refusal codes.../../../internal/mesh/peers/endpoint.go—Manager.UpdateEndpoint, theUpdateEndpointCmd/UpdateEndpointResultshapes, therenderEndpointcanonical "host:port" formatter, and theDomainPolicyLookupport.../../../internal/mesh/peers/endpoint_sweeper.go—EndpointSweeper, the tick loop, the per-tick batch cap, and theEndpointSweeperCountermetrics port.../../../internal/mesh/peers/provider.go—Provider.SnapshotEndpointsForDomainand theNodePeerEndpointread-side projection.../../../internal/identity/tenancy/domain.go— the per-DomainEndpointPolicyvalue object, theendpointTTLMinInterval/endpointTTLMaxIntervalfloor/ceiling constants, andDefaultEndpointPolicy.../../../internal/identity/tenancy/events/events.go— thePeerEndpointChangedpayload struct, theTypePeerEndpointChanged = "peer_endpoint_changed"constant, and theNewPeerEndpointChangedconstructor.../../../internal/platform/clock/skew.go— theMaxEndpointSkewconstant and theWithinEndpointSkewhelper.../../../internal/platform/db/migrations/0029_peer_endpoint.sql— the schema migration adding the four Peer columns, the per-Domainendpoint_ttlcolumn, and thepeer_endpoint_stale_idxpartial index.../../../internal/transport/http/v1/handlers/endpoint.go— the six-gate handler body forputNodeEndpoint.../../../internal/transport/http/v1/handlers/endpoint_deps.go— theEndpointRecorder/PeerLookupports, the problem-code constants, the audit-relation constants, and theEndpointMaxBodyBytes4 KiB cap.../../../internal/transport/http/v1/handlers/endpoint_dispatch.go— the fail-closed gate that returns501 endpoint_not_provisioneduntil every load-bearing port is wired.../../../cmd/plexsphere/peers_factory_prod.go— the composition root that wiresManager.UpdateEndpointbehind theEndpointRecorderadapter, the pool-boundPeerLookupadapter, and theEndpointSweepergoroutine.
Ubiquitous language
The terms below travel verbatim across the Go code, the SQL schema, the OpenAPI contract, the outbox event payloads, the audit log, the structured-log attributes, and the per-Domain audit trail. Keep the names stable in error messages and operator-facing tooling.
| Term | Definition |
|---|---|
| Endpoint observation | A (host, port) tuple plexd reports via PUT /v1/nodes/{id}/endpoint. The wire envelope is EndpointRequest{endpoint, nat_type, reported_at}. The ingestion stamps the (host, port, reported_at) triple on the addressed Peer row. |
| Endpoint | The publicly-observed host:port wire form. Stored as the typed pair (last_endpoint inet, last_endpoint_port int) on plexsphere.peer; the wire string is composed at the boundary by renderEndpoint, which brackets IPv6 hosts via netip.AddrPort.String() semantics. |
| reported_at | The wall-clock instant the agent attached to its PUT body. Compared against the server's now via clock.WithinEndpointSkew — |delta| > MaxEndpointSkew rejects with 400 endpoint_clock_skew. Also compared against now() - TTL to refuse observations older than the per-Domain freshness window. |
| Stale | An endpoint observation older than now() - EndpointPolicy.TTL for its Domain. The sweeper stamps last_endpoint_stale_at on the Peer row and emits a peer_endpoint_changed event with an empty endpoint string so downstream consumers know to fall back to the relay assignment. |
| EndpointPolicy | The per-Domain value object ({TTL}) on the Tenancy aggregate that drives the freshness window. Persisted as the endpoint_ttl INTERVAL column on plexsphere.domains. |
| EndpointPolicy.TTL | The per-Domain endpoint freshness window. Floor ≥ 30s (so a heartbeat-cadence agent always refreshes inside the window), ceiling ≤ 1h (mirrors the reachability policy ceilings). Default 5m. |
| MaxEndpointSkew | The inclusive upper bound on |client_reported_at − server_now| accepted by the endpoint handler. Pinned at 60 * time.Second in ../../../internal/platform/clock/skew.go against a golden literal; widening is a security regression, not a tuning knob. |
| NSK (Node Session Key) | The Node's per-session bearer credential, presented as Authorization: Bearer nsk_<env>_<...>. The NSKAuthenticator middleware resolves it to a tenancy.Node and refuses path-id ≠ resolved-Node-id with 403 node_id_mismatch. Shared with the heartbeat surface. |
| Manager.UpdateEndpoint | The sole command-side entry point that stamps a fresh observation onto the Peer aggregate and conditionally emits a peer_endpoint_changed event. Lives in ../../../internal/mesh/peers/endpoint.go. |
| EndpointSweeper | The periodic background ticker that paginates the peer_endpoint_stale_idx partial index, stamps last_endpoint_stale_at on each row past TTL, and emits a tombstone peer_endpoint_changed event per stamped row. Lives in ../../../internal/mesh/peers/endpoint_sweeper.go. |
| Provider.SnapshotEndpointsForDomain | The read-side projection that returns []NodePeerEndpoint for every live peer in a Domain whose observation is fresh (last_endpoint_stale_at IS NULL and last_endpoint IS NOT NULL). The relay-fallback assignment surface consumes the snapshot. |
| DomainPolicyLookup | The narrow read seam Manager.UpdateEndpoint consults to resolve the per-Domain EndpointPolicy. Production wires a sqlc-bound adapter against plexsphere.domains.endpoint_ttl; a nil at construction falls back to a stub returning tenancy.DefaultEndpointPolicy. |
| EndpointRecorder | The transport-layer port the handler depends on to persist an observation. Production binds it to Manager.UpdateEndpoint via a small adapter in cmd/plexsphere; tests substitute a recording fake. |
| PeerLookup | The transport-layer read-only port that translates the path-id (a Node id) to the live Peer row's id. Held as a separate port from EndpointRecorder so the writer fake stays trivial; the production adapter is a raw pgx query against plexsphere.peer. |
| AuditRelationNodeEndpointIngest | The canonical Relation string the endpoint handler stamps on every ingestion-phase audit row: node_endpoint.record. Stable so audit consumers can filter by relation without parsing free-form text. |
| AuditRelationNodeEndpointPathGate | The canonical Relation string the endpoint handler stamps when its defense-in-depth path-id gate fires: node_endpoint.path_gate. Distinct from the ingestion relation so dashboards can alert on "middleware was bypassed but handler caught it" without conflating it with ingestion-phase entries. |
| peer_endpoint_changed | The bare snake_case outbox event literal the Manager and the sweeper emit. One member of the closed six-event peer-side set documented in ./peers.md — peer_registered, peer_psk_assigned, peer_deregistered, peer_endpoint_changed, rotate_keys, peer_key_rotated — the wire contract pins. |
Wire shape — PUT /v1/nodes/{id}/endpoint
The request envelope is the EndpointRequest value type defined in ../../../internal/transport/http/v1/handlers/endpoint.go, mirrored on the OpenAPI spec at ../../../api/openapi/plexsphere-v1.yaml under components.schemas.EndpointRequest.
| Field | Type | Meaning |
|---|---|---|
endpoint | string ("host:port") | The publicly-observed (host, port) tuple. Parsed via netip.ParseAddrPort — an unparseable string, a port outside 1..65535, or an unparseable IPv6 host rejects with 400 endpoint_unparseable. |
nat_type | string (cone / restricted / port_restricted / symmetric / unknown) | The agent-classified NAT traversal posture. The handler does NOT branch on the value — it is forwarded verbatim for downstream observability. |
reported_at | RFC 3339 timestamp | The agent's local wall-clock at observation time. Compared against the server's now via clock.WithinEndpointSkew — |delta| > MaxEndpointSkew rejects with 400 endpoint_clock_skew. Also compared against now() - TTL; an observation older than the per-Domain TTL rejects with the same 400 endpoint_clock_skew code (with an audit reason disambiguating "stale" from "skew"). |
The success response envelope is the EndpointResponse value type in the same file.
| Field | Type | Meaning |
|---|---|---|
accepted_at | RFC 3339 timestamp | The server-side admission instant the recorder stamped on the row. Matches last_endpoint_reported_at for the row just written. |
stale_after | RFC 3339 timestamp | accepted_at + EndpointPolicy.TTL. The agent uses this to schedule the next observation before the sweeper would tombstone the row. |
The six refusal codes the handler emits (Problem-Details code field):
| Code | HTTP | Trigger | Source file |
|---|---|---|---|
endpoint_clock_skew | 400 | Two distinct semantic flavours share one wire code. Clock-drift flavour — |client_reported_at − server_now| > MaxEndpointSkew (60s); the agent's local clock has drifted off server time. Stale-observation flavour — reported_at is fresh against the skew window but older than the per-Domain endpoint TTL; the agent's keep-alive cadence has lapsed. Operators disambiguate via the audit row's reason field — the recorder-side stale arm stamps reported_at older than per-Domain endpoint TTL while the handler-side drift arm stamps reported_at outside MaxEndpointSkew window. Agent remediation is the same in both flavours (refresh reported_at from a fresh time.Now() and retry), so the two share one wire code; widening the wire surface would add a code agents cannot act on differently. | endpoint.go step 5; endpoint.go handleEndpointRecordError for the recorder-side stale path |
endpoint_unparseable | 400 | netip.ParseAddrPort(req.Endpoint) fails (no colon, non-numeric port, port outside 1..65535, unparseable IPv6); OR the recorder's defense-in-depth aggregate invariant rejects the AddrPort. | endpoint.go step 6; endpoint.go handleEndpointRecordError |
malformed_endpoint_request | 400 | JSON decode fails (invalid JSON, unknown field, missing required field). | endpoint.go step 4 |
endpoint_body_too_large | 413 | Body exceeds EndpointMaxBodyBytes (4 KiB) — caught by http.MaxBytesReader before the JSON decoder ever sees the bytes. | endpoint.go step 3 |
node_id_mismatch | 403 | URL path id ≠ resolved NSK Node id (defense-in-depth; the NSK middleware also runs this gate). | endpoint.go step 2 |
endpoint_peer_not_found | 404 | PeerLookup.ResolveLivePeer reports the authenticated Node has no live Peer row in any Domain. | endpoint.go step 7 |
endpoint_peer_gone | 410 | The recorder returned ErrEndpointPeerAlreadyDeregistered — the Peer row was soft-deleted between the handler's PeerLookup admission and the recorder's UPDATE. | endpoint.go handleEndpointRecordError |
Two adjacent refusal arms are owned by collaborators rather than the handler body:
401 nsk_revoked— the NSK middleware in../../../internal/identity/authn/middleware/nsk.gorefuses a missing, malformed, or revoked NSK envelope before the handler runs.501 endpoint_not_provisioned— theendpoint_dispatch.gofail-closed gate fires when ANY ofEndpointRecorder,NSKResolver,NodeRepo,PeerLookupis nil at composition time, so a half-wired posture cannot admit traffic past gates that have not yet been wired.
Why 404 and 410 exist as endpoint-specific codes
endpoint_peer_not_found (404) and endpoint_peer_gone (410) are operator-visible response codes that complement the four-code refusal taxonomy quoted in the original task acceptance criteria. They are NOT scaffolding — both arms cover legitimate lifecycle states that the NSK-bearer surface must distinguish so a plexd agent can take the right remediation:
404 endpoint_peer_not_found— the NSK authenticates a Node the schema knows about, butPeerLookup.ResolveLivePeerreturns zero rows: the Node has not yet been bound to a Peer aggregate (a fresh enrollment whoseManager.Registerflow has not yet landed) or the Peer was drained out of the Domain after the NSK was minted. A 404 here tells the agent its NSK is valid but the Peer side of its lifecycle has not yet caught up — the agent retries on its bootstrap cadence rather than treating the observation as poisoned. Distinguishing this from a 403node_id_mismatch(cross-Node impersonation) lets a leaked NSK on a missing Peer surface differently from a leaked NSK on a sibling Peer.410 endpoint_peer_gone— the recorder returnedpeers.ErrPeerAlreadyDeregistered: the Peer existed when the handler's PeerLookup gate ran but was soft-deleted between the admission read and the recorder's UPDATE. A 410 (rather than a 404 retry) tells the agent the row will never accept further observations — the correct action is to stop reporting against the path-id and either acquire a fresh NSK against a new Peer or exit the loop. The narrowness of the lookup-vs-update race makes 410 the right code shape: the resource was there, then went permanently away.
The OpenAPI surface at ../../../api/openapi/plexsphere-v1.yaml documents both codes alongside the four task-AC codes; they are part of the stable wire contract and the v1 spec will not narrow them without a deprecation cycle.
Six-gate handler ordering
endpoint.go runs the handler body as a strict, cheapest-first ordering. Each gate is cheaper than the next; reversing any pair would either burn a SQL round-trip on a malformed envelope or admit a wire-shape failure past a security boundary.
- NSK admission read — the middleware ran upstream;
step 1reads the resolved Node off the request context. A missing Node defensively returns401 unauthorizedso a misconfigured route that mounted the handler without the middleware fails closed. - Path-id gate —
middleware.MatchesPathID(node, id). A mismatch stampsAuditRelationNodeEndpointPathGate+AuditOutcomeNodeIDMismatch(distinct from the ingestion-phase relation so dashboards can split admission-versus-ingestion) and returns403 node_id_mismatch. - Body cap —
http.MaxBytesReader(w, r.Body, EndpointMaxBodyBytes). An oversize body trips a*http.MaxBytesErrorfrom the decoder; the handler maps it to413 endpoint_body_too_largewith the (limit, body_size>=) pair in the audit reason. - Body decode —
json.NewDecoder(...).DisallowUnknownFields(). A malformed envelope returns400 malformed_endpoint_requestbefore the clock-skew gate or any persistence read runs. - Clock-skew gate —
clock.WithinEndpointSkew(req.ReportedAt, now). A drifted or forgedreported_atreturns400 endpoint_clock_skewbefore the AddrPort parse or peer lookup. - AddrPort parse —
netip.ParseAddrPort(req.Endpoint). A pure-CPU validation that rejects unparseable strings before any database touch. Returns400 endpoint_unparseable.
After the six gates pass the handler runs the peer lookup (step 7) and the recorder call (step 8). The recorder owns the transactional surface — every failure inside the Store.InTx boundary surfaces through handleEndpointRecordError's sentinel switch, which maps peers.ErrEndpointObservationStale, peers.ErrEndpointUnparseable, peers.ErrEndpointClockSkew, and peers.ErrPeerAlreadyDeregistered (translated via the peersEndpointRecorderAdapter) onto the wire-stable HTTP codes listed in the refusal table.
DECISION (carried over from endpoint.go): PeerLookup runs AFTER the AddrPort parse gate. Alternative considered: resolve the peer first so a non-existent Node-to-Peer mapping fails fast. REJECTED because the cheaper pure-CPU AddrPort parse rejects a malformed wire envelope without touching the database; reversing the order would burn a SQL round-trip on every malformed request and let an attacker scanning for unparseable endpoints discover Node presence via the relative latency of the two refusal paths.
Schema
The endpoint surface extends the existing Peer aggregate from ./peers.md and the existing Domain aggregate. The authoritative source is the migration at ../../../internal/platform/db/migrations/0029_peer_endpoint.sql; the column meanings below paraphrase that file's DECISION blocks.
plexsphere.peer — endpoint observation columns
| Column | Type | Meaning |
|---|---|---|
last_endpoint | inet NULLABLE | The publicly-observed IP address from the most recent admitted observation. NULL means "no observation yet". The schema uses Postgres' native inet type rather than text so an operator can run SELECT * FROM plexsphere.peer WHERE last_endpoint << inet '10.0.0.0/8' for NAT-outage triage. |
last_endpoint_port | integer NULLABLE CHECK (... BETWEEN 1 AND 65535) | The publicly-observed transport port. NULL means "no observation yet". The CHECK pins the RFC 6056 port range at the storage boundary. |
last_endpoint_reported_at | timestamptz NULLABLE | The agent's reported_at from the admitted observation. The sweeper compares now() - last_endpoint_reported_at against the per-Domain TTL. |
last_endpoint_stale_at | timestamptz NULLABLE | The sweeper's tombstone timestamp. NULL means "still fresh (or never observed)"; non-NULL means "the sweeper stamped this row as stale and emitted a tombstone event". An UpdateEndpoint for a row with non-NULL last_endpoint_stale_at clears the stamp on the next successful write (un-stale recovery). |
Index:
peer_endpoint_stale_idx ON plexsphere.peer (last_endpoint_reported_at) WHERE deregistered_at IS NULL AND last_endpoint_stale_at IS NULL— backs the sweeper's range scan over rows still in the "watchable" cohort. Deregistered rows and already-stale rows are excluded by the partial predicate so the index stays small even as the tenancy grows.
plexsphere.domains — per-Domain endpoint TTL
| Column | Type | Meaning |
|---|---|---|
endpoint_ttl | interval NOT NULL DEFAULT '5 minutes' | The per-Domain endpoint freshness window. The schema-side default mirrors tenancy.DefaultEndpointPolicy.TTL; the application-layer EndpointPolicy value object enforces the 30s floor and 1h ceiling so a hand-edited row that violates the invariants fails fast at Hydrate time. |
The migration's Down block REFUSES the downgrade with SQLSTATE 0A000 (feature_not_supported). The four endpoint observation columns and the per-Domain endpoint_ttl carry the forensic evidence chain that links every NAT rebind to the wall-clock window the relay fallback was assigned in; dropping them is a regulatory retention regression. This Down stance mirrors 0008_node_secret_keys.sql, 0010_node_reachability.sql, 0011_audit_log.sql, and 0012_peers.sql — the four prior migrations that protect security- and audit-critical material with the same RAISE 0A000 refusal.
TTL policy & per-Domain knob
The per-Domain EndpointPolicy value object on the Tenancy aggregate (../../../internal/identity/tenancy/domain.go) carries one threshold. Default and invariant gates:
| Threshold | Default | Floor | Ceiling | Invariant |
|---|---|---|---|---|
TTL | 5m | ≥ 30s | ≤ 1h | Heartbeat-cadence agent (30s default) must be able to refresh inside the window |
The default is pinned by DefaultEndpointPolicy in domain.go:
go
var DefaultEndpointPolicy = EndpointPolicy{
TTL: 5 * time.Minute,
}A fully-zero EndpointPolicy{} on NewDomain / Hydrate is treated as the DEFAULT signal and replaced with DefaultEndpointPolicy at build time. resolveEndpointPolicy rejects an out-of-range TTL via errInvariant("", "Domain", "EndpointPolicy.TTL", "...") so a hand-edited Domain row that violates the floor or ceiling fails at Hydrate.
DECISION (carried over from domain.go): the policy is a single field rather than a tuple. Alternative considered: a sentinel "use default" boolean alongside TTL. REJECTED because a zero duration is already the well-formed signal in Go and the downstream INTERVAL column carries its own NOT NULL DEFAULT, so a boolean flag would be redundant scaffolding. The IsZero / default-injection contract mirrors the ReachabilityPolicy pattern so a reader of buildDomain sees the same shape on both policies.
The persistence layer mirrors the value object on plexsphere.domains.endpoint_ttl as an interval column defaulted to '5 minutes' at the schema level (0029_peer_endpoint.sql). The Tenancy aggregate's Hydrate boundary re-validates the bound on load.
Stale-endpoint sweeper
The EndpointSweeper in ../../../internal/mesh/peers/endpoint_sweeper.go is the periodic background ticker that drives the tombstone path — the only writer of last_endpoint_stale_at. The composition root in ../../../cmd/plexsphere/peers_factory_prod.go wires it alongside the Manager; an unwired (nil) Store is rejected at construction with ErrNilEndpointSweeperCollaborator.
Tick loop
Start(ctx) drives Sweep(ctx) on a fixed-cadence ticker. The default interval is endpointSweeperDefaultInterval = 1 * time.Minute (EndpointSweeperConfig.Interval overrides). One minute matches the partial-index range scan's amortised cost and keeps the sweeper from staging a torrent of tombstone events on a healthy fleet. Per-tick errors are logged but never abort the loop — a transient outage of the Postgres pool must not silently park the sweeper.
Per-tick procedure
Each Sweep(ctx) call:
- Reads
now = clock()via the injectedClockport (a virtual clock in tests so a tick advances "now" without sleeping the test goroutine). - Opens a single
Store.InTx— the whole batch lands as one transaction so a partial failure rolls back the entire batch (no half-stamped rows, no orphan tombstone events). - Calls
repo.ListStalePeerEndpoints(ctx, now, batchSize)— the query paginates thepeer_endpoint_stale_idxpartial index. TheendpointSweeperDefaultBatchSize = 256per-tick fan-out balances throughput (one tick clears 256 staleness transitions) against per-tx wall-clock (256 outbox INSERTs land before the tx commits). - For each returned row:
- Calls
repo.StampPeerEndpointStale(ctx, peerID, now)to writelast_endpoint_stale_at = now. - Constructs a tombstone
PeerEndpointChangedevent withEndpoint = ""(the stale-tombstone signal),PreviousEndpoint = renderEndpoint(prior), andEndpointReportedAt = row.LastEndpointReportedAt. - Calls
repo.AppendOutboxEvent(...)to enqueue the event in the same tx.
- Calls
counter.IncProcessed(len(rows))andcounter.IncStaleStamped(tombstoned)for observability.
Partial-index paging
The peer_endpoint_stale_idx partial b-tree on (last_endpoint_reported_at) WHERE deregistered_at IS NULL AND last_endpoint_stale_at IS NULL keeps the working set small: deregistered Peers and already-stale rows are excluded, so the sweeper's range scan never has to skip them on every tick. Alternatives considered (a plain b-tree over the full peer table; a sibling plexsphere.peer_endpoint_observation history table) are documented in the migration's DECISION block — both REJECTED for the indexing cost and the redundancy with the outbox event log.
Single-writer posture
The DECISION block at the top of endpoint_sweeper.go records why the sweeper is the SINGLE WRITER of last_endpoint_stale_at — the intake handler never flips the tombstone bit; only the un-stale recovery side of Manager.UpdateEndpoint clears it. The trade-off mirrors the reachability evaluator's read-path-rejector posture documented in ./reachability.md: a single writer per row keeps the transactional surface lockless, and the tombstone-vs-fresh state machine has exactly one home.
Event types
The endpoint surface emits exactly one outbox event type. The literal below is the bare snake_case string the relay routes endpoint events on; downstream consumers switch on this exact literal. The literal is one member of the closed six-event peer-side set (peer_registered, peer_psk_assigned, peer_deregistered, peer_endpoint_changed, rotate_keys, peer_key_rotated) documented in ./peers.md; the six together are the full peer-side wire contract.
peer_endpoint_changed
peer_endpoint_changed
Constructed by events.NewPeerEndpointChanged; constant TypePeerEndpointChanged = "peer_endpoint_changed". Emitted on four paths:
Manager.UpdateEndpointon the first-ever observation for the Peer (prior.LastEndpointnot valid).Manager.UpdateEndpointon every change of the(endpoint, port)tuple from the prior observation.Manager.UpdateEndpointon the un-stale recovery edge — the prior row hadlast_endpoint_stale_atnon-zero and the new observation cleared it.EndpointSweeper.Sweepon every tombstone — the per-rowlast_endpoint_stale_atstamp is paired 1:1 with apeer_endpoint_changedevent whoseEndpointis the empty string.
A same-(endpoint, port) refresh on a row that was NOT stale before the write is deliberately silent — the UPDATE lands so last_endpoint_reported_at stays fresh against the sweeper's TTL math, but no outbox row is enqueued. Downstream consumers do not need a wake-up for an identical refresh.
JSON payload shape:
json
{
"event_id": "<uuidv7>",
"occurred_at": "<RFC3339Nano UTC>",
"peer_id": "<uuidv7>",
"domain_id": "<uuidv7>",
"node_id": "<uuidv7>",
"endpoint": "<host:port or empty string>",
"endpoint_reported_at": "<RFC3339Nano UTC>",
"previous_endpoint": "<host:port or empty string>",
"fallback_endpoint": "<host:port, omitted when no bridge candidate is available>"
}The three empty-string variants the constructor explicitly allows:
endpoint == ""is the stale-tombstone signal emitted by the sweeper when an observation passes TTL without refresh.previous_endpoint == ""is the first-observation signal — there is no prior endpoint to carry through.fallback_endpoint == ""is the no-bridge-candidate signal where the relay-assigner could not pick a healthy bridge for this peer. Dropped from the JSON wire-form via theomitemptytag.
Validation invariants on the constructor: peer_id, domain_id, and node_id must all be non-zero; endpoint_reported_at must be set — a zero observation timestamp would silently desync the bus from the SQL row. The endpoint and previous strings MAY be empty.
The payload carries an additive optional fallback_endpoint string — the dial address plexd uses when the direct WireGuard handshake times out, sourced from the bridge Node the relay- assigner chose for this peer. The field is dropped from the JSON wire-form by the omitempty tag when no bridge candidate is available, so a legacy consumer that pre-dates the relay-fallback surface observes the original 8-field payload shape byte-for-byte. The two empty-string fields compose to form the silent-unreachability signal: endpoint == "" AND fallback_endpoint == "" means the direct path has gone stale AND no relay-fallback is available — the plexd agent has nowhere left to dial. The relay-assigner that owns the fallback_endpoint field and the rules for composing it with the endpoint snapshot are documented in ./relay-fallback.md.
Read-side projection
Provider.SnapshotEndpointsForDomain in ../../../internal/mesh/peers/provider.go returns []NodePeerEndpoint — the read-side projection of every live Peer in a Domain that carries a fresh observation. Stale rows (last_endpoint_stale_at IS NOT NULL) and first-observation-pending rows (last_endpoint IS NULL) are filtered out at the SQL boundary so the slice always carries an actionable observation set.
| Field | Type | Meaning |
|---|---|---|
NodeID | tenancy.ID | The peer's anchored Node id. Mirrors NodePeer.NodeID from the bootstrap snapshot so a relay-fallback consumer can join the two projections by Node id. |
LastEndpoint | netip.Addr | The NAT-observed IP. |
LastEndpointPort | int | The NAT-observed transport port. Pinned to 1..65535 by the SQL CHECK. |
LastEndpointReportedAt | time.Time | The agent-reported observation instant. Lets the consumer prefer the freshest observation when two snapshots arrive close together. |
DECISION (carried over from provider.go): NodePeerEndpoint is a distinct peers-package value object rather than a reuse of NodePeer or registration.Peer. The endpoint projection's shape (no MeshIP, no PublicKey, has reported_at) is orthogonal to the bootstrap projection; folding both onto one type would force the bootstrap path to carry endpoint fields it does not consume, and a downstream relay-fallback consumer to carry MeshIP / PublicKey it does not need.
excludeNodeID may be the zero value — a future audit-replay tool that wants the complete snapshot passes the zero value; the relay-fallback assignment surface passes the address of the consuming Node so the agent does not receive its own observation back.
Clock-skew rules
The endpoint handler admits an observation only when |server_now − client_reported_at| ≤ MaxEndpointSkew (60s). The constant is pinned in ../../../internal/platform/clock/skew.go against a golden literal (the test file pins the value at 60 * time.Second so an accidental adjustment surfaces in code review). Widening the bound is a security regression, not a tuning knob — see the file-level DECISION block.
The WithinEndpointSkew(t1, t2) helper is symmetric (callers may pass server and client time in either order) and inclusive at the boundary (delta == MaxEndpointSkew admits). The implementation branches on time.Time.Before to keep the subtraction in the non-negative half of the duration range — the same overflow-safety pattern documented for WithinSkew in ./reachability.md.
DECISION (carried over from skew.go): MaxEndpointSkew is pinned to a literal 60 * time.Second rather than aliased to MaxHeartbeatSkew. The two surfaces share a numeric value today but evolve independently — a future tightening of MaxHeartbeatSkew (because the heartbeat handler ships a real rate limiter) must NOT silently tighten the endpoint window the way a derived const would. Aliasing would hide that the two gates are independent security boundaries.
Manager.UpdateEndpoint runs the skew gate inside its own command body (defense-in-depth — the handler's gate already ran upstream) AND runs a second gate against now() - per-Domain TTL so a reported_at within the skew window but older than the freshness TTL surfaces as ErrEndpointObservationStale rather than landing the write.
Audit contract
Every decision branch in the endpoint surface emits exactly one audit.Entry per layer. The endpoint surface has TWO layers — the NSK middleware (admission) and the handler (ingestion) — so a granted observation produces TWO rows from the operator's perspective: one from the middleware (node_heartbeat.authenticate when the same middleware path-gates the endpoint route) and one with relation = node_endpoint.record, outcome = granted from the handler. Audit consumers therefore SHOULD NOT expect a 1:1 mapping between PUT requests and audit rows; they SHOULD filter by relation to scope to the layer they care about.
A nil AuditSink is tolerated — the slog lines stand in for observability — but the security gates still execute.
The relation strings live in ../../../internal/transport/http/v1/handlers/endpoint_deps.go:
| Constant | Value | Layer / origin |
|---|---|---|
AuditRelationNodeEndpointIngest | node_endpoint.record | Handler ingestion phase: clock-skew, malformed body, body-too-large, AddrPort parse, peer lookup, recorder failure, granted. |
AuditRelationNodeEndpointPathGate | node_endpoint.path_gate | Handler defense-in-depth path-id gate. Distinct relation so dashboards can alert on "middleware was bypassed but handler caught it" without conflating it with ingestion-phase entries. |
The closed AuditOutcome set is enumerated in ../../../internal/transport/http/v1/handlers/events_deps.go. The endpoint surface uses six distinct values so audit consumers triage clock-spoofing from malformed-body from invariant-violation from middleware-bypass from internal-error WITHOUT parsing free-form reason strings:
| Value | Emitter / cause |
|---|---|
granted | Every layer on success. |
node_id_mismatch | Handler path-id gate (defense-in-depth). |
clock_skew | Handler clock-skew gate (reported_at outside MaxEndpointSkew); recorder-side ErrEndpointObservationStale (older than per-Domain TTL); recorder-side ErrEndpointClockSkew (defense-in-depth). |
malformed_request | Handler JSON-decode arm; handler AddrPort parse arm. |
invariant_violation | PeerLookup reports no live Peer row; recorder returned ErrEndpointUnparseable or ErrEndpointPeerAlreadyDeregistered. |
internal_error | PeerLookup or recorder returned a non-sentinel error (Postgres outage, adapter bug). |
insufficient_relation | Body-too-large path uses this outcome (mirrors the heartbeat surface's body-size contract; the handler does not have a dedicated body-size outcome). |
| Branch | Source | Relation | Outcome | Reason |
|---|---|---|---|---|
| Endpoint observation admitted | endpoint.go step 9 | node_endpoint.record | granted | endpoint observation recorded |
| Path-id mismatch (defense-in-depth) | endpoint.go step 2 | node_endpoint.path_gate | node_id_mismatch | resolved Node id does not match URL path id (defense-in-depth) |
| Body too large | endpoint.go step 4 (http.MaxBytesError arm) | node_endpoint.record | insufficient_relation | body exceeds EndpointMaxBodyBytes: limit=<N> body_size_bytes>=<N+1> |
| Malformed body | endpoint.go step 4 | node_endpoint.record | malformed_request | malformed endpoint envelope: <reason> |
| Clock-skew refusal | endpoint.go step 5 | node_endpoint.record | clock_skew | reported_at outside MaxEndpointSkew window |
| Unparseable endpoint | endpoint.go step 6 | node_endpoint.record | malformed_request | unparseable endpoint: <reason> |
| Peer lookup infra failure | endpoint.go step 7 | node_endpoint.record | internal_error | peer lookup failed: <reason> |
| No live Peer for Node | endpoint.go step 7 | node_endpoint.record | invariant_violation | no live Peer row resolves for the authenticated Node |
| Recorder clock-skew (defense-in-depth) | endpoint.go handleEndpointRecordError | node_endpoint.record | clock_skew | recorder clock-skew invariant violated: <reason> |
| Recorder observation stale (older than TTL) | endpoint.go handleEndpointRecordError | node_endpoint.record | clock_skew | reported_at older than per-Domain endpoint TTL: <reason> |
| Recorder endpoint unparseable (defense-in-depth) | endpoint.go handleEndpointRecordError | node_endpoint.record | invariant_violation | recorder endpoint invariant violated: <reason> |
| Recorder peer already deregistered | endpoint.go handleEndpointRecordError | node_endpoint.record | invariant_violation | target Peer was deregistered between lookup and update: <reason> |
| Recorder internal error | endpoint.go handleEndpointRecordError | node_endpoint.record | internal_error | endpoint persistence failed: <reason> |
OpenAPI surface reference
The wire surface is pinned in ../../../api/openapi/plexsphere-v1.yaml under the mesh tag. The relevant operationIds and schemas:
| OpenAPI artefact | Where in the spec | Go counterpart |
|---|---|---|
PutNodeEndpoint operation | paths./v1/nodes/{id}/endpoint.put | Handlers.PutNodeEndpoint in endpoint_dispatch.go (gated entry) and putNodeEndpoint in endpoint.go (body). |
EndpointRequest schema | components.schemas.EndpointRequest | handlers.EndpointRequest in endpoint.go. |
EndpointResponse schema | components.schemas.EndpointResponse | handlers.EndpointResponse in endpoint.go. |
Problem schema | components.schemas.Problem (shared) | The six endpoint refusal codes are endpoint_clock_skew, endpoint_unparseable, malformed_endpoint_request, endpoint_body_too_large, endpoint_peer_not_found, endpoint_peer_gone; the 501 fail-closed stub emits endpoint_not_provisioned; the 401 / 403 arms emit nsk_revoked / node_id_mismatch. |
PermissionDenied schema | components.schemas.PermissionDenied (shared) | The 403 node_id_mismatch arm uses the richer PermissionDenied shape that carries reason, relation_path, and correlation_id alongside the bare Problem fields. |
The OpenAPI byte-equality drift gate at ../../../tests/workspace/openapi_drift_test.go asserts that the source spec and the embedded mirror under internal/transport/http/v1/handlers/plexsphere-v1.yaml are byte-identical, so a spec drift surfaces at go test time, not at runtime.
Threat model
The endpoint surface mitigates four classes of attack along the plexd → plexsphere ingestion path. Each mitigation is implemented in a single, named place so a reader chasing a security claim does not have to assemble it from multiple files.
- Observation-replay via clock spoofing. A compromised plexd with a forged client clock could replay a stale endpoint observation (masking that the agent has fallen silent on a NAT rebind) or pin a fresh-but-future timestamp (so the observation outlives its real reported_at window). Mitigation: the handler admits the observation only when
|reported_at − server_now| ≤ MaxEndpointSkew (60s)— pinned at../../../internal/platform/clock/skew.go— andManager.UpdateEndpointadditionally rejects anyreported_atolder thannow() - per-Domain TTLasErrEndpointObservationStaleso an observation that drifted inside the skew window but past the freshness window cannot smuggle a stale write past the sweeper. Both gates fire BEFORE the per-Peer aggregate's tx surface. - Cross-Node impersonation. A Node with a valid NSK could attempt to overwrite a sibling's endpoint by PUTting to
/v1/nodes/{otherId}/endpoint. Mitigation: the sameNSKAuthenticatormiddleware the heartbeat surface uses (../../../internal/identity/authn/middleware/nsk.go) resolves the bearer to atenancy.Nodeand enforces a path-id-equals-resolved-Node-id gate; a mismatch surfaces as403 node_id_mismatch. The handler runs the same check as defense-in-depth (step 2) so a misconfigured router that mounts the handler without the middleware still rejects, AND the distinctAuditRelationNodeEndpointPathGaterelation lets the audit log distinguish "middleware was bypassed but handler caught it" from a normal ingestion-phase entry. - Pathological-body resource exhaustion. A hostile or buggy agent could stream an unbounded payload to tie up the JSON decoder or starve concurrent requests. Mitigation: the handler caps the body at
EndpointMaxBodyBytes = 4 KiBviahttp.MaxBytesReaderBEFORE the decoder ever sees the bytes (step 3). An overflow surfaces as413 endpoint_body_too_largewith a stablecodeso log scrapers can alert on the body-cap floor. - Forensic-audit-trail loss. The migration's Down block REFUSES the downgrade with SQLSTATE
0A000. The four endpoint observation columns onplexsphere.peerand the per-Domainendpoint_ttlcolumn onplexsphere.domainscarry the forensic evidence chain that links every NAT rebind to the wall-clock window the relay fallback was assigned in; dropping the columns is a regulatory retention regression that mirrors the NSK invariant0008protects. Soft-tombstone onlast_endpoint_stale_at(rather than overwriting the prior endpoint) keeps the audit chain intact through every sweeper tick.
Sentinel errors callers branch on (defined in ../../../internal/mesh/peers/errors.go):
| Sentinel | Path | Caller branch |
|---|---|---|
ErrEndpointClockSkew | Manager.UpdateEndpoint clock-skew gate | Fail-closed; surface 400 endpoint_clock_skew |
ErrEndpointObservationStale | Manager.UpdateEndpoint TTL gate (reported_at older than now() - TTL) | Fail-closed; surface 400 endpoint_clock_skew with a distinct audit reason |
ErrEndpointUnparseable | Manager.UpdateEndpoint AddrPort invariant | Fail-closed; surface 400 endpoint_unparseable (defense-in-depth against the handler-side parse) |
ErrPeerAlreadyDeregistered | Manager.UpdateEndpoint after-resolve gate; UPDATE no-op race | Surface 410 endpoint_peer_gone so the agent stops reporting against the gone row |
ErrNilEndpointSweeperCollaborator | NewEndpointSweeper with nil Store | Boot-time misconfig alert |
What this context is NOT
This context ships the per-Node NAT endpoint intake surface and the per-Peer endpoint observation snapshot. The following adjacent concerns are explicit non-goals — each is owned by a downstream context:
- Relay-fallback assignment. The relay-fallback orchestrator's choice of fallback bridge per peer pair MAY consume
Provider.SnapshotEndpointsForDomain, but the heuristics (latency, load, churn) are owned by the relay-fallback assignment context. The endpoint surface simply maintains the snapshot. The relay-fallback context now also extends thepeer_endpoint_changedpayload with an additivefallback_endpointfield; the silent-unreachability composite signal occurs when both theendpointand thefallback_endpointstrings on the same payload are empty — endpoint TTL has expired AND no bridge is available — and downstream consumers MAY render the degraded state directly without further inference. See./relay-fallback.mdfor the chooser, the assignment table, and the silent-unreachability composition rule. - Endpoint history. This surface persists the per-Peer SNAPSHOT (the latest admitted observation plus the tombstone flag). The observation HISTORY lives in the outbox event log; a future
plexsphere.peer_endpoint_observationhistory table is explicitly out of scope, and the migration's DECISION block records the alternative as rejected. - NAT-type classification verification. The
nat_typefield is forwarded verbatim from the agent's classifier for downstream observability. The control plane does NOT verify the classification; a future telemetry pipeline that cross-checks reportednat_typeagainst observed endpoint behaviour is owned by a separate context. - Operator-tunable
MaxEndpointSkew. The 60-second bound is a security invariant, not a tuning knob — the file-level DECISION block ininternal/platform/clock/skew.gorejects the configuration alternative because widening would silently erode the replay-resistance the threat model relies on. A future per-Domain knob is NOT planned. - Rate limiting endpoint observations below the per-Domain TTL. An agent that PUTs faster than the per-Domain
TTLcadence is NOT rate-limited here — the surface admits every well-formed observation and the recorder runs the full read/compare/write cycle on every call. Per-Node rate limiting is a future surface. - SSE drain of
peer_endpoint_changed. The outbox row is durable on every transition today, but the SSE relay drain that surfaces the row onGET /v1/nodes/{id}/eventsis tracked separately in../../architecture/mesh-event-bus-roadmap.md. The wire shape is stable today so downstream subscribers can pin against it. - Fallback-endpoint advisory in
peer_endpoint_changed. The payload deliberately omits afallback_endpointfield. The relay-fallback assignment surface owns that extension in a later story; widening the JSON shape with an additive optional field is a non-breaking change for downstream consumers when the time comes.