
Reconciliation pull — GET /v1/nodes/{id}/state

This document is the authoritative bounded-context reference for the reconciliation-pull surface that ships under internal/mesh/state and is exposed through GET /v1/nodes/{id}/state. It covers the ubiquitous language, the snapshot envelope shape, the four-block convergence with the SSE event taxonomy, the security-gate ordering, the OpenAPI cross-reference, and the deferred-wiring posture.

The pull is the plexd → plexsphere fallback channel: a single authoritative cold-start view that plexd consumes when it first comes up, when its Signed SSE Event Bus connection has been disconnected for longer than the replay window, or when an out-of-band request arrives to re-derive the desired state. Anything outside that surface (the per-Node SSE push at GET /v1/nodes/{id}/events, the underlying outbox relay, the JetStream broker, the policy compiler, the bridge orchestrator) is a collaborator the pull surface projects from, not a concern of this document. This document is the implementation-side reference for the README's Node State Service specification; the README pins the reconciliation contract and this document pins the in-process composition pipeline that produces it.

Status — partial delivery

This story ships the wire envelope, the handler body, the snapshot composer, and the production composition root of the reconciliation-pull surface. The handler is fail-closed by construction: the gate in internal/transport/http/v1/handlers/state_dispatch.go returns 501 Not Implemented with code signed_event_bus_not_provisioned whenever any of the three load-bearing security ports (SnapshotProvider, RelationChecker, NodeRepo) is nil. Three control-plane gates that the README and this document name as part of the pull's contract are deferred to follow-up work and are NOT live in the production binary today:

  • Production wiring of the RelationChecker port: the handler refuses to mount without it, so the addressed Node's node-agent ReBAC relation cannot be enforced today. The composition root in cmd/plexsphere/state_factory_prod.go populates SnapshotProvider and NodeRepo only; until the upstream ReBAC authorizer wiring lands a *authz.Authorizer-backed adapter on handlers.Deps.RelationChecker, every production request continues to fall through the state_dispatch.go gate to 501. The 403 path documented in the OpenAPI spec is therefore unreachable from the production binary even after PLEXSPHERE_DSN is set.
  • Production wiring of the AuditSink port: the handler emits audit rows for the granted and insufficient_relation outcomes through handlers.Deps.AuditSink when one is supplied, but the composition root does not yet plumb a *audit.Sink-backed adapter through. The control flow is unaffected (audit emission is best-effort and a nil sink simply suppresses the row), but operators scraping relation=node_state.pull outcomes will see no rows until the sink is wired.
  • Production wiring of the SnapshotProvider and NodeRepo ports on the upstream binary path: BuildProductionStateFactory returns a non-nil bundle ONLY when PLEXSPHERE_DSN is set and the factory closure is invoked from cmd/plexsphere/main.go. The factory is the wiring receipt that prevents the earlier deferred-wiring lesson from recurring on this surface (the chainsaw E2E fixture and the source-substring drift gate at ../../../tests/workspace/state_factory_wiring_receipt_test.go pin the regression). With PLEXSPHERE_DSN unset the binary keeps the 501 stub, by design.

The chainsaw E2E fixture at ../../../tests/e2e/mesh/reconciliation-pull/chainsaw-test.yaml stands the production plexsphere binary up against a Postgres + SpiceDB stack inside a kind cluster, registers a Domain and N Nodes through the production POST /v1/domains/{id}/nodes surface, and asserts that the peers projection is correct. It carries skip: true until the SSE producer wiring lands because the fixture's terminal convergence gate (step 4 in its file-level DECISION block) issues an SSE peer_registered event and asserts the (node_id, mesh_ip, public_key) tuples agree byte-for-byte between the SSE channel and the pull response. That gate depends on cmd/plexsphere/sse_factory_prod.go's EventStream / EventPublisher seams being non-nil, which the SSE roadmap documents as deferred. The integration-tier equivalent at ../../../tests/integration/state_pull_sse_equivalence_test.go exercises the same convergence story without the chainsaw round-trip and is GREEN today.

What does ship and is load-bearing today:

Downstream stories that depend on the reconciliation-pull surface (the policy fan-out, the bridge orchestrator, and the node-state reports) MUST treat the placeholder blocks as null until their own event-types land — the wire field is present from day one so a diff-by-presence reconcile loop in plexd does not need a second deployment to start observing it.

Cross-references

Ubiquitous language

The terms below travel together across the Go code, the OpenAPI contract, the audit log, the structured-log attributes, and operator-facing tooling. Names are preserved verbatim in error messages and audit row vocabulary so a reader chasing a string from a log line finds it in the source without translation.

| Term | Definition |
| --- | --- |
| Reconciliation Pull | The HTTP-shaped fallback channel through which plexd re-derives a Node's desired state on cold start, after a replay-window-out-of-bounds disconnect, or on an explicit operator request. Served at GET /v1/nodes/{id}/state. |
| NodeStateSnapshot | The canonical reconciliation-pull envelope (internal/mesh/state/snapshot.go) with five wire blocks: peers, policy, bridge, state, reports. The OpenAPI schema of the same name (NodeStateSnapshot) is the wire mirror. |
| Addressed Node | The Node whose id is on the request path. The snapshot is composed for this Node; the peers projection excludes it explicitly so plexd does not program a self-peer. |
| Domain Peer | Any other Node in the addressed Node's Domain. Each domain peer projects into one entry in NodeStateSnapshot.Peers carrying the (NodeID, MeshIP, PublicKey) triple plexd needs to program a WireGuard tunnel. |
| SnapshotProvider | The transport-side port (internal/transport/http/v1/handlers/state_deps.go) the handler calls to compose the envelope. The production concrete is *state.SnapshotComposer; tests substitute a recording fake. The same signature is mirrored on the bounded-context port state.SnapshotProvider so the concrete satisfies both interfaces simultaneously. |
| PeerSource | The persistence-side seam (internal/mesh/state/snapshot.go) the composer reads peer rows through. Production wiring is the peerSourceAdapter in cmd/plexsphere/state_factory_prod.go; tests substitute a recording fake without pgx machinery. The seam exists so internal/mesh/state stays free of direct pgx imports; the no-direct-persistence-from-contexts depguard rule denies them. |
| PeerRow | The per-row shape PeerSource yields to the composer. Byte-shape-compatible with sqlcgen.SnapshotPeersForDomainRow projected into domain types. A row with a corrupt MeshIP / empty PublicKey reaches the composer for filtering; the source MUST NOT silently drop it. |
| Per-peer invariants | The trio enforced inside the composer: NodeID != uuid.Nil, MeshIP.IsValid() && !MeshIP.IsUnspecified(), len(PublicKey) > 0. A row that fails any check is dropped with a single WARN-level slog line and the snapshot composes from the survivors. |
| Placeholder Block | A wire field (policy, bridge, state, reports) that ships as JSON null today and will carry a populated value object once its owning downstream story lands. The field itself is always present on the wire so plexd's reconcile loop can diff by presence rather than absence. |
| AuditRelationNodeStatePull | The canonical Relation string the state handler stamps on every audit row: node_state.pull. Mirrors AuditRelationNodeEventsSubscribe so the push and pull surfaces emit consistent audit shapes. |
| node-agent relation | The ReBAC relation enforced on both GET /v1/nodes/{id}/state and GET /v1/nodes/{id}/events. Any caller authorised to subscribe to a Node's SSE event stream is also authorised to issue a reconciliation pull. |
| signed_event_bus_not_provisioned | The Problem-Details code returned by the 501 fail-closed stub. Shared with the SSE peer endpoint so a single alert rule scrapes both surfaces; the two unblock together when the composition root wires the deferred ports. |
| Wiring Receipt | The lesson the earlier SSE wiring left behind: a production binary whose composition root quietly leaves a load-bearing port nil silently 501s, and the gap is invisible until the first request lands in production. BuildProductionStateFactory is the wiring receipt for this surface. |

Snapshot envelope diagram

The wire envelope is the state.NodeStateSnapshot value type defined in internal/mesh/state/snapshot.go. The OpenAPI schema of the same name is its byte-mirror; the transport layer projects the Go value onto the schema directly without an intermediate DTO.

   ┌────────────────────────────────────────────────────────────┐
   │  GET /v1/nodes/{addressed-id}/state                        │
   │      Accept: application/json                              │
   │      Authorization: Bearer <node-agent JWT>                │
   └─────────────────────────┬──────────────────────────────────┘
                             │
                             v
   ┌────────────────────────────────────────────────────────────┐
   │  NodeStateSnapshot                                         │
   ├────────────────────────────────────────────────────────────┤
   │  peers        [ NodeStatePeer, NodeStatePeer, … ]          │
   │                 ── one entry per OTHER Node in the Domain  │
   │                 ── ordered by node_id ASC (REQ-005)        │
   │                 ── per-peer invariants enforced (REQ-013)  │
   │                                                            │
   │  reachability { state, last_heartbeat_at, changed_at }     │
   │                 ── always present (PX-0019, REQ-009)       │
   │                                                            │
   │  policy       null  (placeholder until S028)               │
   │  bridge       null  (placeholder until S037)               │
   │  state        null  (placeholder until S055)               │
   │  reports      null  (placeholder until S055)               │
   └────────────────────────────────────────────────────────────┘

   Each NodeStatePeer carries the minimum plexd needs to program a
   WireGuard tunnel — the triple is byte-compatible with the
   RegisterPeer payload so the bootstrap and the reconciliation
   pull converge on a single wire contract (PX-0018, REQ-001;
   PX-0015, REQ-001):

   ┌────────────────────────────────────────────────────────────┐
   │  NodeStatePeer                                             │
   ├────────────────────────────────────────────────────────────┤
   │  node_id     uuid (UUIDv7)                                 │
   │  mesh_ip     string ("10.42.0.7")                          │
   │  public_key  string (base64-padded, 44-char)               │
   └────────────────────────────────────────────────────────────┘

The peers array is always present: empty [] when the addressed Node has no peers, never null. The four placeholder blocks may be null until their owning story populates them. Both positions are pinned in the OpenAPI schema's required keyword and covered by the integration-tier happy-path test at ../../../tests/integration/state_pull_happy_path_test.go.
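The presence rule above can be sketched with Go's standard encoding/json. This is a minimal illustration, not the production type: Peer, Placeholder, Snapshot, and encodeSnapshot are hypothetical stand-ins, and the reachability block is left out to keep the sketch short. It shows that a normalised empty slice encodes as [] while nil pointer fields without omitempty still appear on the wire as null.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Peer is a hypothetical stand-in for the NodeStatePeer wire shape.
type Peer struct {
	NodeID    string `json:"node_id"`
	MeshIP    string `json:"mesh_ip"`
	PublicKey string `json:"public_key"`
}

// Placeholder is a hypothetical empty value object for a block whose
// owning story has not populated it.
type Placeholder struct{}

// Snapshot is a hypothetical, trimmed envelope (reachability omitted).
// No field uses omitempty, so every key is always present on the wire.
type Snapshot struct {
	Peers   []Peer       `json:"peers"`
	Policy  *Placeholder `json:"policy"`
	Bridge  *Placeholder `json:"bridge"`
	State   *Placeholder `json:"state"`
	Reports *Placeholder `json:"reports"`
}

// encodeSnapshot normalises a nil peers slice to an empty one so the
// array encodes as [] rather than null, then marshals the envelope.
func encodeSnapshot(s Snapshot) (string, error) {
	if s.Peers == nil {
		s.Peers = []Peer{}
	}
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeSnapshot(Snapshot{})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

A reconcile loop that diffs by presence can then rely on every key existing and only needs to distinguish null from a populated object.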

The reachability block is always present — never null — and projects the addressed Node's heartbeat-driven liveness verdict so an SSE subscriber and a follow-up GET /v1/nodes/{id}/state observe the same to_state after a transition. The Go counterpart is state.Reachability in ../../../internal/mesh/state/snapshot.go, with the shape {State: string, LastHeartbeatAt: time.Time, ChangedAt: time.Time}. The state field is one of healthy, stale, unreachable, or "" ("not yet evaluated"); last_heartbeat_at is the wall-clock time of the last accepted heartbeat (zero when none has been received yet); changed_at is the wall-clock time the state field last transitioned. The block is projected from the corresponding reachability_state, last_heartbeat_at, and changed_at columns on plexsphere.nodes (added by migration ../../../internal/platform/db/migrations/0010_node_reachability.sql) through the ReachabilitySource SQL seam declared alongside PeerSource. See ./reachability.md for the reachability state-machine semantics.

Four-block convergence with the SSE event taxonomy

Every block in NodeStateSnapshot has a counterpart event-type on the Signed SSE Event Bus. The pull is the cold-start fallback for the same state the bus delivers as incremental deltas; the table below pins which event-type populates each block, which downstream story owns the block, and where the block currently sits in the rollout.

| Snapshot block | SSE event-type counterpart | Owning story | Status today |
| --- | --- | --- | --- |
| peers | peer_added, peer_removed, peer_key_rotated, peer_endpoint_changed | this surface; the pull is the cold-start fallback for the SSE peer-delta stream | Live. The composer projects the per-Domain peer set excluding the addressed Node, ordered by node_id ASC, with per-peer invariants enforced. Byte-equality with the SSE peer projection is pinned by ../../../tests/integration/state_pull_sse_equivalence_test.go. |
| policy | policy_updated | Policy Engine fan-out | Live. The composer reads the per-(Node, Policy) plexsphere.policy_compiled_ruleset rows through the PolicySource port and projects the merged (revision_id, fingerprint, rules) shape onto NodeStateSnapshot.Policy; a Node with no matched Policy still sees JSON null. The wire-side policy_updated fan-out runs from the compile-service arm with byte-equal fingerprint, pinned by ../../../tests/integration/policy_updated_wire_fanout_test.go. |
| bridge | bridge_config_updated | Bridge Orchestrator | Live. The composer reads the four bridge aggregate sets plus the per-peer relay-assignment rows through the BridgeSource port and projects the effective.EffectiveConfigBuilder output onto NodeStateSnapshot.bridge; a Node hosting no bridge Resource still sees JSON null. The wire-side bridge_config_updated fan-out runs from the bridge application services with byte-equal effective_config, pinned by the parity suite at tests/integration/bridge_config_updated_pull_parity_test.go. The README's "same payload, two channels" contract holds: the SSE effective_config bytes equal the pull bridge block for the same Node and bridge Resource. See ../bridge/events.md for the publisher-side dispatch table and the per-Node fan-out algorithm. |
| state (and reports) | node_state_updated | Node State Service | Live. The composer reads the addressed Node's node-state entries through the EntriesSource port and fans them by kind into the three NodeStateReports buckets (platform-owned metadata, platform-owned data, and upstream reports), ordered by key ascending; each bucket is [] (never null) when the Node carries no entry of that kind. reports mirrors state (the composer points both at the same value object) so a future split between "live state" and "rolled-up reports" lands without an OpenAPI break. Only platform-owned metadata/data writes fan out a node_state_updated event; an upstream report write does not. The SSE/pull equivalence is pinned by ../../../tests/integration/nodestate_snapshot_convergence_test.go. See ../state.md for the Node State Service model, the report ACL, and the closed event set. |

The convergence rule the table encodes is the README's reconciliation contract: every state delivered as an SSE event-type MUST also be recoverable from the pull response. Today only the peers block carries that guarantee end-to-end; the placeholder blocks make the contract structurally explicit so a downstream caller is not surprised by a future schema growth.
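One way to picture the convergence rule is a lookup from SSE event-type to the snapshot block that recovers it. The sketch below is illustrative only: the event-type and block names come from the table, while blockForEvent and unrecoverable are hypothetical helpers, not part of the codebase.

```go
package main

import "fmt"

// blockForEvent maps each SSE event-type named in the convergence table
// to the snapshot block that carries the same state on the pull.
// node_state_updated maps to "state"; the reports block mirrors it.
var blockForEvent = map[string]string{
	"peer_added":            "peers",
	"peer_removed":          "peers",
	"peer_key_rotated":      "peers",
	"peer_endpoint_changed": "peers",
	"policy_updated":        "policy",
	"bridge_config_updated": "bridge",
	"node_state_updated":    "state",
}

// unrecoverable returns the event-types that have no pull counterpart,
// in input order. A non-empty result would violate the convergence rule.
func unrecoverable(eventTypes []string) []string {
	var missing []string
	for _, et := range eventTypes {
		if _, ok := blockForEvent[et]; !ok {
			missing = append(missing, et)
		}
	}
	return missing
}

func main() {
	// "made_up_event" is a deliberately unknown type for illustration.
	fmt.Println(unrecoverable([]string{"peer_added", "policy_updated", "made_up_event"}))
}
```

A check of this shape could run against the closed event set at test time, so that a new event-type without a pull block fails early.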

Security-gate ordering

The handler in internal/transport/http/v1/handlers/state.go runs four gates in a deliberate, fail-closed order. The order is security-driven, not performance-driven: each step rejects a strict superset of the requests the next step would have rejected, so re-ordering any pair admits a leak or burns work an earlier reject would have skipped.

The same ordering is used by the SSE peer endpoint (./sse.md) — both surfaces share the contract so a single mental model covers GET /v1/nodes/{id}/events and GET /v1/nodes/{id}/state.

| Step | Gate | Reject behaviour | Why this position |
| --- | --- | --- | --- |
| 1 | Authn: authn.FromContext(ctx) returns a non-KindUnknown Principal. | 401 Unauthorized with code unauthorized. | Reject unsigned callers BEFORE we burn a SpiceDB round-trip. |
| 2 | Authz (ReBAC): RelationChecker.Check(ctx, principal, "node-agent", "node:<id>") returns (true, nil). | 403 Forbidden with code insufficient_relation; audit row outcome=insufficient_relation is emitted when a sink is wired. | Reject the unauthorised caller BEFORE we leak Node existence via the 404/403 timing differential. |
| 3 | Lookup: NodeRepo.GetByID(ctx, id) returns (node, nil); an errors.Is(err, ErrNodeNotFound) miss renders 404. | 404 Not Found with code node_not_found on the not-found arm; 500 Internal Server Error on any other error. | Reject the missing Node BEFORE we burn a snapshot composition. Done AFTER authz so an unauthorised caller cannot probe Node existence. The single GetByID call replaces the previous Exists-then-GetByID double round-trip: the lookup doubles as the 404 gate AND the DomainID resolver for the next step. |
| 4 | Snapshot composition: SnapshotProvider.SnapshotForNode(ctx, id, domainID) returns (snapshot, nil). | 500 Internal Server Error with no public code; the underlying error is logged with structured fields. | Compose the wire body LAST. The composer's contract guarantees a single SQL round-trip and ascending peer order so the response is deterministic and byte-stable across consecutive pulls. |

A successful pass through all four gates emits an audit row with relation=node_state.pull, outcome=granted, subject=<principal>, object=node:<id> (when an AuditSink is wired) and writes the NodeStateSnapshot JSON body with 200 OK. A nil AuditSink suppresses the audit row but does not change the security control flow; the gate at the top of state_dispatch.go deliberately excludes AuditSink so a missing sink does not over-fire the 501 stub.

The 501 fail-closed stub at internal/transport/http/v1/handlers/state_dispatch.go sits in front of every gate above. It refuses to dispatch to the body when ANY of the three load-bearing security ports is nil, returning 501 Not Implemented with code signed_event_bus_not_provisioned. The code is shared with the SSE peer endpoint because the two surfaces unblock together when the composition root wires the deferred ports — a single alert rule on the code suffices to catch either surface in the deferred posture.
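The ordering described in this section (nil-port stub first, then authn, authz, lookup, composition) can be sketched as one function that returns a status and a Problem code. This is a simplified model under stated assumptions, not the handler: ports, handle, and demoPorts are hypothetical, the collaborators are plain function values rather than the real interfaces, an empty principal string stands in for a KindUnknown Principal, and a RelationChecker error is treated as a fail-closed 403, which the source does not specify.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errNodeNotFound is a stand-in for the repository's ErrNodeNotFound.
var errNodeNotFound = errors.New("node not found")

// ports is a hypothetical bundle of the three load-bearing security ports.
type ports struct {
	snapshotProvider func(nodeID, domainID string) (string, error)
	relationChecker  func(principal, nodeID string) (bool, error)
	nodeRepo         func(nodeID string) (domainID string, err error)
}

// handle runs the gates in the documented order and returns the HTTP
// status plus the public Problem code ("" when none applies).
func handle(p ports, principal, nodeID string) (int, string) {
	// Entry stub: any nil port means the surface is not provisioned.
	if p.snapshotProvider == nil || p.relationChecker == nil || p.nodeRepo == nil {
		return http.StatusNotImplemented, "signed_event_bus_not_provisioned"
	}
	// Step 1: authn. An empty principal models a KindUnknown Principal.
	if principal == "" {
		return http.StatusUnauthorized, "unauthorized"
	}
	// Step 2: authz. Assumption: a checker error also fails closed as 403.
	if allowed, err := p.relationChecker(principal, nodeID); err != nil || !allowed {
		return http.StatusForbidden, "insufficient_relation"
	}
	// Step 3: lookup, which also resolves the DomainID for step 4.
	domainID, err := p.nodeRepo(nodeID)
	if errors.Is(err, errNodeNotFound) {
		return http.StatusNotFound, "node_not_found"
	}
	if err != nil {
		return http.StatusInternalServerError, ""
	}
	// Step 4: compose the wire body last.
	if _, err := p.snapshotProvider(nodeID, domainID); err != nil {
		return http.StatusInternalServerError, ""
	}
	return http.StatusOK, ""
}

// demoPorts wires fakes: only "agent-a" holds the relation and the
// node id "missing" does not exist.
func demoPorts() ports {
	return ports{
		snapshotProvider: func(nodeID, domainID string) (string, error) { return "{}", nil },
		relationChecker:  func(principal, nodeID string) (bool, error) { return principal == "agent-a", nil },
		nodeRepo: func(nodeID string) (string, error) {
			if nodeID == "missing" {
				return "", errNodeNotFound
			}
			return "domain-1", nil
		},
	}
}

func main() {
	fmt.Println(handle(ports{}, "agent-a", "n1"))           // nil ports
	fmt.Println(handle(demoPorts(), "agent-b", "missing")) // authz runs before lookup
	fmt.Println(handle(demoPorts(), "agent-a", "missing"))
	fmt.Println(handle(demoPorts(), "agent-a", "n1"))
}
```

The second call is the interesting one: an unauthorised caller asking about a missing Node is rejected at the authz gate, so the response does not reveal whether the Node exists.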

OpenAPI cross-reference

The wire surface is pinned in ../../../api/openapi/plexsphere-v1.yaml under the mesh tag. The relevant operationIds and schemas:

| OpenAPI artefact | Where in the spec | Go counterpart |
| --- | --- | --- |
| GetNodeState operation | paths./v1/nodes/{id}/state.get | Handlers.GetNodeState in internal/transport/http/v1/handlers/state_dispatch.go (gated entry) and getNodeState in internal/transport/http/v1/handlers/state.go (body). |
| NodeStateSnapshot schema | components.schemas.NodeStateSnapshot | state.NodeStateSnapshot in internal/mesh/state/snapshot.go; projected onto server.NodeStateSnapshot (oapi-codegen output) by buildNodeStateResponse in state.go. |
| NodeStatePeer schema | components.schemas.NodeStatePeer | state.Peer in snapshot.go; projected onto server.NodeStatePeer with the PublicKey field base64-encoded with standard padding to mirror RegisterPeer's convention. |
| NodeStatePolicy schema | components.schemas.NodeStatePolicy | state.Policy (placeholder until the Policy Engine fan-out lands). |
| NodeStateBridge schema | components.schemas.NodeStateBridge | state.Bridge (placeholder until the Bridge Orchestrator lands). |
| NodeStateReports schema | components.schemas.NodeStateReports | state.Reports; the composer fans the addressed Node's entries into the Metadata / Data / Reports buckets and points both the state and reports blocks at the same value (the mirror). See ../state.md. |
| Problem schema | components.schemas.Problem (shared) | problem.Problem in the transport layer. The state handler emits codes unauthorized, insufficient_relation, node_not_found, signed_event_bus_not_provisioned. |

The OpenAPI byte-equality drift gate at ../../../tests/integration/state_pull_openapi_drift_test.go asserts that the source spec at ../../../api/openapi/plexsphere-v1.yaml and the embedded mirror under ../../../internal/transport/http/v1/handlers/ are byte-identical, AND that the generated ServerInterface.GetNodeState signature pins the (http.ResponseWriter, *http.Request, openapi_types.UUID) shape via reflect.Type inspection. A spec drift therefore surfaces at go test time, not at runtime.
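The two assertions that drift gate makes can be sketched with the standard library alone. The names here are assumptions for illustration: UUID stands in for openapi_types.UUID, ServerInterface is a one-method stand-in for the generated interface, and specsIdentical and signaturePinned are hypothetical helpers rather than the real test's functions.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"reflect"
)

// UUID is a stand-in for openapi_types.UUID.
type UUID [16]byte

// ServerInterface is a stand-in for the generated server interface.
type ServerInterface interface {
	GetNodeState(w http.ResponseWriter, r *http.Request, id UUID)
}

// specsIdentical reports whether the source spec and its embedded
// mirror are byte-for-byte equal.
func specsIdentical(source, embedded []byte) bool {
	return bytes.Equal(source, embedded)
}

// signaturePinned inspects the interface method via reflection and
// checks the exact parameter list and the absence of return values.
func signaturePinned() bool {
	iface := reflect.TypeOf((*ServerInterface)(nil)).Elem()
	m, ok := iface.MethodByName("GetNodeState")
	if !ok {
		return false
	}
	want := []reflect.Type{
		reflect.TypeOf((*http.ResponseWriter)(nil)).Elem(),
		reflect.TypeOf((*http.Request)(nil)),
		reflect.TypeOf(UUID{}),
	}
	// Interface method types carry no receiver, so NumIn counts
	// only the declared parameters.
	if m.Type.NumIn() != len(want) || m.Type.NumOut() != 0 {
		return false
	}
	for i, t := range want {
		if m.Type.In(i) != t {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(specsIdentical([]byte("openapi: 3.0.3\n"), []byte("openapi: 3.0.3\n")))
	fmt.Println(signaturePinned())
}
```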

Composition flow

The composition flow runs strictly in the order shown — each step's output is the next step's input, and no step is skipped on the happy path:

   ┌────────────────────────────────────────────────────────────┐
   │  HTTP request: GET /v1/nodes/{id}/state                    │
   └────┬───────────────────────────────────────────────────────┘
        │ 1. authn middleware attaches Principal to ctx
        v
   ┌────────────────────────────┐
   │  state_dispatch.go : gate  │  if SnapshotProvider, RelationChecker,
   └────┬───────────────────────┘  or NodeRepo is nil → 501
        │
        v
   ┌────────────────────────────┐
   │  state.go : getNodeState   │
   └────┬───────────────────────┘
        │ 2. authn check       ────────► 401 if no Principal
        │ 3. RelationChecker   ────────► 403 if not allowed
        │                                 (audit: insufficient_relation)
        │ 4. NodeRepo.GetByID  ────────► 404 if ErrNodeNotFound
        │                                resolve DomainID otherwise
        v
   ┌────────────────────────────┐
   │  SnapshotComposer          │  PeersForDomain (single SQL round-trip)
   │  .SnapshotForNode          │     ── ORDER BY node_id ASC
   └────┬───────────────────────┘     ── id <> $2 (exclude addressed)
        │ 5. drop rows that fail per-peer invariants
        │    (WARN-log each drop with structured fields)
        │ 6. defensive copy of PublicKey slice per row
        v
   ┌────────────────────────────┐
   │  buildNodeStateResponse    │  base64-encode PublicKey
   │  (state.go)                │  Policy/Bridge/State/Reports = nil
   └────┬───────────────────────┘
        │ 7. emit audit row outcome=granted
        v
   ┌────────────────────────────┐
   │  writeJSON 200             │
   └────────────────────────────┘

Failure modes are typed: errZeroNodeID and errZeroDomainID short-circuit at step 4 with a 500 (the errors.Is(err, ErrNodeNotFound) arm renders 404 instead); a PeersForDomain error short-circuits at step 5 with a 500 wrapping the underlying error; and any nil port short-circuits at the entry stub with a 501. The composer NEVER returns a 200 with a partial peer set on an error path; partial success is preserved only on the per-peer invariant filter, which is a row-level WARN-and-continue path required by the per-peer invariant contract.

Composition root and the production wiring receipt

cmd/plexsphere/state_factory_prod.go is the wiring receipt for this surface. It validates productionStateConfig at BUILD time (not inside the returned closure), so a misconfigured operator sees the failure before /readyz lights up green. An empty PLEXSPHERE_DSN returns (nil, nil) deliberately — the caller in cmd/plexsphere/main.go falls back to the 501 stub path so an operator who has not opted into production wiring sees the gap during boot, not during the first pull.
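The build-time behaviour described above can be sketched as a constructor with three outcomes. The names and the validation rule are assumptions made for illustration: StateFactory and buildStateFactory are hypothetical, and checking for a postgres URL scheme is only a placeholder for whatever productionStateConfig actually validates.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// StateFactory is a hypothetical closure type returned by the builder.
type StateFactory func() (string, error)

// buildStateFactory validates its input eagerly, before any closure is
// returned. An empty DSN is not an error: it returns (nil, nil) so the
// caller can fall back to the 501 stub.
func buildStateFactory(dsn string) (StateFactory, error) {
	if dsn == "" {
		return nil, nil
	}
	u, err := url.Parse(dsn)
	if err != nil {
		return nil, fmt.Errorf("parse DSN: %w", err)
	}
	// Placeholder validation rule, assumed for this sketch only.
	if u.Scheme != "postgres" && u.Scheme != "postgresql" {
		return nil, errors.New("DSN must use the postgres scheme")
	}
	// The closure runs later; by then the configuration is known good.
	return func() (string, error) { return u.Host, nil }, nil
}

func main() {
	for _, dsn := range []string{"", "mysql://db/plexsphere", "postgres://db:5432/plexsphere"} {
		f, err := buildStateFactory(dsn)
		fmt.Printf("dsn=%q wired=%t err=%v\n", dsn, f != nil, err)
	}
}
```

Separating "not configured" (nil, nil) from "misconfigured" (nil, err) lets the caller distinguish an intentional opt-out from a boot failure.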

The peerSourceAdapter at the bottom of the file is the SOLE place in the tree where the internal/mesh/state package brushes against pgx. The trade-off — a single uuid.UUID → pgtype.UUID translation — is the price of keeping the no-direct-persistence-from-contexts depguard rule clean. The adapter performs exactly one SQL round-trip per call (the SnapshotPeersForDomain query in ../../../internal/platform/db/queries/10_tenancy.sql) and the composer's bench (../../../internal/mesh/state/snapshot_bench_test.go) asserts that contract on a 64-Node Domain.

The wiring-receipt drift gate at ../../../tests/workspace/state_factory_wiring_receipt_test.go reads cmd/plexsphere/main.go from disk and asserts both the BuildProductionStateFactory( call and the StateFactory: Config assignment are present — the test exists specifically to catch the earlier wiring-receipt regression: a production binary whose composition root quietly leaves a load-bearing port nil silently 501s, and the gap is invisible until the first real request lands. The end-to-end factory coverage proper lives at the integration tier in ../../../tests/integration/state_pull_happy_path_test.go and its peers, which boot BuildProductionStateFactory against a testcontainers Postgres and assert a non-error response.
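A source-substring gate of the kind described here reduces to a pair of strings.Contains checks. The sketch below takes the source as a string instead of reading cmd/plexsphere/main.go from disk, and missingReceipts is a hypothetical helper name; the two needles are the ones named in the paragraph above.

```go
package main

import (
	"fmt"
	"strings"
)

// missingReceipts returns the required substrings that do not appear in
// the given source text. An empty result means both receipts are present.
func missingReceipts(mainSource string) []string {
	required := []string{"BuildProductionStateFactory(", "StateFactory:"}
	var missing []string
	for _, needle := range required {
		if !strings.Contains(mainSource, needle) {
			missing = append(missing, needle)
		}
	}
	return missing
}

func main() {
	// Illustrative source fragments, not the real main.go.
	wired := "factory, err := BuildProductionStateFactory(cfg)\nConfig{StateFactory: factory}"
	unwired := "Config{}"
	fmt.Println(len(missingReceipts(wired)), len(missingReceipts(unwired)))
}
```

A substring check is deliberately coarse: it cannot prove the wiring is correct, only that the call and the assignment have not been silently removed.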

Operator runbook (post-wiring)

Until the deferred items in ../../architecture/mesh-event-bus-roadmap.md close and RelationChecker + AuditSink are wired through the composition root, the production binary always returns 501 on this endpoint and there is nothing to inspect operationally. The post-wiring runbook below lists the entry points an operator will chase once the surface is hot.

  • Inspect a single Node's snapshot. The pull is idempotent; an operator can issue GET /v1/nodes/{id}/state with a node-agent scoped JWT and diff the response against kubectl get plexdnodestate on the addressed Node. The peers array is sorted by node_id ASC so two consecutive pulls against the same ledger snapshot are byte-equal; a non-empty diff is a real change, not a non-determinism.
  • Inspect a Domain's peer projection. The SQL query backing the pull lives at ../../../internal/platform/db/queries/10_tenancy.sql under the :many SnapshotPeersForDomain name. An operator can issue the same projection by hand: SELECT node_id, mesh_ip, public_key FROM plexsphere.tenancy_nodes WHERE domain_id = $1 AND id <> $2 ORDER BY node_id ASC; and compare against the pull response.
  • Watch the relation=node_state.pull audit stream. Once the AuditSink is wired the handler emits one audit row per request with outcome=granted (success) or outcome=insufficient_relation (403). A spike of insufficient_relation correlates with a misconfigured ReBAC tuple set; a sustained absence of any rows correlates with the deferred wiring still being in place — check the 501 rate on /v1/nodes/{id}/state first.
  • Look for WARN-on-drop log lines. The composer emits a single WARN-level slog line per dropped peer row carrying outcome=invalid_node_id / invalid_mesh_ip / empty_public_key, plus the domain_id, addressed_node_id, and peer_node_id fields. A non-zero rate is the operator's signal that the ledger holds a corrupt row that should be reconciled — the snapshot itself remains correct because the composer drops the offending row from the response.