Metrics & Logs query — bounded read-only proxy onto Mimir / Loki
This document is the authoritative bounded-context reference for the Observability Query context — the read-only side of the observability stack. Where the ingest front door admits and buffers telemetry and the routing consumer drains it to the backends, this context lets an operator read it back: it is a bounded PromQL/Mimir and LogQL/Loki query proxy. It accepts an operator's instant or range query, enforces hard query bounds so a single query cannot overwhelm the bundled Grafana Mimir and Grafana Loki backends, forwards the bounded request to the matching backend, and returns the backend's raw JSON envelope verbatim. The domain root that pins the ubiquitous language is ../../../internal/observability/query.
The context is a read proxy, not an aggregate. Unlike the alerts and incidents contexts it owns no persisted entity and no lifecycle — it owns the query bounds, the two backend ports and their HTTP adapters, the upstream-error classification, and the application service that validates a query before forwarding it. The raw backend JSON is returned as-is; shaping it into a transport surface is a higher layer's concern, not this context's.
Ubiquitous language
The terms below travel verbatim across the domain root, the bounds, the classifier, the two backend ports, their adapters, and the application service.
| Term | Definition | Code anchor |
|---|---|---|
| InstantQuery | A single PromQL instant query: an expression evaluated at one point in time. Carries Expr and Time. | ../../../internal/observability/query/service.go (InstantQuery) |
| RangeQuery | A PromQL or LogQL range query: an expression evaluated across a bounded window. Carries Expr, Start, End, a metrics-only Step, and a Limit (series for metrics, lines for logs). | ../../../internal/observability/query/service.go (RangeQuery) |
| Result | The raw outcome of a backend query: the HTTP status the backend returned and its verbatim JSON body. Shaping it is a higher layer's concern. | ../../../internal/observability/query/service.go (Result) |
| MetricsQuerier | The outbound port for the Mimir metrics backend — Instant and Range. The Mimir adapter implements it; the service depends only on the port so a unit test can substitute a fake. | ../../../internal/observability/query/service.go (MetricsQuerier) |
| LogsQuerier | The outbound port for the Loki logs backend — Range only (logs are always a range query). | ../../../internal/observability/query/service.go (LogsQuerier) |
| Config | The operator-supplied backend configuration — MimirQueryURL, LokiQueryURL, HTTPTimeout. An empty URL field disables that backend's query path. Validate rejects a malformed URL up front; WithDefaults normalises a zero timeout. | ../../../internal/observability/query/config.go (Config) |
| Query-outcome sentinels | The two sentinels every backend round-trip is classified into — ErrQueryBackendUnavailable (retryable) and ErrQueryRejected (terminal). Callers detect them with errors.Is. | ../../../internal/observability/query/classify.go (ErrQueryBackendUnavailable, ErrQueryRejected) |
The hard query bounds
The context caps how large a single query may be, so that one oversized operator query is rejected at the boundary instead of exhausting the shared bundled backends. The caps are enforced as free functions in the domain layer (ValidateQuery, ValidateTimeRange, ValidateLimit); failing fast at the boundary keeps the blast radius to the single bad request.
| Bound | Constant | Value | Enforced on |
|---|---|---|---|
| Max query length | MaxQueryLength | 4096 runes | The PromQL / LogQL expression (counted in runes). |
| Max time range | MaxTimeRange | 31 days | A range query's window span. |
| Max series | MaxSeries | 10000 | A metrics range query's result limit. |
| Max lines | MaxLines | 5000 | A logs range query's result limit. |
ValidateTimeRange additionally requires End to be strictly after Start, and ValidateQuery rejects an empty or whitespace-only expression. A zero Limit is allowed and means the caller imposes no extra cap of its own. The application service runs the matching subset of these validators before forwarding: an instant metrics query validates the expression; a range query also validates the window and the result limit (against MaxSeries for metrics, MaxLines for logs).
The transport layer re-declares the same expression-length (4096) and time-window (31-day) bounds and enforces them up front, so a malformed query yields a deterministic 400 without a backend round-trip; the domain validators run again as defence-in-depth.
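The domain validators described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the function names and bound values come from this document, while the signatures, the `ErrInvalidQuery` sentinel, the `validateRange` helper, the rejection of negative limits, and the treatment of a window of exactly 31 days as valid are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
	"unicode/utf8"
)

// Bound values taken from the table of hard query bounds.
const (
	MaxQueryLength = 4096
	MaxTimeRange   = 31 * 24 * time.Hour
	MaxSeries      = 10000
	MaxLines       = 5000
)

// ErrInvalidQuery is a hypothetical sentinel used only by this sketch.
var ErrInvalidQuery = errors.New("invalid query")

// RangeQuery mirrors the fields named in the ubiquitous-language table.
type RangeQuery struct {
	Expr       string
	Start, End time.Time
	Step       time.Duration
	Limit      int
}

// ValidateQuery rejects an empty, whitespace-only, or oversized expression.
// The length is counted in runes, not bytes.
func ValidateQuery(expr string) error {
	if strings.TrimSpace(expr) == "" {
		return fmt.Errorf("%w: expression is empty", ErrInvalidQuery)
	}
	if n := utf8.RuneCountInString(expr); n > MaxQueryLength {
		return fmt.Errorf("%w: %d runes exceeds %d", ErrInvalidQuery, n, MaxQueryLength)
	}
	return nil
}

// ValidateTimeRange requires end strictly after start and a span no wider
// than MaxTimeRange (inclusive upper bound is an assumption).
func ValidateTimeRange(start, end time.Time) error {
	if !end.After(start) {
		return fmt.Errorf("%w: end must be after start", ErrInvalidQuery)
	}
	if end.Sub(start) > MaxTimeRange {
		return fmt.Errorf("%w: window exceeds %s", ErrInvalidQuery, MaxTimeRange)
	}
	return nil
}

// ValidateLimit accepts zero (no caller-imposed cap) up to ceiling.
// Rejecting a negative limit is an assumption of this sketch.
func ValidateLimit(limit, ceiling int) error {
	if limit < 0 || limit > ceiling {
		return fmt.Errorf("%w: limit %d outside 0..%d", ErrInvalidQuery, limit, ceiling)
	}
	return nil
}

// validateRange is a hypothetical helper showing the subset a range query
// runs: expression, then window, then result limit.
func validateRange(q RangeQuery, ceiling int) error {
	if err := ValidateQuery(q.Expr); err != nil {
		return err
	}
	if err := ValidateTimeRange(q.Start, q.End); err != nil {
		return err
	}
	return ValidateLimit(q.Limit, ceiling)
}

func main() {
	start := time.Date(2024, time.January, 1, 0, 0, 0, 0, time.UTC)
	ok := RangeQuery{Expr: "up", Start: start, End: start.Add(time.Hour), Limit: 500}
	tooWide := ok
	tooWide.End = start.Add(32 * 24 * time.Hour)

	fmt.Println(validateRange(ok, MaxSeries))      // metrics range query within bounds
	fmt.Println(validateRange(tooWide, MaxSeries)) // window wider than 31 days
}
```

A logs range query would call the same helper with `MaxLines` as the ceiling.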
Upstream-error classification
Every backend round-trip is classified by classifyResponse into exactly one outcome, mirroring the routing context's classifier:
| Outcome | Condition |
|---|---|
| success | a 2xx (nil error). |
| ErrQueryBackendUnavailable | a transport error, a 429, or any 5xx; retryable. |
| ErrQueryRejected | any other non-2xx (a non-429 4xx); terminal, never retried. |
A non-429 4xx is terminal because the backend will never accept this query (a parse error, an unknown tenant), so retrying it is pointless; a 429 and a 5xx are transient (rate limit, backend restart) and a transport error is by nature retryable.
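The classification rules above fit in a single switch. The sketch below is illustrative only: the sentinel and function names come from this document, but the signature of `classifyResponse`, the error messages, and the `label` helper are assumptions. Wrapping two errors with `%w` needs Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// The two query-outcome sentinels named in the ubiquitous-language table.
var (
	ErrQueryBackendUnavailable = errors.New("query backend unavailable")
	ErrQueryRejected           = errors.New("query rejected by backend")
)

// classifyResponse maps one backend round-trip onto nil or exactly one
// sentinel. The (status, transportErr) signature is an assumption.
func classifyResponse(status int, transportErr error) error {
	switch {
	case transportErr != nil:
		// Keep the cause in the chain so callers can still match it.
		return fmt.Errorf("%w: %w", ErrQueryBackendUnavailable, transportErr)
	case status >= 200 && status < 300:
		return nil
	case status == http.StatusTooManyRequests || (status >= 500 && status < 600):
		return fmt.Errorf("%w: status %d", ErrQueryBackendUnavailable, status)
	default:
		return fmt.Errorf("%w: status %d", ErrQueryRejected, status)
	}
}

// label is a hypothetical helper that names the outcome for display.
func label(err error) string {
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, ErrQueryBackendUnavailable):
		return "retryable"
	case errors.Is(err, ErrQueryRejected):
		return "terminal"
	default:
		return "unclassified"
	}
}

func main() {
	for _, status := range []int{200, 404, 429, 503} {
		fmt.Println(status, label(classifyResponse(status, nil)))
	}
	fmt.Println("dial error:", label(classifyResponse(0, errors.New("connection refused"))))
}
```

Callers match the outcome with `errors.Is` rather than comparing error values directly, because the sentinel is wrapped with the status or cause.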
No upstream body leak. On a non-2xx the adapter still returns the Result (status plus body) so a caller inside the process can inspect the backend's error body alongside the classified sentinel — but at the transport boundary the 500 arm never interpolates the underlying error text into the wire body. A wrapped sentinel chain or a backend message can carry internal topology a caller has no right to see, so the transport logs it internally and keeps the wire body generic.
The backend adapters
Two HTTP adapters sit behind the ports. Each takes an injected `*http.Client` so all transport policy (timeouts, TLS) lives at the composition root and the adapters stay testable against an httptest server.
- Mimir (`../../../internal/observability/query/mimir.go`). `Instant` issues a GET to `<base>/api/v1/query` carrying the expression and the evaluation time as unix seconds; `Range` issues a GET to `<base>/api/v1/query_range` carrying the window bounds as unix seconds and the step in seconds.
- Loki (`../../../internal/observability/query/loki.go`). `Range` issues a GET to `<base>/loki/api/v1/query_range` carrying the window bounds as unix nanoseconds (Loki's contract: encoding them as seconds would collapse the whole window to one instant) and the line limit.
All three are read-only GETs.
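The difference in timestamp encoding between the two range adapters can be sketched as below. The paths come from this document and the parameter names (`query`, `start`, `end`, `step`, `limit`) follow the Prometheus-compatible and Loki HTTP query APIs. The helper names, the example window, and the host names are hypothetical, and the real adapters issue the request through the injected client instead of returning a URL string.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// An arbitrary one-hour example window used by main.
var (
	exampleStart = time.Unix(1700000000, 0)
	exampleEnd   = exampleStart.Add(time.Hour)
)

// mimirRangeURL encodes the window bounds as unix seconds and the step
// in whole seconds.
func mimirRangeURL(base, expr string, start, end time.Time, step time.Duration) string {
	v := url.Values{}
	v.Set("query", expr)
	v.Set("start", strconv.FormatInt(start.Unix(), 10))
	v.Set("end", strconv.FormatInt(end.Unix(), 10))
	v.Set("step", strconv.FormatInt(int64(step/time.Second), 10))
	return base + "/api/v1/query_range?" + v.Encode()
}

// lokiRangeURL encodes the window bounds as unix nanoseconds, which is
// what Loki expects for integer timestamps of this magnitude.
func lokiRangeURL(base, expr string, start, end time.Time, limit int) string {
	v := url.Values{}
	v.Set("query", expr)
	v.Set("start", strconv.FormatInt(start.UnixNano(), 10))
	v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
	v.Set("limit", strconv.Itoa(limit))
	return base + "/loki/api/v1/query_range?" + v.Encode()
}

func main() {
	fmt.Println(mimirRangeURL("http://mimir.example", "up", exampleStart, exampleEnd, 30*time.Second))
	fmt.Println(lokiRangeURL("http://loki.example", `{app="api"}`, exampleStart, exampleEnd, 100))
}
```

Keeping the two encoders separate makes the seconds versus nanoseconds contract visible at the call site, which is where a mix-up would otherwise go unnoticed.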
The HTTP surface
The two read operations live under /v1/domains/{domainId} and are implemented by the anti-corruption transport package ../../../internal/transport/http/v1/observability (the same package that owns the ingest front door, with the query handlers prefixed `Query*`).
| Operation | Method | Path |
|---|---|---|
| QueryDomainMetrics | GET | /v1/domains/{domainId}/metrics/query |
| QueryDomainLogs | GET | /v1/domains/{domainId}/logs/query |
The wire-contract origin is ../../../api/openapi/plexsphere-v1.yaml; this doc is a map of the surface, not a duplicate of the schema.
Instant versus range on the metrics path. QueryDomainMetrics selects an instant query when the time parameter is present (it ignores the range triple); otherwise it is a range query and start, end, and step are all required and bounded. The logs path is always a range query — the parameter type makes start / end required, so the only window guard is the end-after-start and 31-day-span check. The logs path defaults an absent limit to 100 and an absent direction to backward (newest first).
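The selection and defaulting rules above can be illustrated with a small sketch. The parameter names (`time`, `start`, `end`, `step`, `limit`, `direction`) and the default values are taken from this document; the function names and return shapes are hypothetical, and the real handlers work from generated OpenAPI parameter types, not raw query values.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// metricsMode picks an instant query when "time" is present and otherwise
// requires the full range triple.
func metricsMode(p url.Values) (string, error) {
	if p.Get("time") != "" {
		return "instant", nil // the range triple is ignored
	}
	for _, name := range []string{"start", "end", "step"} {
		if p.Get(name) == "" {
			return "", fmt.Errorf("range query requires %q", name)
		}
	}
	return "range", nil
}

// logsDefaults applies the documented defaults: limit 100, direction backward.
func logsDefaults(p url.Values) (limit int, direction string, err error) {
	limit, direction = 100, "backward"
	if raw := p.Get("limit"); raw != "" {
		if limit, err = strconv.Atoi(raw); err != nil {
			return 0, "", fmt.Errorf("limit must be an integer: %q", raw)
		}
	}
	if raw := p.Get("direction"); raw != "" {
		direction = raw
	}
	return limit, direction, nil
}

func main() {
	fmt.Println(metricsMode(url.Values{"time": {"1700000000"}}))
	fmt.Println(metricsMode(url.Values{"start": {"1700000000"}, "end": {"1700003600"}}))
	fmt.Println(logsDefaults(url.Values{}))
}
```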
Order of checks. Both handlers run the same oracle-prevention order: a 501 when the surface is unwired, a 401 when the request carries no authenticated principal, a 400 when the addressed Domain id is the zero UUID, the ReBAC read check on the addressed Domain before any backend forward (so an unauthorised caller never observes the existence side-channel a forward-then-check flow would leak), then a 400 on an empty / oversized query or a malformed range window, and finally the forward and the upstream-error mapping.
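The ordering above matters most at the ReBAC step: an unauthorised caller must see a 403 before any query validation or backend forward happens. A minimal sketch of that ordering follows, with every type and function name hypothetical and the query check reduced to an emptiness test.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const zeroUUID = "00000000-0000-0000-0000-000000000000"

// request is a hypothetical, simplified view of an inbound query.
type request struct {
	Principal string // empty means no authenticated principal
	DomainID  string
	Expr      string
}

// handler is a hypothetical stand-in for a query handler's dependencies.
type handler struct {
	wired   bool
	canRead func(principal, domainID string) bool
	forward func(r request) int
}

// serve returns the HTTP status, applying the gates in the documented order.
func (h handler) serve(r request) int {
	switch {
	case !h.wired:
		return http.StatusNotImplemented
	case r.Principal == "":
		return http.StatusUnauthorized
	case r.DomainID == zeroUUID:
		return http.StatusBadRequest
	case !h.canRead(r.Principal, r.DomainID):
		return http.StatusForbidden // checked before the query is even inspected
	case strings.TrimSpace(r.Expr) == "":
		return http.StatusBadRequest
	default:
		return h.forward(r)
	}
}

func main() {
	h := handler{
		wired:   true,
		canRead: func(principal, _ string) bool { return principal == "alice" },
		forward: func(request) int { return http.StatusOK },
	}
	// An unauthorised caller with an invalid query still gets the 403.
	fmt.Println(h.serve(request{Principal: "mallory", DomainID: "d1", Expr: ""}))
	fmt.Println(h.serve(request{Principal: "alice", DomainID: "d1", Expr: "up"}))
}
```

Because the authorisation gate precedes validation, the response to an unauthorised caller does not vary with the query they sent.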
Server-side tenant injection
The single most important security property of the read path: the upstream tenant header X-Scope-OrgID is injected server-side, never forwarded from a client. The handler resolves the addressed Domain from the URL path-id and clears the per-Domain read ReBAC gate; the composition-root adapter then stashes that authorised Domain id on the request context, and a tenantRoundTripper stamps it as the upstream X-Scope-OrgID so each tenant's series and streams come only from that tenant's isolated store.
The handler never reads or forwards any client-supplied tenant header. Forwarding a caller-supplied X-Scope-OrgID was deliberately rejected: it would let an authenticated caller read another tenant's metrics by spoofing the header. The tenant boundary is derived from the authorised path-id, not from caller-controlled input — see the DECISION block in ../../../cmd/plexsphere/observability_query_factory_prod.go.
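A round-tripper of the kind described above can be sketched as follows. Only the `tenantRoundTripper` name and the `X-Scope-OrgID` header come from this document; the context key, `withTenant`, the stub transport, the `upstreamTenant` demo helper, and the choice to fail closed when no tenant is present are assumptions of this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

type tenantKey struct{}

// withTenant stashes the authorised Domain id on the request context.
func withTenant(ctx context.Context, domainID string) context.Context {
	return context.WithValue(ctx, tenantKey{}, domainID)
}

// tenantRoundTripper stamps the authorised Domain id as X-Scope-OrgID.
type tenantRoundTripper struct{ base http.RoundTripper }

func (t tenantRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	id, _ := req.Context().Value(tenantKey{}).(string)
	if id == "" {
		// Fail closed: never forward without an authorised tenant.
		return nil, errors.New("no authorised tenant on request context")
	}
	out := req.Clone(req.Context()) // a RoundTripper must not mutate its input
	out.Header.Set("X-Scope-OrgID", id)
	return t.base.RoundTrip(out)
}

// roundTripFunc adapts a function to http.RoundTripper for the demo.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// upstreamTenant runs one request through the round-tripper against a stub
// backend and reports the tenant header the backend would have seen.
func upstreamTenant(domainID string) (string, error) {
	var seen string
	stub := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("X-Scope-OrgID")
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
	})
	ctx := context.Background()
	if domainID != "" {
		ctx = withTenant(ctx, domainID)
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://mimir.example/api/v1/query", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("X-Scope-OrgID", "spoofed") // simulates a stray caller-supplied value
	resp, err := tenantRoundTripper{base: stub}.RoundTrip(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	fmt.Println(upstreamTenant("tenant-a")) // the context value wins over the stray header
	fmt.Println(upstreamTenant(""))         // no tenant on the context: the request is refused
}
```

Setting the header inside the transport means no handler code path can forward a request with a missing or caller-chosen tenant.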
Error-code taxonomy
Every failure surface carries a stable Problem.code. The closed set the two query handlers emit, with its HTTP status:
| HTTP | Problem.code | Meaning |
|---|---|---|
| 400 | query_invalid | The query expression is empty or exceeds the maximum length. |
| 400 | query_range_invalid | A range window is malformed (missing / non-positive step, end not after start, or over the 31-day span). |
| 400 | invalid_domain_id | The addressed Domain id is the zero UUID. |
| 401 | unauthorized | The request carries no authenticated principal. |
| 403 | (PermissionDenied) | The caller lacks read on the addressed Domain. |
| 502 | query_upstream_unavailable | The backend round-trip was retryable (a 5xx, a 429, or a non-deadline transport error). |
| 504 | query_upstream_timeout | The backend round-trip exceeded the query deadline. |
| 501 | observability_query_not_provisioned | The surface is not wired (no backend URL configured). |
| 500 | internal | An unexpected fault; the wire body stays generic and never leaks the upstream body. |
The transport-local sentinels the production adapter translates the domain errors onto are ErrQueryUpstreamUnavailable (→ 502), ErrQueryUpstreamTimeout (→ 504), and ErrQueryBackendNotConfigured (→ 501), declared in ../../../internal/transport/http/v1/observability/query_errors.go. The timeout check runs first in the adapter's translation: a context.DeadlineExceeded matches both the deadline and the backend-unavailable sentinel, so checking the timeout first preserves the 504 signal an operator needs to distinguish a slow backend from a failing one.
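Why the timeout check must run first is easiest to see in code. The sketch below reuses the sentinel names from this document; the `translate` and `statusFor` functions, the error messages, and the example error are hypothetical. How a rejected query is surfaced is not covered here. Wrapping two errors with `%w` needs Go 1.20 or later.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// Domain sentinel (declared in the query context).
var ErrQueryBackendUnavailable = errors.New("query backend unavailable")

// Transport-local sentinels named in the text above.
var (
	ErrQueryUpstreamUnavailable  = errors.New("query upstream unavailable")
	ErrQueryUpstreamTimeout      = errors.New("query upstream timeout")
	ErrQueryBackendNotConfigured = errors.New("query backend not configured")
)

// errSlowBackend is an example of a deadline that surfaced as a transport
// error: it matches both the deadline and the backend-unavailable sentinel.
var errSlowBackend = fmt.Errorf("%w: %w", ErrQueryBackendUnavailable, context.DeadlineExceeded)

// translate maps a domain error onto a transport-local sentinel.
// The deadline check comes first; swapping the two cases would turn
// every timeout into a 502.
func translate(err error) error {
	switch {
	case err == nil:
		return nil
	case errors.Is(err, context.DeadlineExceeded):
		return ErrQueryUpstreamTimeout
	case errors.Is(err, ErrQueryBackendUnavailable):
		return ErrQueryUpstreamUnavailable
	default:
		return err
	}
}

// statusFor maps the transport-local sentinels onto their HTTP status.
// Anything unrecognised is treated as an unexpected fault in this sketch.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrQueryUpstreamTimeout):
		return http.StatusGatewayTimeout
	case errors.Is(err, ErrQueryUpstreamUnavailable):
		return http.StatusBadGateway
	case errors.Is(err, ErrQueryBackendNotConfigured):
		return http.StatusNotImplemented
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(statusFor(translate(errSlowBackend)))             // slow backend
	fmt.Println(statusFor(translate(ErrQueryBackendUnavailable))) // failing backend
}
```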
ReBAC posture
Reading a Domain's metrics or logs is a Domain read, so both operations gate on the read relation against the canonical domain:<uuid> object, before any backend forward. The codebase gates on the schema relation read; the OpenAPI 403 prose names the operator-facing label domain-view. This naming split is recorded in a DECISION block in ../../../internal/transport/http/v1/observability/query_errors.go.
Audit contract
The query surface emits exactly one audit row: the per-Domain read denial. When a caller lacks read on the addressed Domain, the handler writes one permission_denied row (relation domain.observability.query) before the 403 is flushed, stamping the missing relation into the row's caveat context. A nil sink degrades silently while the 403 gate still fires. A successful read is not audited — it forwards the caller's query to the backend and returns the verbatim envelope.
Unlike the ingest 202 receipt, a successful query response carries no no-store directive: an intermediary caching a read result is benign.
Composition root
The production wiring is assembled in ../../../cmd/plexsphere/observability_query_factory_prod.go. It is opt-in on the backend URLs: with neither URL set the surface stays on its 501 observability_query_not_provisioned stub. Each URL is read through the non-empty-raw guard, and the config is validated at build time so a typo'd backend URL surfaces before /readyz lights green rather than as a transient query failure on the hot path. Setting a URL with a nil authorizer is a composition-root mistake and fails the build (ErrObservabilityQueryAuthzCheckRequired).
| Env var | What it tunes | Default |
|---|---|---|
| PLEXSPHERE_OBS_MIMIR_QUERY_URL | Grafana Mimir query API base URL; setting it enables the metrics instant / range query paths. | (none; empty disables) |
| PLEXSPHERE_OBS_LOKI_QUERY_URL | Grafana Loki query API base URL; setting it enables the logs range query path. | (none; empty disables) |
The per-query HTTP round-trip timeout defaults to 30s, applied via WithDefaults — a code default, not an env knob.
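The opt-in wiring described in this section could look roughly like the sketch below. The `Config` fields, `Validate`, `WithDefaults`, the env var names, and the 30s default come from this document. The `loadConfig` and `checkURL` helpers, the error messages, and the exact definition of a malformed URL (here: not an absolute http or https URL with a host) are assumptions, and the authorizer check is omitted.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"time"
)

const defaultHTTPTimeout = 30 * time.Second

// Config carries the operator-supplied backend configuration.
type Config struct {
	MimirQueryURL string
	LokiQueryURL  string
	HTTPTimeout   time.Duration
}

// WithDefaults normalises a zero timeout to the code default.
func (c Config) WithDefaults() Config {
	if c.HTTPTimeout == 0 {
		c.HTTPTimeout = defaultHTTPTimeout
	}
	return c
}

// checkURL accepts an empty value (backend disabled) or an absolute
// http/https URL with a host. The precise rule is an assumption.
func checkURL(name, raw string) error {
	if raw == "" {
		return nil
	}
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" || (u.Scheme != "http" && u.Scheme != "https") {
		return fmt.Errorf("%s query URL %q is malformed", name, raw)
	}
	return nil
}

// Validate rejects a malformed backend URL up front.
func (c Config) Validate() error {
	if err := checkURL("mimir", c.MimirQueryURL); err != nil {
		return err
	}
	return checkURL("loki", c.LokiQueryURL)
}

// loadConfig reads the two env vars. provisioned is false when neither URL
// is set, in which case the surface would stay on its 501 stub.
func loadConfig(getenv func(string) string) (cfg Config, provisioned bool, err error) {
	cfg = Config{
		MimirQueryURL: getenv("PLEXSPHERE_OBS_MIMIR_QUERY_URL"),
		LokiQueryURL:  getenv("PLEXSPHERE_OBS_LOKI_QUERY_URL"),
	}.WithDefaults()
	if err := cfg.Validate(); err != nil {
		return Config{}, false, err
	}
	return cfg, cfg.MimirQueryURL != "" || cfg.LokiQueryURL != "", nil
}

func main() {
	cfg, provisioned, err := loadConfig(os.Getenv)
	fmt.Println(cfg, provisioned, err)
}
```

Taking `getenv` as a parameter keeps the loader testable without touching the process environment.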
Cross-references
- `./ingest.md`: the ingest front door that admits and buffers the telemetry this proxy reads back.
- `./routing.md`: the egress half that delivers telemetry to Grafana Mimir / Grafana Loki, the backends this proxy queries.
- `./alerts.md`: the Alert Rule context whose stored signals name the same series an operator queries here.
- `./incidents.md`: the Incident context an operator investigates with these queries.
- `../../reference/cli/plexctl/metrics.md` and `../../reference/cli/plexctl/logs.md`: the `plexctl` query CLI references.
- `../index.md`: the bounded-contexts landing page.
- `../../../internal/observability/query`: the bounded-context root that pins the ubiquitous language.
- `../../../api/openapi/plexsphere-v1.yaml`: the OpenAPI spec the two query operations originate from.