Alert Rules — per-Domain stored alerting definitions

This document is the authoritative bounded-context reference for the Alert Rule context — the half of the observability stack where a Domain declares the alert rules its telemetry should be evaluated against. An alert rule pairs an observability signal (a metric or log series expression in the backend's query language) with a comparator and a numeric threshold, so the rule describes the crossing a downstream evaluator should fire on, at a declared severity. The context owns the AlertRule aggregate, its value objects, the per-Domain name-uniqueness invariant, the five CRUD HTTP operations, and the persistence table that durably stores the operator's alerting intent. The domain root that pins the ubiquitous language is ../../../internal/observability/alerts.

The platform stores rules; it does not evaluate or fire them. This is the single most important scope boundary of the context. plexsphere is the durable system of record for the alert definitions an operator authors — their signal, comparator, threshold, severity, and enabled state — and nothing more. The evaluation that watches a live metric series, decides a threshold has been crossed, and raises a firing alert lives entirely downstream in the bundled Grafana Mimir / Grafana stack, which reads the telemetry the ingest and routing halves deliver. A reader should never describe this context as evaluating, triggering, or firing a rule: it persists the rule and serves it back over /v1, and the enabled flag merely records whether the operator wants the rule to participate in that downstream evaluation without forcing a delete and re-author.

Ubiquitous language

The terms below travel verbatim across the domain root, the aggregate, the value objects, the application service, the repository port, and the transport surface. Documentation, JSON fields, and persisted columns adopt the exact spelling.

| Term | Definition | Code anchor |
|---|---|---|
| AlertRule | The aggregate root: one Domain-scoped alert rule. It records the Signal it watches, the Comparator and Threshold that describe the crossing it should be evaluated for, the Severity it should be raised at, whether it is enabled, and its persistence timestamps. Fields are unexported so the invariants are only established through New (a fresh rule) or Hydrate (a persisted one). | ../../../internal/observability/alerts/alert_rule.go (AlertRule) |
| Signal | The metric or log series expression the rule watches, expressed in the backend's query language. It is a free-form bounded string (non-empty after trimming, at most 1024 runes); the context does not parse or evaluate it, it stores it verbatim for the downstream evaluator. | ../../../internal/observability/alerts/alert_rule.go (AlertRule.Signal) |
| Comparator | The closed set of directions in which a rule's Signal is compared against its Threshold: gt, gte, lt, lte. String-backed so it JSON-encodes and persists as its wire token; the zero value is invalid by construction. | ../../../internal/observability/alerts/types.go (Comparator) |
| Threshold | The numeric value the Signal is compared against. A float64 that must be finite: NaN and both infinities are rejected so a non-finite threshold can never be persisted. | ../../../internal/observability/alerts/validation.go (validateThreshold) |
| Severity | The closed set of severities a rule is raised at: info, warning, critical. String-backed; the zero value is invalid by construction. | ../../../internal/observability/alerts/types.go (Severity) |
| enabled | A plain boolean recording whether the operator wants the rule to participate in downstream evaluation. A fresh rule is enabled; toggling it never fails (it carries no invariant beyond the type). | ../../../internal/observability/alerts/alert_rule.go (AlertRule.SetEnabled) |
| name | The Domain-unique logical name a rule is addressed and displayed by. Non-empty after trimming, at most 200 runes, and unique within a Domain; the persistence layer enforces the (domain_id, name) uniqueness. | ../../../internal/observability/alerts/validation.go (validateName) |
| Repository | The aggregate-shaped outbound persistence port the application service drives. Every method takes or returns whole AlertRule aggregates, never row structs, so the domain stays free of persistence concerns. | ../../../internal/observability/alerts/ports.go (Repository) |
| AuditSink | The optional outbound audit port the application service records each mutating decision through. A nil sink is tolerated and the service degrades silently. | ../../../internal/observability/alerts/ports.go (AuditSink) |
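
The "closed set, string-backed, zero value invalid" pattern described for Comparator (and, by analogy, Severity) can be sketched roughly as follows. This is a minimal illustration and not the contents of types.go: the constant names, the Valid helper, and the exact ParseComparator signature are assumptions; only the four wire tokens and the ErrUnknownComparator sentinel come from this document.

```go
package main

import (
	"errors"
	"fmt"
)

// Comparator is string-backed, so it encodes as its wire token.
// The zero value "" is not a member of the closed set.
type Comparator string

// Constant names are illustrative; only the tokens come from this document.
const (
	ComparatorGT  Comparator = "gt"
	ComparatorGTE Comparator = "gte"
	ComparatorLT  Comparator = "lt"
	ComparatorLTE Comparator = "lte"
)

// ErrUnknownComparator is the sentinel this document names.
var ErrUnknownComparator = errors.New("unknown comparator")

// Valid reports whether c is one of the four tokens (hypothetical helper).
func (c Comparator) Valid() bool {
	switch c {
	case ComparatorGT, ComparatorGTE, ComparatorLT, ComparatorLTE:
		return true
	default:
		return false
	}
}

// ParseComparator maps a wire token onto the closed set (signature assumed).
func ParseComparator(s string) (Comparator, error) {
	c := Comparator(s)
	if !c.Valid() {
		return "", fmt.Errorf("%w: %q", ErrUnknownComparator, s)
	}
	return c, nil
}

func main() {
	c, err := ParseComparator("gte")
	fmt.Println(c, err)

	_, err = ParseComparator("eq")
	fmt.Println(errors.Is(err, ErrUnknownComparator))
}
```

Because the empty string is never a member, a Comparator that was declared but never assigned fails validation instead of silently behaving like one of the four directions.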

The aggregate shape

An AlertRule carries exactly these fields, all unexported behind accessor methods:

| Field | Type | Meaning |
|---|---|---|
| id | AlertRuleID | The app-minted UUIDv7 identifier. |
| domainID | DomainID | The owning Domain (an external reference to the identity context's Domain aggregate). |
| name | string | The Domain-unique logical name. |
| signal | string | The metric / log series expression. |
| comparator | Comparator | The crossing direction (gt / gte / lt / lte). |
| threshold | float64 | The finite numeric breach point. |
| severity | Severity | The severity (info / warning / critical). |
| enabled | bool | Whether the rule participates in downstream evaluation. |
| createdAt / updatedAt | time.Time | Persistence timestamps; zero until persisted. |

There is no expression / for-duration / labels triple on the aggregate: a rule is the (signal, comparator, threshold, severity, enabled) shape above, plus its name and Domain scope. The signal field is the query-language expression; the comparator and threshold are stored separately rather than folded into one expression string.

The two constructors. New builds a fresh rule, mints a UUIDv7 id, sets it enabled, and leaves the timestamps zero (the repository stamps them on insert). Hydrate rebuilds a persisted rule from a scanned row without re-minting the id. Both run the same field validators, so a corrupt row can never produce a half-formed aggregate.

The mutators. Rename, SetSignal, SetComparator, SetThreshold, and SetSeverity each validate before applying and leave the receiver untouched on rejection; SetEnabled is the one mutator that cannot fail.

Invariants and validation

Every create-time and hydrate-time rejection (except the structural zero-id / zero-domain guards Hydrate adds) wraps the single parent sentinel ErrInvalidAlertRule, so the transport boundary maps one sentinel onto one 400 Problem code.

| Invariant | Rule | Rejection |
|---|---|---|
| Domain scope | domainID must be non-zero. | ErrInvalidAlertRule (New); a bare structural error (Hydrate). |
| Name | Non-empty after trimming whitespace; at most 200 runes. | ErrInvalidAlertRule |
| Signal | Non-empty after trimming whitespace; at most 1024 runes. | ErrInvalidAlertRule |
| Threshold | A finite float64; NaN, +Inf, and -Inf are rejected. | ErrInvalidAlertRule |
| Comparator | A member of the closed gt / gte / lt / lte set. | ErrInvalidAlertRule (or ErrUnknownComparator from ParseComparator). |
| Severity | A member of the closed info / warning / critical set. | ErrInvalidAlertRule (or ErrUnknownSeverity from ParseSeverity). |
| Name uniqueness | Each (domain_id, name) pair is unique within a Domain. | ErrAlertRuleConflict (enforced by the repository's unique index). |

The bound constants are not operator knobs: the 200-rune name cap and the 1024-rune signal cap are domain constants that cap the persisted columns so an unbounded value can never reach the database.
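
The field rules above can be sketched as a handful of small validators that all wrap the one parent sentinel. This is an illustration under stated assumptions, not the contents of validation.go: validateName and validateThreshold are named in this document but their signatures are guessed, validateSignal, validateBounded, and the constant identifiers are hypothetical, and counting runes on the untrimmed value is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"strings"
	"unicode/utf8"
)

// Domain constants from this document; the identifier names are assumed.
const (
	maxNameRunes   = 200
	maxSignalRunes = 1024
)

var ErrInvalidAlertRule = errors.New("invalid alert rule")

// validateBounded is a hypothetical shared helper: the value must be
// non-empty after trimming and at most limit runes. Counting runes on
// the untrimmed value is an assumption of this sketch.
func validateBounded(field, v string, limit int) error {
	if strings.TrimSpace(v) == "" {
		return fmt.Errorf("%w: %s must not be empty", ErrInvalidAlertRule, field)
	}
	if n := utf8.RuneCountInString(v); n > limit {
		return fmt.Errorf("%w: %s has %d runes, limit is %d", ErrInvalidAlertRule, field, n, limit)
	}
	return nil
}

func validateName(v string) error   { return validateBounded("name", v, maxNameRunes) }
func validateSignal(v string) error { return validateBounded("signal", v, maxSignalRunes) }

// validateThreshold rejects NaN, +Inf, and -Inf.
func validateThreshold(v float64) error {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return fmt.Errorf("%w: threshold must be finite", ErrInvalidAlertRule)
	}
	return nil
}

func main() {
	fmt.Println(validateName("   "))
	fmt.Println(validateSignal("rate(http_requests_total[5m])"))
	err := validateThreshold(math.Inf(1))
	fmt.Println(errors.Is(err, ErrInvalidAlertRule))
}
```

The caps are measured in runes rather than bytes, so a 200-character name made of multi-byte characters still passes. Wrapping with %w keeps errors.Is(err, ErrInvalidAlertRule) true for every rejection, which is what lets a transport boundary map them all onto one 400 code.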

The application service

The application service in ../../../internal/observability/alerts/services orchestrates the aggregate against the Repository port. It exposes Create, Get, List, Update, and Delete. The constructor panics on a nil repository (a composition-root wiring bug), while the audit sink, clock, and logger are nil-tolerated options. Two behaviours are worth pinning:

  • Update is a load-mutate-persist cycle over an optional-field patch. It reads the rule, applies each supplied patch field through the aggregate's mutators (a nil pointer leaves the field untouched), and persists the result. A validation error from a mutator surfaces before any persistence.
  • Mutations emit a post-persist audit row through the sink when one is wired (alerts.create / alerts.update / alerts.delete); a flaky audit backend is logged, never propagated, so it can never turn a successful mutation into a user-visible error.
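
The load-mutate-persist cycle over an optional-field patch can be sketched as below. It is a simplified stand-in rather than the service's actual code: Rule, Patch, memRepo, and the Update signature are hypothetical, and the aggregate is cut down to two fields. The point it illustrates is that a nil pointer means "leave the field alone" and that a mutator rejection returns before the repository is written.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var (
	ErrInvalidAlertRule = errors.New("invalid alert rule")
	ErrNotFound         = errors.New("alert rule not found")
)

// Rule is a two-field stand-in for the AlertRule aggregate.
type Rule struct {
	threshold float64
	enabled   bool
}

// SetThreshold validates before applying, leaving r untouched on rejection.
func (r *Rule) SetThreshold(v float64) error {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return fmt.Errorf("%w: threshold must be finite", ErrInvalidAlertRule)
	}
	r.threshold = v
	return nil
}

// SetEnabled carries no invariant, so it cannot fail.
func (r *Rule) SetEnabled(on bool) { r.enabled = on }

// Patch uses pointers so "absent" (nil) differs from a zero value.
type Patch struct {
	Threshold *float64
	Enabled   *bool
}

// memRepo is an in-memory stand-in for the Repository port.
type memRepo struct {
	rules map[string]Rule
	saves int
}

func (m *memRepo) Get(id string) (Rule, error) {
	r, ok := m.rules[id]
	if !ok {
		return Rule{}, ErrNotFound
	}
	return r, nil
}

func (m *memRepo) Save(id string, r Rule) {
	m.rules[id] = r
	m.saves++
}

// Update loads, mutates through the aggregate's setters, then persists.
func Update(repo *memRepo, id string, p Patch) (Rule, error) {
	r, err := repo.Get(id)
	if err != nil {
		return Rule{}, err
	}
	if p.Threshold != nil {
		if err := r.SetThreshold(*p.Threshold); err != nil {
			return Rule{}, err // rejected before any write
		}
	}
	if p.Enabled != nil {
		r.SetEnabled(*p.Enabled)
	}
	repo.Save(id, r)
	return r, nil
}

func main() {
	repo := &memRepo{rules: map[string]Rule{"r1": {threshold: 0.9, enabled: true}}}

	off := false
	r, err := Update(repo, "r1", Patch{Enabled: &off})
	fmt.Println(r.threshold, r.enabled, err)

	nan := math.NaN()
	_, err = Update(repo, "r1", Patch{Threshold: &nan})
	fmt.Println(errors.Is(err, ErrInvalidAlertRule), repo.saves)
}
```

Pointer fields are one common way to model a PATCH body in Go: a plain bool could not distinguish "set enabled to false" from "enabled not supplied".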

The HTTP surface

The five CRUD operations live under /v1/domains/{domainId}/alert-rules and are implemented by the anti-corruption transport package ../../../internal/transport/http/v1/alerts. The transport package re-declares the domain port, the read-model DTO, and the error sentinels locally and never imports the domain module; the production adapter at the composition root translates the domain AlertRule and its sentinels onto the transport-local shapes.

| Operation | Method | Path |
|---|---|---|
| ListAlertRules | GET | /v1/domains/{domainId}/alert-rules |
| CreateAlertRule | POST | /v1/domains/{domainId}/alert-rules |
| GetAlertRule | GET | /v1/domains/{domainId}/alert-rules/{alertRuleId} |
| UpdateAlertRule | PATCH | /v1/domains/{domainId}/alert-rules/{alertRuleId} |
| DeleteAlertRule | DELETE | /v1/domains/{domainId}/alert-rules/{alertRuleId} |

The wire-contract origin is ../../../api/openapi/plexsphere-v1.yaml; this doc is a map of the surface, not a duplicate of the schema.

Cursor pagination. ListAlertRules is paginated transport-side: the handler applies an opaque cursor over the rule slice the service returns, so the list read stays stable as a Domain's rule set grows.
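
One way a handler could apply an opaque cursor over an in-memory slice is keyset pagination on the rule id, sketched below. This document does not specify the cursor encoding or the sort key, so everything here is an assumption: the Page function, the base64url encoding of the last-seen id, and the default limit are all hypothetical. Ids are plain strings for brevity.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"sort"
)

var ErrInvalidCursor = errors.New("invalid cursor")

// Page returns up to limit ids that sort strictly after the cursor
// position, plus the cursor for the next page ("" when exhausted).
func Page(ids []string, cursor string, limit int) ([]string, string, error) {
	if limit <= 0 {
		limit = 50 // hypothetical default
	}
	sorted := append([]string(nil), ids...) // copy; do not reorder the input
	sort.Strings(sorted)

	start := 0
	if cursor != "" {
		raw, err := base64.RawURLEncoding.DecodeString(cursor)
		if err != nil {
			return nil, "", ErrInvalidCursor
		}
		last := string(raw)
		// First index whose id is strictly greater than the last one served.
		start = sort.Search(len(sorted), func(i int) bool { return sorted[i] > last })
	}

	end := start + limit
	if end > len(sorted) {
		end = len(sorted)
	}
	page := sorted[start:end]

	next := ""
	if end < len(sorted) && len(page) > 0 {
		next = base64.RawURLEncoding.EncodeToString([]byte(page[len(page)-1]))
	}
	return page, next, nil
}

func main() {
	ids := []string{"c", "a", "b"}

	page, next, _ := Page(ids, "", 2)
	fmt.Println(page, next != "")

	page, next, _ = Page(ids, next, 2)
	fmt.Println(page, next == "")

	_, _, err := Page(ids, "%%%", 2)
	fmt.Println(errors.Is(err, ErrInvalidCursor))
}
```

Keying the cursor on the last id served, rather than on a numeric offset, means a rule inserted or deleted between two requests does not shift the following page. A cursor that fails to decode maps naturally onto a 400 invalid_cursor response.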

Order of checks (mutations). A mutating handler runs its gates in a fixed order, so an unauthorised caller never reaches the body decode or the create. In order:

  • a 501 when the surface is unwired (nil Service or Authz);
  • a 401 when the request carries no authenticated principal;
  • a 400 when the addressed Domain id is the zero UUID;
  • the ReBAC manage check on the addressed Domain, which runs before the body is decoded;
  • a 400 on a malformed body;
  • finally, the domain validation / name-conflict mapping, or a 201 / 200 on success.
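
The gate order above can be sketched as a single switch in which the first failing case wins. This is a simplified model of a create request, not the handler's code: createRequest and gate are hypothetical, the struct flattens what a real handler would derive from the request, and the final domain validation / name-conflict step is collapsed into the 201 success path.

```go
package main

import (
	"fmt"
	"net/http"
)

const zeroUUID = "00000000-0000-0000-0000-000000000000"

// createRequest flattens what the handler would inspect (hypothetical).
type createRequest struct {
	wired     bool   // Service and Authz are both non-nil
	principal string // empty when there is no authenticated principal
	domainID  string
	canManage bool // outcome of the ReBAC manage check
	bodyOK    bool // the body decodes into a rule document
}

// gate returns the first failing status in the documented order, or 201.
func gate(r createRequest) int {
	switch {
	case !r.wired:
		return http.StatusNotImplemented // 501
	case r.principal == "":
		return http.StatusUnauthorized // 401
	case r.domainID == zeroUUID:
		return http.StatusBadRequest // 400: invalid Domain id
	case !r.canManage:
		return http.StatusForbidden // 403: checked before the body is read
	case !r.bodyOK:
		return http.StatusBadRequest // 400: malformed body
	default:
		return http.StatusCreated // 201
	}
}

func main() {
	// No manage relation and a malformed body: the 403 wins.
	fmt.Println(gate(createRequest{
		wired: true, principal: "user:alice",
		domainID: "0190b1c2-0000-7000-8000-000000000001",
	}))
}
```

Placing the authorisation check ahead of the decode means a caller without the manage relation learns nothing about whether its body would have been accepted.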

Error-code taxonomy

Every failure surface carries a stable Problem.code. The closed set this surface emits, with its HTTP status:

| HTTP | Problem.code | Meaning |
|---|---|---|
| 400 | invalid_domain_id | The addressed Domain id is the zero UUID. |
| 400 | invalid_alert_rule_id | The addressed alert-rule id is malformed. |
| 400 | invalid_cursor | The list cursor is malformed. |
| 400 | invalid_body | The request body is not a valid alert-rule document. |
| 400 | alert_rule_invalid | The rule failed a domain validation rule (name, signal, threshold, comparator, or severity). |
| 401 | unauthenticated | The request carries no authenticated principal. |
| 403 | (PermissionDenied) | The caller lacks the required relation on the addressed Domain. |
| 404 | alert_rule_not_found | No alert rule with the requested id exists. |
| 409 | alert_rule_name_conflict | A create or rename collides with an existing rule's (domain, name) uniqueness. |
| 501 | alert_rules_not_provisioned | The surface is not wired in this build. |
| 500 | (generic) | An unexpected server-side fault; the underlying error is logged, never interpolated onto the wire. |

The transport-local sentinels the production adapter translates the domain errors onto are ErrAlertRuleNotFound (→ 404), ErrAlertRuleNameConflict (→ 409), and ErrAlertRuleInvalid (→ 400), declared in ../../../internal/transport/http/v1/alerts/errors.go.
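
The sentinel-to-status mapping can be sketched with errors.Is, which also matches sentinels that an adapter has wrapped with extra context. The three sentinel names and their statuses come from this document; problemFor, errDatabaseDown, and the empty code returned for the generic 500 are hypothetical, since the generic code string is not named here.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Transport-local sentinels named in this document.
var (
	ErrAlertRuleNotFound     = errors.New("alert rule not found")
	ErrAlertRuleNameConflict = errors.New("alert rule name conflict")
	ErrAlertRuleInvalid      = errors.New("alert rule invalid")

	// errDatabaseDown is a hypothetical unexpected fault for the demo.
	errDatabaseDown = errors.New("database unavailable")
)

// problemFor maps an error onto an HTTP status and a Problem.code.
func problemFor(err error) (status int, code string) {
	switch {
	case errors.Is(err, ErrAlertRuleNotFound):
		return http.StatusNotFound, "alert_rule_not_found"
	case errors.Is(err, ErrAlertRuleNameConflict):
		return http.StatusConflict, "alert_rule_name_conflict"
	case errors.Is(err, ErrAlertRuleInvalid):
		return http.StatusBadRequest, "alert_rule_invalid"
	default:
		// Generic 500: the underlying error is logged elsewhere and never
		// written to the response. The code string is left empty because
		// this document does not name it.
		return http.StatusInternalServerError, ""
	}
}

func main() {
	wrapped := fmt.Errorf("adapter: %w", ErrAlertRuleNameConflict)
	fmt.Println(problemFor(wrapped))
	fmt.Println(problemFor(errDatabaseDown))
}
```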

ReBAC posture

Reading an alert rule is gated on the Domain read relation; mutating one (create / update / delete) is gated on the Domain manage relation. The ReBAC object is the canonical domain:<uuid>, and the check runs before any read or mutation so an unauthorised caller never observes the existence side-channel a forward-then-check flow would leak.

Note the relation names: the codebase gates on the schema relations read and manage (matching the existing Domain surfaces and the tenancy schema). The OpenAPI 403 prose names the required permissions with the operator-facing labels domain-view (reads) and domain-edit (mutations); the gate uses the schema names. This is recorded in a DECISION block in ../../../internal/transport/http/v1/alerts/errors.go.

Audit contract

The surface emits a canonical (subject, relation, object, outcome, correlation_id) audit tuple on two paths:

  • A denial — a caller attempting an operation it lacks the relation for — writes one permission_denied row before the 403 response is flushed, stamping the missing relation into the row's caveat context so an auditor can pivot from the denial to the gate that fired.
  • A successful mutation — create / update / delete — writes one granted row with the verb-style relation (alerts.create / alerts.update / alerts.delete).

A nil sink degrades silently — the audit row is dropped while the security gate still fires — and a sink error is logged through slog, never propagated to the caller.
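
The best-effort behaviour described above can be sketched as a helper that has no return value, so neither a nil sink nor a failing one can alter the caller's outcome. The sketch is an assumption-laden illustration: the AuditSink method name and signature, the AuditRow struct, emitAudit, and failingSink are all hypothetical; only the tuple fields, the relation names, and the use of slog come from this document. It needs Go 1.21 or later for log/slog.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// AuditRow carries the tuple fields this document lists.
type AuditRow struct {
	Subject, Relation, Object, Outcome, CorrelationID string
}

// AuditSink is the outbound port; the method name and signature are assumed.
type AuditSink interface {
	Record(row AuditRow) error
}

// failingSink simulates a flaky audit backend.
type failingSink struct{ calls int }

func (f *failingSink) Record(AuditRow) error {
	f.calls++
	return errors.New("audit backend unavailable")
}

// emitAudit is best-effort: it returns nothing, so neither a nil sink
// nor a sink error can change what the caller reports to the user.
func emitAudit(sink AuditSink, log *slog.Logger, row AuditRow) {
	if sink == nil {
		return // no sink wired: the row is dropped silently
	}
	if log == nil {
		log = slog.Default()
	}
	if err := sink.Record(row); err != nil {
		log.Error("audit sink failed", "relation", row.Relation, "error", err)
	}
}

func main() {
	row := AuditRow{
		Subject: "user:alice", Relation: "alerts.create",
		Object: "domain:<uuid>", Outcome: "granted", CorrelationID: "req-1",
	}

	emitAudit(nil, nil, row) // nil sink: nothing happens

	sink := &failingSink{}
	emitAudit(sink, nil, row) // the error is logged, not returned
	fmt.Println("audit attempts:", sink.calls)
}
```

One Go-specific caveat: the sink == nil test only catches a nil interface value. A nil pointer stored in the interface would pass the check, so the wiring code should leave the sink unset when no audit backend is configured.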

Persistence

The single table is plexsphere.alert_rules, created by migration ../../../internal/platform/db/migrations/0052_alert_rules.sql. It is keyed on the app-minted UUID id, FKs the owning Domain ON DELETE CASCADE (a rule has no meaning once its Domain is gone), and enforces the (domain_id, name) uniqueness through a unique index. The comparator and severity columns carry SQL CHECK constraints pinning their closed sets, and enabled defaults to true. The migration's down arm refuses the downgrade with SQLSTATE 0A000: the table holds operator-authored alerting configuration, and dropping it would silently discard every authored rule.

Cross-references

  • ./ingest.md — the ingest front door that admits and buffers the telemetry a downstream evaluator reads when applying these rules.
  • ./routing.md — the egress half that delivers the buffered telemetry to Grafana Mimir / Grafana Loki, where rule evaluation actually happens.
  • ./incidents.md — the sibling Incident context, the operational record a firing alert may feed into.
  • ./query.md — the read-only metrics / logs query proxy an operator uses to explore the same series a rule's signal names.
  • ../../reference/cli/plexctl/alert.md — the plexctl alert CLI reference for managing rules from the terminal.
  • ../index.md — the bounded-contexts landing page.
  • ../../../internal/observability/alerts — the bounded-context root that pins the ubiquitous language.
  • ../../../api/openapi/plexsphere-v1.yaml — the OpenAPI spec the five CRUD operations originate from.