Incidents — per-Domain operational incidents with an append-only timeline

This document is the authoritative bounded-context reference for the Incident context — the per-Domain record of an operational incident and the append-only timeline of how it progressed. An incident carries a title, a severity, an open-then-resolved lifecycle, the instant it opened, the instant it resolved, and an ordered trail of timeline events (operator notes and status changes). The context owns the Incident aggregate, the TimelineEvent value object, the lifecycle state machine, and the two persistence tables. The domain root that pins the ubiquitous language is ../../../internal/observability/incidents.

The lifecycle has exactly two states: open and resolved. There is no acknowledged state. An incident is opened in the open state, accumulates timeline events while it is open, and transitions exactly once to the terminal resolved state. A reader should never describe an intermediate acknowledged step — the closed Status set in code is open | resolved and nothing more, and the persistence CHECK constraint encodes the same two values.

Ubiquitous language

The terms below travel verbatim across the domain root, the aggregate, the value objects, the application service, the repository port, and the transport surface. Documentation, JSON fields, and persisted columns adopt the exact spelling.

| Term | Definition | Code anchor |
| --- | --- | --- |
| Incident | The aggregate root: one per-Domain operational incident. It records its title, severity, open -> resolved lifecycle, the instant it opened, the instant it resolved (nil while open), and an append-only timeline. Fields are unexported so the invariants are only established through Open (a fresh incident) or Hydrate (a persisted one). | ../../../internal/observability/incidents/incident.go (Incident) |
| Status | The closed lifecycle state: open or resolved. The zero value is not a member, so IsValid rejects it. There is no acknowledged member. | ../../../internal/observability/incidents/types.go (Status) |
| Severity | The closed severity classification: info, warning, critical. Defined locally in this context rather than imported from the alerts context, so the bounded contexts stay decoupled. | ../../../internal/observability/incidents/types.go (Severity) |
| TimelineEvent | A value object recording one entry on the incident's append-only timeline: a note or a status-change marker, scoped to its parent Incident and stamped with the moment it occurred. Its invariants are established through NewTimelineEvent or HydrateTimelineEvent. | ../../../internal/observability/incidents/timeline_event.go (TimelineEvent) |
| TimelineEventKind | The closed kind set for a timeline event: note (a free-form operator note) or status_change (a lifecycle-transition marker). | ../../../internal/observability/incidents/types.go (TimelineEventKind) |
| Repository | The aggregate-shaped outbound persistence port the application service drives. It exposes whole Incident aggregates (and their timeline) in domain terms, leaking no ORM or query-builder types. | ../../../internal/observability/incidents/ports.go (Repository) |
| AuditSink | The optional outbound audit port the application service records each performed operation through. A nil sink is tolerated and the service degrades silently. | ../../../internal/observability/incidents/ports.go (AuditSink) |
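As an illustration of the closed sets named above, the following is a minimal sketch of Status and Severity with an IsValid guard that rejects the zero value. Backing the types with strings and the constant names (StatusOpen, SeverityInfo, and so on) are assumptions of this sketch; only the member spellings and the IsValid behaviour come from this document. TimelineEventKind would presumably follow the same pattern.

```go
package main

import "fmt"

// Status is the closed lifecycle set. A string backing type is an assumption
// of this sketch; the document only fixes the two member spellings.
type Status string

const (
	StatusOpen     Status = "open"
	StatusResolved Status = "resolved"
)

// IsValid reports membership in the closed set. The zero value "" is not a
// member, and neither is any other spelling such as "acknowledged".
func (s Status) IsValid() bool {
	return s == StatusOpen || s == StatusResolved
}

// Severity is the closed severity classification, declared locally.
type Severity string

const (
	SeverityInfo     Severity = "info"
	SeverityWarning  Severity = "warning"
	SeverityCritical Severity = "critical"
)

// IsValid reports membership in the closed severity set.
func (s Severity) IsValid() bool {
	switch s {
	case SeverityInfo, SeverityWarning, SeverityCritical:
		return true
	}
	return false
}

func main() {
	for _, s := range []Status{StatusOpen, StatusResolved, "", "acknowledged"} {
		fmt.Printf("status %q valid=%t\n", s, s.IsValid())
	}
	for _, s := range []Severity{SeverityInfo, "", "fatal"} {
		fmt.Printf("severity %q valid=%t\n", s, s.IsValid())
	}
}
```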

The aggregate shape

An Incident carries exactly these fields, all unexported behind accessor methods:

| Field | Type | Meaning |
| --- | --- | --- |
| id | IncidentID | The app-minted UUIDv7 identifier. |
| domainID | DomainID | The owning Domain (an external reference to the identity context's Domain aggregate). |
| title | string | The operator-facing label. |
| severity | Severity | The severity (info / warning / critical). |
| status | Status | The lifecycle state (open / resolved). |
| openedAt | time.Time | The instant the incident opened. |
| resolvedAt | *time.Time | The instant it resolved; nil while open. |
| timeline | []TimelineEvent | The append-only ordered trail. |

A TimelineEvent carries its own id, the parent incidentID, the kind, a free-form message, and the occurredAt instant.

The lifecycle state machine and its invariants

The incident lifecycle is a two-state, single-transition machine, and its invariants are enforced on the aggregate (the application service holds no business rules of its own):

  • open (initial). Open builds a fresh incident in the open state with a nil resolvedAt and an empty timeline, after enforcing the creation invariants — a non-zero Domain, a non-empty title within the rune bound, a member severity, and a set openedAt.
  • resolved (terminal). Resolve transitions open -> resolved and stamps resolvedAt. It enforces the single-resolve guard: a resolve against an already-resolved incident returns ErrIncidentAlreadyResolved.
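The two-state machine described above can be sketched as follows. This is a simplified illustration, not the real aggregate: the Incident here is trimmed to the fields the lifecycle touches, Open checks only two of the creation invariants, and the function signatures, the sampleInstant helper, and the error messages are assumptions of the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrIncidentAlreadyResolved is named in the document; its message is assumed.
var ErrIncidentAlreadyResolved = errors.New("incident already resolved")

type Status string

const (
	StatusOpen     Status = "open"
	StatusResolved Status = "resolved"
)

// Incident is trimmed to the fields the lifecycle touches.
type Incident struct {
	title      string
	status     Status
	openedAt   time.Time
	resolvedAt *time.Time
}

// Open builds a fresh incident in the open state with a nil resolvedAt.
// Only two creation invariants are sketched: a non-empty title and a set openedAt.
func Open(title string, openedAt time.Time) (*Incident, error) {
	if title == "" {
		return nil, errors.New("title must not be empty")
	}
	if openedAt.IsZero() {
		return nil, errors.New("openedAt must be set")
	}
	return &Incident{title: title, status: StatusOpen, openedAt: openedAt}, nil
}

// Resolve performs the single open -> resolved transition and stamps
// resolvedAt. A second call hits the single-resolve guard.
func (i *Incident) Resolve(at time.Time) error {
	if i.status == StatusResolved {
		return ErrIncidentAlreadyResolved
	}
	i.status = StatusResolved
	i.resolvedAt = &at
	return nil
}

// sampleInstant is a fixed, hypothetical instant used to keep the demo deterministic.
var sampleInstant = time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC)

func main() {
	inc, err := Open("API latency spike", sampleInstant)
	if err != nil {
		panic(err)
	}
	fmt.Println(inc.Resolve(sampleInstant.Add(time.Hour))) // first resolve succeeds
	fmt.Println(inc.Resolve(sampleInstant.Add(time.Hour))) // second resolve is rejected
}
```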

Three invariants bound the model:

| Invariant | Rule | Enforced by |
| --- | --- | --- |
| Append-only while open | A timeline event may only be appended while the incident is open; an append against a resolved incident returns ErrIncidentResolved. | Incident.AppendEvent |
| Single resolve | An incident transitions to resolved exactly once; a second resolve returns ErrIncidentAlreadyResolved. | Incident.Resolve |
| Status / resolved-at XOR | status == resolved if and only if resolvedAt != nil. A corrupt row asserting one without the other is rejected on hydrate. | Hydrate (and the persistence CHECK constraint) |
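A minimal sketch of two of these guards, assuming simplified signatures: Hydrate rejects a row whose status and resolvedAt disagree, and AppendEvent refuses to append once the incident is resolved. The errCorruptRow name and the reduced parameter lists are hypothetical; the real Hydrate presumably validates more than this one rule.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	// ErrIncidentResolved is named in the document; its message is assumed.
	ErrIncidentResolved = errors.New("incident is resolved")
	// errCorruptRow is a hypothetical name; the document only says the row is rejected.
	errCorruptRow = errors.New("status and resolvedAt disagree")
)

type Status string

const (
	StatusOpen     Status = "open"
	StatusResolved Status = "resolved"
)

type TimelineEvent struct {
	message    string
	occurredAt time.Time
}

type Incident struct {
	status     Status
	resolvedAt *time.Time
	timeline   []TimelineEvent
}

// Hydrate rebuilds a persisted incident. It rejects any row where
// "status is resolved" and "resolvedAt is set" are not both true or both false.
func Hydrate(status Status, resolvedAt *time.Time) (*Incident, error) {
	if (status == StatusResolved) != (resolvedAt != nil) {
		return nil, errCorruptRow
	}
	return &Incident{status: status, resolvedAt: resolvedAt}, nil
}

// AppendEvent appends to the timeline only while the incident is open.
func (i *Incident) AppendEvent(ev TimelineEvent) error {
	if i.status == StatusResolved {
		return ErrIncidentResolved
	}
	i.timeline = append(i.timeline, ev)
	return nil
}

// sampleInstant is a fixed, hypothetical instant for the demo.
var sampleInstant = time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC)

func main() {
	open, _ := Hydrate(StatusOpen, nil)
	fmt.Println(open.AppendEvent(TimelineEvent{message: "paged on-call", occurredAt: sampleInstant}))

	resolved, _ := Hydrate(StatusResolved, &sampleInstant)
	fmt.Println(resolved.AppendEvent(TimelineEvent{message: "late note", occurredAt: sampleInstant}))

	_, err := Hydrate(StatusResolved, nil) // resolved without a timestamp
	fmt.Println(err)
}
```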

The append-only ordered timeline. The timeline is never edited or pruned in place — it is the record of how an incident progressed. Events are appended in occurred-at order, the persisted trail is read in (incident_id, occurred_at) order, and the aggregate hands out a copy of its internal slice on Timeline() so a caller can never mutate the stored trail. The list projection of an incident drops the timeline entirely; only the single-incident read populates it.
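The defensive copy mentioned above might look like the sketch below. The types are reduced to a single field for illustration; only the Timeline() accessor name and its copy semantics come from this document.

```go
package main

import "fmt"

type TimelineEvent struct{ message string }

type Incident struct{ timeline []TimelineEvent }

// Timeline returns a copy of the internal slice, so writes to the returned
// slice (element assignment or append) never reach the stored trail.
func (i *Incident) Timeline() []TimelineEvent {
	out := make([]TimelineEvent, len(i.timeline))
	copy(out, i.timeline)
	return out
}

func main() {
	inc := &Incident{timeline: []TimelineEvent{{message: "opened"}, {message: "mitigated"}}}

	view := inc.Timeline()
	view[0].message = "tampered" // mutates only the caller's copy

	fmt.Println(inc.Timeline()[0].message) // prints "opened"
}
```

Returning the internal slice directly would let a caller overwrite elements in place, which is exactly what the append-only rule is meant to prevent.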

Bounds. A title is non-empty after trimming and at most 200 runes; a timeline-event message is non-empty after trimming and at most 4000 runes. Both bounds cap the persisted columns and are domain constants, not operator knobs.
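A hedged sketch of how such a bound could be checked. The constant names and the validateBounded helper are hypothetical, and measuring the trimmed value (rather than the raw input) is an assumption; the document fixes only the limits of 200 and 4000 runes and the non-empty-after-trimming rule. Counting with utf8.RuneCountInString rather than len matters because a multi-byte character counts as one rune but several bytes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Domain constants taken from the text; the identifier names are assumed.
const (
	MaxTitleRunes   = 200
	MaxMessageRunes = 4000
)

// validateBounded checks that value is non-empty after trimming and holds at
// most maxRunes runes. Measuring the trimmed value is an assumption of this sketch.
func validateBounded(field, value string, maxRunes int) error {
	trimmed := strings.TrimSpace(value)
	if trimmed == "" {
		return fmt.Errorf("%s must not be empty", field)
	}
	if n := utf8.RuneCountInString(trimmed); n > maxRunes {
		return fmt.Errorf("%s has %d runes, limit is %d", field, n, maxRunes)
	}
	return nil
}

func main() {
	// 200 two-byte runes is 400 bytes but still within the rune bound.
	atLimit := strings.Repeat("é", MaxTitleRunes)

	fmt.Println(validateBounded("title", "   ", MaxTitleRunes))
	fmt.Println(validateBounded("title", atLimit, MaxTitleRunes))
	fmt.Println(validateBounded("title", atLimit+"é", MaxTitleRunes))
	fmt.Println(validateBounded("message", "restarted the ingest workers", MaxMessageRunes))
}
```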

The application service

The application service in ../../../internal/observability/incidents/services orchestrates the aggregate against the Repository port. It exposes Open, AppendEvent, Resolve, Get, and List. The constructor panics on a nil repository (a composition-root wiring bug), while the clock, audit sink, and logger are nil-tolerated options. Two behaviours are worth pinning:

  • AppendEvent and Resolve are read-mutate-persist cycles. Each reads the incident first (surfacing ErrIncidentNotFound when absent), drives the mutation through the aggregate so the status guards fire, then persists.
  • Every operation emits a post-persist audit row through the sink when one is wired (incident.open / incident.append_event / incident.resolve); a flaky audit backend is logged, never propagated.
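The read-mutate-persist cycle and the nil-tolerant audit step can be sketched roughly as below. Everything about the port shapes here is an assumption made for illustration: the two-method Repository, the single-method AuditSink, the AuditSinkFunc adapter, the string ids, and the in-memory memRepo are all hypothetical. Only the ordering (read, mutate through the aggregate, persist, then audit), the panic on a nil repository, and the "logged, never propagated" rule come from this document.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var (
	ErrIncidentNotFound        = errors.New("incident not found")
	ErrIncidentAlreadyResolved = errors.New("incident already resolved")
)

// Incident is reduced to the single-resolve guard.
type Incident struct{ resolved bool }

func (i *Incident) Resolve() error {
	if i.resolved {
		return ErrIncidentAlreadyResolved
	}
	i.resolved = true
	return nil
}

// Repository and AuditSink are hypothetical, reduced port shapes.
type Repository interface {
	Get(id string) (*Incident, error)
	Save(id string, inc *Incident) error
}

type AuditSink interface {
	Record(relation, incidentID string) error
}

// AuditSinkFunc adapts a plain function to AuditSink (illustrative helper).
type AuditSinkFunc func(relation, incidentID string) error

func (f AuditSinkFunc) Record(relation, incidentID string) error { return f(relation, incidentID) }

type Service struct {
	repo  Repository
	audit AuditSink // nil is tolerated
}

// NewService panics on a nil repository, treating it as a wiring bug.
func NewService(repo Repository, audit AuditSink) *Service {
	if repo == nil {
		panic("incidents: nil repository")
	}
	return &Service{repo: repo, audit: audit}
}

// Resolve reads, mutates through the aggregate, persists, then audits.
func (s *Service) Resolve(id string) error {
	inc, err := s.repo.Get(id) // read: surfaces ErrIncidentNotFound
	if err != nil {
		return err
	}
	if err := inc.Resolve(); err != nil { // mutate: the status guard fires here
		return err
	}
	if err := s.repo.Save(id, inc); err != nil { // persist
		return err
	}
	if s.audit != nil { // post-persist audit; a sink error is logged, never returned
		if err := s.audit.Record("incident.resolve", id); err != nil {
			log.Printf("audit sink failed: %v", err)
		}
	}
	return nil
}

// memRepo is an in-memory stand-in for the persistence adapter.
type memRepo map[string]*Incident

func (m memRepo) Get(id string) (*Incident, error) {
	inc, ok := m[id]
	if !ok {
		return nil, ErrIncidentNotFound
	}
	return inc, nil
}

func (m memRepo) Save(id string, inc *Incident) error { m[id] = inc; return nil }

func main() {
	svc := NewService(memRepo{"inc-1": {}}, nil) // no audit sink wired
	fmt.Println(svc.Resolve("inc-1"))
	fmt.Println(svc.Resolve("inc-1"))
	fmt.Println(svc.Resolve("missing"))
}
```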

The HTTP surface

The five operations live under /v1/domains/{domainId}/incidents and are implemented by the anti-corruption transport package ../../../internal/transport/http/v1/incidents. The transport package re-declares the domain port, the read-model DTOs, and the error sentinels locally and never imports the domain module; the production adapter at the composition root translates the domain Incident / TimelineEvent and their sentinels onto the transport-local shapes.

| Operation | Method | Path |
| --- | --- | --- |
| ListIncidents | GET | /v1/domains/{domainId}/incidents |
| OpenIncident | POST | /v1/domains/{domainId}/incidents |
| GetIncident | GET | /v1/domains/{domainId}/incidents/{incidentId} |
| AppendIncidentEvent | POST | /v1/domains/{domainId}/incidents/{incidentId}/events |
| ResolveIncident | POST | /v1/domains/{domainId}/incidents/{incidentId}:resolve |

The wire-contract origin is ../../../api/openapi/plexsphere-v1.yaml; this doc is a map of the surface, not a duplicate of the schema.

Cursor pagination. ListIncidents is paginated transport-side over the header projection (the list omits the timeline), so the read stays stable as a Domain's incident set grows.

Error-code taxonomy

Every failure surface carries a stable Problem.code. The closed set this surface emits, with its HTTP status:

| HTTP | Problem.code | Meaning |
| --- | --- | --- |
| 400 | invalid_domain_id | The addressed Domain id is the zero UUID. |
| 400 | invalid_incident_id | The addressed incident id is malformed. |
| 400 | invalid_cursor | The list cursor is malformed. |
| 400 | invalid_body | The request body is not a valid document. |
| 400 | incident_invalid | The open-time body failed the aggregate's validation (bad title or severity). |
| 400 | timeline_event_invalid | The append-time body failed the aggregate's validation (bad kind or message). |
| 401 | unauthenticated | The request carries no authenticated principal. |
| 403 | (PermissionDenied) | The caller lacks the required relation on the addressed Domain. |
| 404 | incident_not_found | No incident with the requested id exists. |
| 409 | incident_resolved | A timeline append was attempted against a resolved incident. |
| 409 | incident_already_resolved | A resolve was attempted against an already-resolved incident. |
| 501 | incidents_not_provisioned | The surface is not wired in this build. |

The transport-local sentinels the production adapter translates the domain errors onto are ErrIncidentNotFound (→ 404), ErrIncidentAlreadyResolved (→ 409), ErrIncidentResolved (→ 409), ErrIncidentInvalid (→ 400), and ErrTimelineEventInvalid (→ 400), declared in ../../../internal/transport/http/v1/incidents/errors.go.
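The sentinel-to-status mapping listed above can be sketched as a single switch. The problemFor helper and the error messages are hypothetical, and the fallback branch (500 with an internal_error code) is an assumption added so the function is total; this document does not list a code for unmapped errors.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Transport-local sentinels named in the document; messages are assumed.
var (
	ErrIncidentNotFound        = errors.New("incident not found")
	ErrIncidentAlreadyResolved = errors.New("incident already resolved")
	ErrIncidentResolved        = errors.New("incident resolved")
	ErrIncidentInvalid         = errors.New("incident invalid")
	ErrTimelineEventInvalid    = errors.New("timeline event invalid")
)

// problemFor maps a sentinel (possibly wrapped) onto its HTTP status and
// Problem.code. The default branch is an assumption of this sketch.
func problemFor(err error) (int, string) {
	switch {
	case errors.Is(err, ErrIncidentNotFound):
		return http.StatusNotFound, "incident_not_found"
	case errors.Is(err, ErrIncidentAlreadyResolved):
		return http.StatusConflict, "incident_already_resolved"
	case errors.Is(err, ErrIncidentResolved):
		return http.StatusConflict, "incident_resolved"
	case errors.Is(err, ErrIncidentInvalid):
		return http.StatusBadRequest, "incident_invalid"
	case errors.Is(err, ErrTimelineEventInvalid):
		return http.StatusBadRequest, "timeline_event_invalid"
	default:
		return http.StatusInternalServerError, "internal_error" // hypothetical fallback
	}
}

func main() {
	// errors.Is sees through wrapping added by intermediate layers.
	wrapped := fmt.Errorf("append event: %w", ErrIncidentResolved)
	status, code := problemFor(wrapped)
	fmt.Println(status, code) // prints "409 incident_resolved"
}
```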

ReBAC posture

Reading an incident (ListIncidents / GetIncident) is gated on the Domain read relation; mutating one (OpenIncident / AppendIncidentEvent / ResolveIncident) is gated on the Domain manage relation. The ReBAC object is the canonical domain:<uuid>, and the check runs before any read or mutation.
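As a rough illustration of this posture, the sketch below maps each operation to its relation and builds the domain:<uuid> object before asking a checker. The Checker function type, the relationFor and allowed helpers, and the subject format are all hypothetical; the real authorization port is not described in this document.

```go
package main

import "fmt"

// Checker is a hypothetical stand-in for the ReBAC check port.
type Checker func(subject, relation, object string) bool

// relationFor returns the schema relation an operation requires.
// Reads need "read"; mutations need "manage".
func relationFor(operation string) (string, bool) {
	switch operation {
	case "ListIncidents", "GetIncident":
		return "read", true
	case "OpenIncident", "AppendIncidentEvent", "ResolveIncident":
		return "manage", true
	default:
		return "", false
	}
}

// domainObject builds the canonical ReBAC object for a Domain.
func domainObject(domainID string) string { return "domain:" + domainID }

// allowed runs the check that must pass before any read or mutation.
// Unknown operations are denied.
func allowed(check Checker, subject, operation, domainID string) bool {
	relation, ok := relationFor(operation)
	if !ok {
		return false
	}
	return check(subject, relation, domainObject(domainID))
}

func main() {
	const domainID = "0190a3c2-0000-7000-8000-000000000001" // made-up id for the demo

	// A checker that grants only the read relation on one Domain.
	readOnly := Checker(func(subject, relation, object string) bool {
		return relation == "read" && object == domainObject(domainID)
	})

	fmt.Println(allowed(readOnly, "user:alice", "GetIncident", domainID))
	fmt.Println(allowed(readOnly, "user:alice", "ResolveIncident", domainID))
}
```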

As with the sibling surfaces, the codebase gates on the schema relations read and manage, while the OpenAPI 403 prose names the operator-facing labels domain-view / domain-edit; this divergence is recorded in a DECISION block in ../../../internal/transport/http/v1/incidents/errors.go.

Audit contract

The surface emits a canonical (subject, relation, object, outcome, correlation_id) audit tuple:

  • A denial writes one permission_denied row before the 403 response is flushed, stamping the missing relation into the row's caveat context.
  • A successful mutation — open / append / resolve — writes one granted row with the verb-style relation (incident.open / incident.append_event / incident.resolve).

A nil sink degrades silently — the row is dropped while the security gates still fire — and a sink error is logged, never propagated.

Persistence

The context owns two tables, created by migration ../../../internal/platform/db/migrations/0053_incidents.sql:

  • plexsphere.incidents — one row per incident, keyed on the app-minted UUID id, FKing the owning Domain ON DELETE CASCADE. The severity and status columns carry SQL CHECK constraints pinning their closed sets (status IN ('open', 'resolved')), and a CHECK encodes the lifecycle XOR: resolved_at is non-NULL exactly when status is resolved, so a half-resolved row is structurally impossible at rest.
  • plexsphere.incident_timeline — one row per timeline event, keyed on the app-minted UUID id, FKing the owning incident ON DELETE CASCADE. The kind column carries a CHECK pinning note / status_change. Rows are only ever appended; a composite (incident_id, occurred_at) index backs both the cascade and the in-order trail read.

The migration's down arm refuses the downgrade with SQLSTATE 0A000: the two tables hold the incident lifecycle record and its append-only audit trail, which a post-incident review depends on.

Cross-references