CI pipeline — jobs, triggers, and reproducing failures

This document is the canonical contributor guide for plexsphere's CI pipeline. It maps each job in .github/workflows/ci.yaml (and the fast-feedback .github/workflows/pr-smoke.yaml) to the laptop-mirrored make target that reproduces it byte-for-byte, names the tools each job needs on the runner, lists the artefacts the job publishes to the run summary, and records the branch-protection check name reviewers gate merges on. It then lists how to reproduce a red run locally for each of the per-tool failure modes reviewers see most often.

The pipeline is intentionally flat: every Makefile-driven job is a thin wrapper over the target of the same name, so make <job> on a fresh clone is the full reproduction story. The drift gate in tests/docs/ci_doc_drift_test.go asserts this document lists every workflow job and that every make command in the table below resolves to a real Makefile target — documenting a job here without wiring the target, or renaming the target without updating this file, fails CI.

The companion context for this document is docs/contributing/testing.md (the test-pyramid guide) and docs/contributing/toolchain.md (the pinned-tool and SHA-pinning policy this pipeline enforces).

Overview

Two workflows run on every pull request:

  • ci.yaml — the authoritative gate matrix. Runs on every pull_request (plus a nightly schedule and workflow_dispatch for the fuzz and dev-smoke lanes); every job is required by branch protection. There is deliberately no push trigger: a commit on main only lands through a merged PR whose run was already green, so re-running the validation matrix post-merge adds no signal. Future artifact-publishing jobs (container-image push, Helm-chart push) introduce their own push-gated lane when they arrive; the drift gate TestCIWorkflow_TriggersOnPullRequestNotPush in tests/workspace/ci_workflow_test.go blocks a bare push trigger on the validation matrix. Each job pins the Go toolchain via go-version-file: .go-version (the single source of truth enforced by tests/workspace/goworkdrift_test.go) and every third-party action pins a 40-hex commit SHA (policy recorded in docs/contributing/toolchain.md).
  • pr-smoke.yaml — the sub-minute feedback gate. Runs on pull_request only, cancels on force-push, and executes EXACTLY make lint, make tidy-check, and go build ./.... It is advisory-but-required: branch protection points at the check name pr-smoke so a reviewer sees a fast red signal before the full matrix completes.
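As a rough illustration of the trigger posture described above, the top of ci.yaml might look like the sketch below. It is not the committed workflow: the cron expression is an assumption (the only schedule time this guide states is the 03:00 UTC nightly used by the dev-smoke lanes), `<40-hex-commit-sha>` is a placeholder rather than a real pin, and the `needs:` / `if:` guards covered later are omitted for brevity.

```yaml
# Sketch only: shows the trigger shape and the toolchain pin, nothing else.
on:
  pull_request:
  schedule:
    - cron: "0 3 * * *"   # assumed value: nightly at 03:00 UTC
  workflow_dispatch:
  # Deliberately no `push:` trigger on the validation matrix.

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<40-hex-commit-sha>    # placeholder pin
      - uses: actions/setup-go@<40-hex-commit-sha>    # placeholder pin
        with:
          go-version-file: .go-version   # single source of truth for the Go toolchain
      - run: make lint
```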

The Makefile-keyed jobs below mirror the requiredCIJobs map in tests/workspace/ci_workflow_test.go: drift between the two is a test failure, not a style note.

Path-based CI skipping

CI has two layers of path-based filtering so a commit only pays for the jobs it actually affects. The rules below apply to both ci.yaml and pr-smoke.yaml.

Layer 1 — paths-ignore (skip the entire workflow)

paths-ignore is evaluated by GitHub before any job starts. A commit whose changed files all match paths-ignore triggers zero runner minutes, zero queue time, and zero notifications.

| Path pattern | Skipped in ci.yaml | Skipped in pr-smoke.yaml | Rationale |
| --- | --- | --- | --- |
| `.planwerk/**` | yes | yes | Planning JSON: no runnable artefacts, no code surface. A plan-only commit used to trigger the full matrix (~15 jobs); now it costs nothing. |
| `docs/**` | no | yes | pr-smoke tests code; docs-only PRs are handled by docs-check in ci.yaml. |
| `tools/docs/**` | no | yes | Same reasoning as `docs/**`. |
| `**/*.md` | no | yes | Catches README.md, CLAUDE.md, inline docs. |
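A minimal sketch of how these patterns could be declared, assuming (as the rationale column and the docs-only scenario described later imply) that ci.yaml ignores only the planning directory while pr-smoke.yaml ignores all four patterns. GitHub skips the workflow only when every changed file matches one of the listed patterns.

```yaml
# ci.yaml (sketch): only plan-only commits skip the whole workflow.
on:
  pull_request:
    paths-ignore:
      - ".planwerk/**"
---
# pr-smoke.yaml (sketch): plan-only and docs-only commits both skip it.
on:
  pull_request:
    paths-ignore:
      - ".planwerk/**"
      - "docs/**"
      - "tools/docs/**"
      - "**/*.md"
```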

Layer 2 — the changes: gate (per-job skipping)

Inside ci.yaml a cheap changes: job runs dorny/paths-filter and emits boolean outputs. Every other job depends on changes and guards itself with an if: expression:

| Output | Filter | Downstream jobs that require it |
| --- | --- | --- |
| code | `**` minus `.planwerk/**`, `docs/**`, `tools/docs/**`, `**/*.md`, `LICENSE*`, `CODE_OF_CONDUCT*` (AND semantics, see below) | lint, tidy, unit, vuln, race, integration, integration-cli, actionlint, authz-lint, hadolint, e2e, openapi-lint, generated-drift, image-scan, fuzz-selector, fuzz-signing-rotation |
| docs | `docs/**`, `tools/docs/**`, `**/*.md` | docs-check (also runs on `code == 'true'` so a Go-side rename that breaks a doc link still fires lychee) |

code and docs are computed by two separate dorny/paths-filter steps because the two need opposite matching semantics. dorny/paths-filter compiles each pattern into its own matcher and, with its default predicate-quantifier: some, a file matches a filter when it matches any single pattern. That is correct for docs and the per-context filters (positive include lists), but it makes an "everything except docs" filter impossible: the leading ** matches every file, so the !docs/** / **/*.md negations never subtract and code would be true for every PR — a docs-only PR would run the full matrix. The code step therefore sets predicate-quantifier: every, switching to AND semantics: a file matches code only when it satisfies all patterns (matches ** and is not under docs/, tools/docs/, .planwerk/, and is not a Markdown / LICENSE / CODE_OF_CONDUCT file).

The code filter covers the signing bounded context (internal/signing/**) via the unconditional ** glob — a change to any signing source, proto, or migration re-runs unit, integration, and generated-drift. The drift gate TestCIWorkflow_SigningPathsCoveredByCodeFilter rejects any PR that adds an !internal/signing/** negation or moves the code filter back to the default some quantifier.
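The two-step arrangement can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed job: the step ids (`docs`, `code`) and the `<40-hex-commit-sha>` pins are hypothetical placeholders, and the per-context filters are left out.

```yaml
# Sketch of the `changes` job; step ids and action pins are hypothetical.
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      code: ${{ steps.code.outputs.code }}
      docs: ${{ steps.docs.outputs.docs }}
    steps:
      - uses: actions/checkout@<40-hex-commit-sha>
      - id: docs
        uses: dorny/paths-filter@<40-hex-commit-sha>
        with:
          # Default quantifier (some): a file matches when ANY pattern matches.
          filters: |
            docs:
              - 'docs/**'
              - 'tools/docs/**'
              - '**/*.md'
      - id: code
        uses: dorny/paths-filter@<40-hex-commit-sha>
        with:
          # every: a file matches only when ALL patterns match, so the
          # negations actually subtract from the leading '**'.
          predicate-quantifier: every
          filters: |
            code:
              - '**'
              - '!.planwerk/**'
              - '!docs/**'
              - '!tools/docs/**'
              - '!**/*.md'
              - '!LICENSE*'
              - '!CODE_OF_CONDUCT*'
```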

What this means in practice:

  • Plan-only commit: paths-ignore skips both workflows entirely. No jobs run at all.
  • Docs-only PR: pr-smoke is skipped by paths-ignore. In ci.yaml, only the changes: gate and docs-check run (all other jobs see code == 'false' and skip).
  • Code-only commit: changes: emits code=true, docs=false. Every code-gated job runs, and docs-check runs too because its guard is docs OR code.
  • Mixed commit: both outputs are true; the full matrix runs.
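The "docs OR code" guard on docs-check could be written along these lines. This is a sketch assuming the output names `docs` and `code` from the table above; the step list is abbreviated and the pin is a placeholder.

```yaml
jobs:
  docs-check:
    needs: [changes]
    # Runs for docs changes, and also for code changes that might break a doc link.
    if: needs.changes.outputs.docs == 'true' || needs.changes.outputs.code == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<40-hex-commit-sha>   # placeholder pin
      - run: make docs-check
```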

Adding a new job

  1. Add the job under jobs: in ci.yaml with needs: [changes] and an if: expression referencing needs.changes.outputs.<flag>. The drift gate TestCIWorkflow_DownstreamJobsGuardOnChanges fails the PR if either piece is missing.
  2. If the new job needs its own change-category flag, extend both the filters: block and the outputs: map in changes:. The drift gate TestCIWorkflow_ChangesJobShape keeps them in sync.
  3. Update the per-job table in The jobs below and, when relevant, the Layer 2 output table above.
  4. If the job is expensive (testcontainers, kind, image build), stage it behind the fast gates — see Staged execution below — instead of letting it fan out in parallel.
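Steps 1 and 2 might look like the sketch below for a cheap, context-scoped job. Every name here is hypothetical and chosen for illustration only: the `my_context` flag, the `contexts` step id, the `internal/my-context/**` path, the `my-context` job, and the `make test-my-context` target do not exist in the repository.

```yaml
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      # Step 2: expose the new flag in the outputs map...
      my_context: ${{ steps.contexts.outputs.my_context }}
    steps:
      - id: contexts
        uses: dorny/paths-filter@<40-hex-commit-sha>   # placeholder pin
        with:
          # ...and declare the matching filter.
          filters: |
            my_context:
              - 'internal/my-context/**'

  my-context:
    # Step 1: depend on `changes` and guard on its output.
    needs: [changes]
    if: needs.changes.outputs.my_context == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<40-hex-commit-sha>     # placeholder pin
      - run: make test-my-context                      # hypothetical target
```

An expensive job would additionally list the fast gates in `needs:` and carry the status guards described under Staged execution.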

Forcing a full run

If you need the full matrix despite a docs-only diff (e.g. you want to dry-run a failing image-scan), amend the branch with a whitespace-only change to any code-matched file — the changes: gate will flip code to true and every job will run.

Staged execution (fast gates before expensive jobs)

The changes: gate decides whether a job is relevant to the diff, but it does not order the jobs. Historically every relevant job fanned out in parallel the instant changes finished, so a red lint still burned runners on the full testcontainers / kind / image fleet before the cheap signal ever came back.

The expensive jobs are therefore staged behind a tier of cheap, fast-feedback gates. A gate is a job that stays gated on changes alone; an expensive job lists the relevant gates in its needs: and only consumes a runner once they are green. The gate sets are:

| Gate set | Gate jobs | Expensive jobs that wait for it |
| --- | --- | --- |
| Go fast gates | lint, tidy, unit, cli, generated-drift | vuln, race, integration, integration-cli, e2e, image-scan, credentials-broker, cloud-credentials, management-fleet, blueprints, provisioning-broker, resource-adoption-migration, dr |

The remaining cheap jobs (actionlint, authz-lint, hadolint, openapi-lint, docs-check) and the opt-in lanes (fuzz-*, dev-smoke, dev-stack-smoke) keep needs: [changes] — gating them buys little and only adds latency.

Each staged job pairs the gate dependencies with a status-guarded if::

```yaml
needs: [changes, lint, tidy, unit, cli, generated-drift]
if: >-
  always() &&
  needs.changes.outputs.code == 'true' &&
  !contains(needs.*.result, 'failure') &&
  !contains(needs.*.result, 'cancelled')
```

  • always() keeps the job reachable when an upstream gate was skipped (e.g. a schedule / workflow_dispatch event where code is unset); without it a skipped dependency would cascade-skip the job through the implicit success().
  • !contains(needs.*.result, 'failure') short-circuits the expensive run the moment any gate fails — that is the resource saver.
  • !contains(needs.*.result, 'cancelled') keeps a cancelled run from spinning up new expensive work.
  • The path-filter clause (needs.changes.outputs.<flag>) stays so a docs-only or out-of-scope PR still skips the job entirely.

A failed gate stays red and blocks the merge on its own; the expensive jobs it gates are skipped (neutral), so branch protection is still satisfied only when the cheap signal is green. The staging is pinned by the drift gate TestCIWorkflow_ExpensiveJobsAwaitFastGates in tests/workspace/ci_workflow_test.go: dropping an expensive job back to needs: [changes], or removing a status guard, fails the PR. When you add a new expensive job, list its fast gates in needs:, add the three status guards to its if:, and extend expensiveJobGates in that test.

The jobs

| Job ID | Trigger | Local command | Required tools | Artefacts | Branch-protection name |
| --- | --- | --- | --- | --- | --- |
| changes | PR | none (path-filter only) | dorny/paths-filter (action) | | changes |
| lint | PR | `make lint` | Go 1.26, golangci-lint (source-built) | | lint |
| tidy | PR | `make tidy-check` | Go 1.26 | | tidy |
| unit | PR | `make test` | Go 1.26 | Codecov upload (`coverage/coverage.out`) | unit |
| vuln | PR | `make vuln` | Go 1.26, govulncheck | | vuln |
| race | PR | `make test-race` | Go 1.26 | | race |
| integration | PR | `make test-integration` | Go 1.26, Docker engine | testcontainers-log artefact | integration |
| actionlint | PR | `make actionlint` | Go 1.26, actionlint | | actionlint |
| authz-lint | PR | `make authz-lint` | Go 1.26, zed (source-built, pinned by ZED_VERSION) | | authz-lint |
| hadolint | PR | `make hadolint` | Docker (hadolint image) | | hadolint |
| e2e | PR | `make e2e` | Go 1.26, kind, chainsaw, Docker | | e2e |
| openapi-lint | PR | `make openapi-lint` | Go 1.26, Node (.nvmrc), Spectral | | openapi-lint |
| generated-drift | PR | `make generate-check` | Go 1.26, Node (.nvmrc) | | generated-drift |
| docs-check | PR | `make docs-check` | Node (.nvmrc), lychee (Rust release binary or Docker fallback) | | docs-check |
| sbom | tag push `v*` + workflow_dispatch (in release.yaml, not ci.yaml) | `make sbom` | Go 1.26, Docker, syft | sbom-plexsphere, sbom-plexsphere-signer (SPDX-JSON) | none (release lane, not branch-protection required) |
| image-scan | PR | `make image-scan` | Go 1.26, Docker, trivy | SARIF under `bin/scan/*.sarif` (upload to Code Scanning temporarily disabled, see §Reproducing failures) | image-scan |
| cli | PR | `make plexctl-build && go test ./cmd/plexctl/...` | Go 1.26 | Codecov upload (`coverage/plexctl-coverage.out`, flag plexctl) | cli |
| integration-cli | PR | `make test-cli-integration` | Go 1.26 | | integration-cli |
| fuzz-selector | PR (30s budget) + schedule (5 min budget) + workflow_dispatch | `make fuzz-selector` | Go 1.26 | Fuzz-corpus cache (`~/.cache/go-build/fuzz`) | fuzz-selector |
| fuzz-signing-rotation | PR (30s budget) + schedule (5 min budget) + workflow_dispatch | `make fuzz-signing-rotation` | Go 1.26 | Fuzz-corpus cache (`~/.cache/go-build/fuzz`) | fuzz-signing-rotation |
| credentials-broker | PR (gated by credentials_broker paths-filter output) | `make test-credentials-broker` | Go 1.26, Docker engine | | credentials-broker |
| cloud-credentials | PR (gated by cloud_credentials paths-filter output) | `make test-cloud-credentials` | Go 1.26, Docker engine | | cloud-credentials |
| management-fleet | PR (gated by management_fleet paths-filter output) | `make test-management-fleet` | Go 1.26, Docker engine | | management-fleet |
| blueprints | PR (gated by blueprints paths-filter output) | `make test-blueprints` | Go 1.26, Docker engine | | blueprints |
| provisioning-broker | PR (gated by provisioning_broker paths-filter output) | `make test-provisioning-broker` | Go 1.26, Docker engine | | provisioning-broker |
| resource-adoption-migration | PR (gated by resource_adoption_migration paths-filter output) | `make test-resource-adoption-migration` | Go 1.26, Docker engine | | resource-adoption-migration |
| dr | PR (gated by dr paths-filter output) | `make test-dr` | Go 1.26, Docker engine | | dr |
| pr-smoke | PR only | `make lint && make tidy-check && go build ./...` | Go 1.26, golangci-lint | | pr-smoke |
| dev-smoke | schedule (nightly 03:00 UTC) + workflow_dispatch + PR only when labelled dev | `make dev` (then `make dev-down`) | Go 1.26, kind, kubectl, Docker | | none (not branch-protection required) |
| dev-stack-smoke | schedule (nightly 03:00 UTC) + workflow_dispatch + PR only when labelled dev-stack | `make dev` (then golden-flow chainsaw, then `make dev-down`) | Go 1.26, kind, kubectl, Docker, chainsaw | | none (not branch-protection required) |

Every row whose local command starts with make resolves to a target declared in Makefile; the drift test TestCIDocLocalCommandsResolveToMakeTargets enforces that contract. The one non-Make row (pr-smoke) is exempt because its CI shape is guarded by a dedicated test (the pr_smoke_workflow_test.go suite). The dev-smoke row is deliberately NOT in the requiredCIJobs map: it runs on a schedule and on label-gated PRs only, is documented in docs/tutorials/set-up-local-plexsphere.md, and is guarded by TestCIWorkflow_DevSmokeJobShape / TestCIWorkflow_DevSmokeAlwaysRunsTeardown. The dev-stack-smoke row follows the same posture: schedule + opt-in PR label only, NOT branch-protection required, documented alongside the dev-stack contract in docs/reference/dev-stack/index.md, and guarded by the tests/workspace/dev_stack_ci_test.go shape gates.

The sbom row lives in a separate workflow, .github/workflows/release.yaml, and is the one job that does NOT run on a pull request. An SBOM is a record of what is actually shipped, so it is generated at the moment a release tag (v*) is cut — not on every code PR, where the artefact had no consumer and paid a full docker-build for nothing. ci.yaml's deliberate "no push trigger" posture reserved exactly this push-gated artifact lane; the release sbom job fills it. It is NOT in the requiredCIJobs map and is NOT branch-protection required; its shape is guarded by tests/workspace/release_workflow_test.go. The durable step — attaching each SBOM to the published image / GitHub Release as an in-toto attestation — lands together with the image-publish pipeline; until then the tag run uploads the two SPDX-JSON files as workflow artefacts.

Caching and runtime budget

The workflows are aggressively cached so a warm PR reruns in a fraction of the cold-run cost. The following table lists each cache layer, the path it covers, the invalidation key, and where the drift-gate lives when one is installed. Removing or narrowing any of the caches below is a runtime regression — revert or open a ticket before landing.

| Layer | Jobs | Path | Invalidation key | Drift gate |
| --- | --- | --- | --- | --- |
| Go module + build cache | every Go job | `~/go/pkg/mod`, `~/.cache/go-build` | go.sum (via actions/setup-go's built-in `cache: true`) | none (setup-go internal) |
| GOBIN (installed Go tools) | lint, vuln, actionlint, authz-lint, image-scan, generated-drift, pr-smoke, release sbom | `~/go/bin` | `hashFiles('.go-version', 'Makefile')` per-job scope (`go-bin-<job>-…`) | |
| npm global cache | openapi-lint, generated-drift, docs-check | setup-node's managed cache path | Respective package-lock.json via `cache: npm` + `cache-dependency-path` | |
| Lychee binary | docs-check | `~/.local/bin/lychee` | `env.LYCHEE_VERSION` | |
| Trivy vulnerability DB | image-scan | `~/.cache/trivy` | Stable `trivy-db-${{ runner.os }}` (trivy's own 24 h freshness check rolls the DB forward on-disk) | TestCIWorkflowTrivyDBCacheKeyIsStable |
| Buildx layer cache (GHA backend) | image-scan (ci.yaml), sbom (release.yaml) | GitHub Actions cache (`type=gha`) | Per-component scope docker-plexsphere / docker-plexsphere-signer; `mode=max` | TestCIWorkflowBuildxCachedJobsSetUpBuildx (image-scan), TestReleaseJobSetsUpBuildxAndGHACache (release sbom) |
| Concurrency (PR supersession) | every ci.yaml / pr-smoke.yaml job | | `${{ github.workflow }}-${{ github.ref }}` with `cancel-in-progress: ${{ github.event_name == 'pull_request' }}` | TestCIWorkflowDeclaresConcurrencyGroup |

Key invariants worth calling out:

  • Concurrency cancels superseded PR runs, NOT runs on main (the nightly schedule and workflow_dispatch lanes; ci.yaml has no push trigger). Cancelling on main would leave holes in the CI history and hide regressions. The drift gate TestCIWorkflowDeclaresConcurrencyGroup rejects a bare cancel-in-progress: true.
  • Per-job GOBIN cache scopes (go-bin-lint-…, go-bin-vuln-…, …) avoid cross-job thrashing. A shared namespace would cause jobs installing different tool subsets to overwrite each other's cache, forcing reinstalls on the next run.
  • Buildx GHA cache requires docker/setup-buildx-action. The stock docker driver on ubuntu-latest supports --load but not --cache-to type=gha; the image-scan job (and the release lane's sbom job) set up buildx first, then set BUILDX_GHA_CACHE=1 on the make step so Makefile's docker-build target expands per-component --cache-from / --cache-to type=gha,mode=max flags. Dropping either half silently turns the cache into a no-op — the drift gate TestCIWorkflowBuildxCachedJobsSetUpBuildx rejects that.
  • The lint job checks out at the default depth (1). golangci-lint does not consult git history (no new: / new-from-rev: / revgrep entries in .golangci.yml), so a full-history clone would be pure waste. TestCIWorkflowLintJobOmitsFetchDepth prevents a drift back to fetch-depth: 0.
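The concurrency and per-job GOBIN invariants can be sketched as below. The concurrency values are the ones listed in the table above; the exact layout of the cache key after the `go-bin-<job>-` prefix is an assumption (this guide elides it), and the action pins are placeholders.

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  # True only for pull_request events: superseded PR runs are cancelled,
  # scheduled and manually dispatched runs always finish.
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<40-hex-commit-sha>   # default fetch depth (1)
      - uses: actions/cache@<40-hex-commit-sha>      # placeholder pin
        with:
          path: ~/go/bin
          # Per-job scope ("lint") keeps other jobs from overwriting this entry.
          # Assumed key layout: prefix plus the hash of the two pin files.
          key: go-bin-lint-${{ hashFiles('.go-version', 'Makefile') }}
      - run: make lint
```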

Clearing a cache while debugging

If a cache entry is ever poisoned (e.g. a bad golangci-lint binary, a corrupt trivy DB shard), rotate the relevant pin rather than trying to delete the cache from the GitHub UI — a fresh key forces a fresh entry and the LRU eviction takes care of the old one:

  • GOBIN cache — touch Makefile (any change invalidates the key).
  • npm cache — re-run npm install locally, commit the updated package-lock.json.
  • Lychee binary — bump LYCHEE_VERSION in both the docs-check job's job-level env: block and Makefile.
  • Trivy DB — the cache key is stable (trivy-db-<os>), so a poisoned entry does NOT roll over automatically. Append a suffix to the key in ci.yaml's Cache Trivy vulnerability database step (e.g. trivy-db-${{ runner.os }}-v2) and leave restore-keys untouched so everyday runs keep warming from the prior snapshot. Trivy's own 24 h freshness check refreshes the on-disk DB between key rotations, so this is only needed after a DB-schema break.
  • Buildx GHA cache — force a cache miss by changing the scope name in Makefile's BUILDX_CACHE_ARGS_* variables, or use the gh cache delete CLI.
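For the Trivy case, a rotated cache step might look like the following sketch. The step name and cache path come from this guide; the `restore-keys` value and the action pin are assumptions, since the committed step is not reproduced here.

```yaml
- name: Cache Trivy vulnerability database
  uses: actions/cache@<40-hex-commit-sha>   # placeholder pin
  with:
    path: ~/.cache/trivy
    # Rotated key: the "-v2" suffix forces a fresh entry to be saved.
    key: trivy-db-${{ runner.os }}-v2
    # Assumed existing value, left untouched by the rotation.
    restore-keys: |
      trivy-db-${{ runner.os }}
```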

Reproducing failures

Each subsection below walks through the minimum steps to reproduce a red run locally. The goal is a sub-minute loop for the fast jobs and a sub-ten-minute loop for the container-heavy ones.

lint

```bash
# Install the pinned golangci-lint (same line CI runs — see ci.yaml).
GOTOOLCHAIN=local go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.11.4

# Reproduce:
make lint
```

make lint runs golangci-lint on the root module, then make depguard-all, then the meta-tests under tests/workspace/ and tests/docs/. If a per-file finding is too noisy to fix inline, read docs/contributing/toolchain.md#static-analysis-golangci-lint for the //nolint:<name> // <reason> escape hatch — the reason is required, and reviewers will push back on a bare disable.

make depguard-all is a per-module sweep: golangci-lint run ./... is module-scoped, so a single root invocation lints only the root module and never reaches the bounded-context (internal/<ctx>), cmd/, or tests/ modules — each of which is its own Go module. The sweep iterates go list -m and runs golangci-lint with --enable-only depguard inside every workspace module, so the cross-context, persistence, and net/http boundary rules in .golangci.yml are actually enforced where they apply. To reproduce a depguard-only failure without the rest of the lint gate:

```bash
make depguard-all
```

openapi-lint

```bash
# The Node toolchain is pinned by the repo-root .nvmrc the job reads.
# `nvm install` installs that version if it is missing, then selects it.
nvm install "$(cat .nvmrc)"

# Reproduce:
make openapi-lint
```

make openapi-lint drives Spectral against api/openapi/plexsphere-v1.yaml with the ruleset at tools/openapi/.spectral.yaml. A failure here is authored, not generated — fix the YAML. The authoring workflow is documented at docs/contributing/openapi.md.

kind / chainsaw (e2e)

```bash
# The Makefile builds plexsphere images, boots kind, and side-loads
# the e2e images before chainsaw runs — no extra setup needed.
make e2e
```

A failed chainsaw step prints the failing manifest path and the namespace; re-running with chainsaw test --test-dir <dir> narrows the loop while you iterate. The image-load scripts tests/e2e/bootstrap/kind-load.sh, tests/e2e/openapi/kind-load-no-v1.sh, and tests/e2e/messaging/kind-load.sh are idempotent, so you can rerun them without tearing the cluster down.

docs-check

```bash
# Node toolchain from the canonical .nvmrc.
# `nvm install` installs that version if it is missing, then selects it.
nvm install "$(cat .nvmrc)"

# Reproduce:
make docs-check
```

make docs-check runs markdownlint-cli2 over docs/**/*.md README.md CLAUDE.md and then lychee over the same set to verify every relative link resolves. The lychee binary is resolved in priority order: a local release binary, then a pinned go install build, then the official Docker image fallback. Each failure line names the offending file, the line number, and either the broken link or the markdown rule that fired — fix the file and re-run the target until the output is clean.

sbom

```bash
# Docker must be running — `make sbom` calls `$(MAKE) docker-build`
# to materialise the two images syft scans.
make sbom
```

make sbom does NOT run on a pull request — it runs in the release lane (.github/workflows/release.yaml) on a v* tag push (or workflow_dispatch), because an SBOM is a record of what is actually shipped and only means something at the tag. The make recipe is identical whether you invoke it on a laptop or the release lane invokes it.

make docker-build tags each built image TWICE: once with the branch-tracking :$(VERSION) reference, and once with the stable :$(CI_TAG) alias (default ci). make sbom scans the :ci tag so the scanned reference is byte-identical across a laptop run and a CI run regardless of the repo's VERSION state. Override the alias with make sbom CI_TAG=<alias> when experimenting, but the committed default is what the workspace tests assert. The target emits bin/sbom/plexsphere.spdx.json and bin/sbom/plexsphere-signer.spdx.json; the release-lane job uploads each as a named artefact (sbom-plexsphere, sbom-plexsphere-signer) so Dependency-Track-style ingestion tools can consume one SBOM per image. A red run still ships whatever syft emitted before the failure — the two upload steps are unconditional on purpose.

image-scan

```bash
# Same docker dependency as sbom — make image-scan shares the
# docker-build prerequisite.
make image-scan
```

make image-scan runs Trivy at --severity HIGH,CRITICAL --exit-code 1 --ignore-unfixed --ignorefile .trivyignore --format sarif against the stable plexsphere:ci / plexsphere-signer:ci aliases that make docker-build produces (see the sbom section above for the tag rationale) and writes SARIF to bin/scan/<image>.sarif. A non-zero exit is the gate firing on a real finding; suppressing it is a documented policy change that requires a justified entry in .trivyignore (the tests/workspace/trivyignore_test.go gate rejects an un-justified entry). The CI job would normally upload SARIF to the Security tab via github/codeql-action/upload-sarif even on a failing run so findings remain visible, but that step is temporarily commented out in .github/workflows/ci.yaml because this repository is private and does not have GitHub Advanced Security enabled (upload-sarif fails with "Advanced Security must be enabled for this repository to use code scanning"). The HIGH/CRITICAL gate still fires in CI; only the Security-tab dashboard is unavailable until the repo goes public or Advanced Security is licensed. Re-enable the upload by uncommenting the two Upload Trivy SARIF steps in the image-scan job.

generated-drift

```bash
# Same Node toolchain as openapi-lint.
# `nvm install` installs that version if it is missing, then selects it.
nvm install "$(cat .nvmrc)"

# Regenerate locally, then re-check:
make generate
make generate-check
```

A failure means the committed OpenAPI Go artefacts (the generated types, server interface, and client) do not match a fresh make generate run. The fix is always to regenerate, stage the diff, and push — never to hand-edit a generated file. See docs/contributing/openapi.md for the full authoring loop.

Cross-references