Task 23 Execution Packet: Chaos Testing Suite (Aligned to 2026-01-25 Plan)

This packet replaces the old Task 23 notes. It is aligned to Phase 16 / Task 23 in docs/plans/2026-01-25-end-state.md and the current repo infrastructure.

Use this as the exact execution guide for an agent implementing Task 23.


Objective

Add a chaos test suite that validates platform invariants under failure:

  1. NATS partition across clusters (mesh)
  2. OPA restart during active policy-gateway traffic
  3. OpenObserve ingest stall (via otel-collector)
  4. Postgres restart with outbox recovery

Non-negotiables


Step 0 — Ground current infra reality (resolved)

Based on repo state:

Default ports for chaos stack (match dev unless mesh is used):


Step 1 — Add a dedicated chaos compose file

Create:

Include services (with profiles):

Rule: services must be profile-gated so scenarios run only what they need.

Port consistency: use the default ports listed in Step 0 unless a scenario explicitly requires alternate ports.


Step 2 — Harness structure

Create:

tests/chaos/
  run_chaos.py
  scenarios/
    nats_partition_mesh.py
    opa_restart.py
    openobserve_stall.py
    postgres_restart_outbox.py
  probes/
    http_probe.py
    nats_probe.py
    jetstream_probe.py
    db_probe.py

Add a manual just chaos recipe (do not add to just ci).
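
One way the scenario modules listed above could plug into run_chaos.py is a small inject/check/restore lifecycle. This is only a sketch: the `Scenario` dataclass and `run_scenario` function are hypothetical names, not part of the plan or the repo. The point it illustrates is that the fault is always undone, even when the probe raises.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """Hypothetical container for the three phases of one chaos scenario."""
    name: str
    inject: Callable[[], None]    # introduce the fault
    check: Callable[[], bool]     # return True if the invariant held
    restore: Callable[[], None]   # undo the fault


def run_scenario(scenario: Scenario) -> bool:
    """Run inject then check; always run restore, even if a phase raises."""
    try:
        scenario.inject()
        return scenario.check()
    finally:
        scenario.restore()
```

Keeping the restore step in a `finally` block means a failed or crashing probe does not leave a container partitioned or stopped for the next scenario.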


Step 3 — Scenario definitions (explicit, deterministic)

A) nats_partition_mesh

Compose file: infra/docker/docker-compose.mesh.yml (from Task 15)

Fault injection: block gateway traffic inside one node container (e.g., sea-nats-a-1) by dropping port 7522.

Invariant:

Probe:

Defaults:

B) opa_restart

Compose file: infra/docker/docker-compose.chaos.yml with profile chaos-opa

Fault injection: docker restart sea-opa.
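
After a restart, a scenario usually has to wait for the service to come back before judging the invariant. The helper below is a sketch of a bounded polling loop; `wait_until` and its parameters are hypothetical, and the readiness check itself is passed in by the caller, since this packet does not specify the health endpoint. The clock and sleep functions are injectable so the loop can be tested without real delays.

```python
import time
from typing import Callable


def wait_until(ready: Callable[[], bool],
               timeout_s: float = 30.0,
               interval_s: float = 0.5,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll ready() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout, so the caller can turn
    the result directly into a pass/fail decision.
    """
    deadline = clock() + timeout_s
    while True:
        if ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A fixed timeout keeps the outcome deterministic: the scenario either recovers within the window or fails, rather than hanging.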

Invariant:

C) openobserve_stall

Compose file: infra/docker/docker-compose.chaos.yml with profile chaos-observe

Fault injection: stop sea-openobserve container.

Invariant (collector-based, resolved):

D) postgres_restart_outbox

Compose file: infra/docker/docker-compose.chaos.yml with profile chaos-outbox

Fault injection: docker restart <postgres>.

Invariant:

Probe path (resolved):


Step 4 — Probe implementation (low LoE)

Prefer CLI-based probes for determinism:

If Python clients are used, keep deps minimal and pinned in tests/chaos/requirements.txt.
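
A CLI-based probe can be as small as a subprocess call with a hard timeout. The sketch below uses only the standard library; `ProbeResult` and `run_cli_probe` are hypothetical names, and the actual command lines for each probe are left to the scenario.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    ok: bool      # True only if the command exited with status 0
    output: str   # combined stdout and stderr, stripped


def run_cli_probe(argv: Sequence[str], timeout_s: float = 10.0) -> ProbeResult:
    """Run one CLI probe; a timeout or missing binary counts as a failure."""
    try:
        proc = subprocess.run(list(argv), capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ProbeResult(ok=False, output=f"timed out after {timeout_s}s")
    except FileNotFoundError:
        return ProbeResult(ok=False, output=f"command not found: {argv[0]}")
    return ProbeResult(ok=proc.returncode == 0,
                       output=(proc.stdout + proc.stderr).strip())


if __name__ == "__main__":
    # Stand-in command for illustration; a real probe would call a service CLI.
    print(run_cli_probe([sys.executable, "-c", "print('ok')"]))
```

Treating timeouts and missing binaries as ordinary failures, rather than exceptions, keeps every probe outcome a plain pass or fail.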


Step 5 — Deterministic pass/fail criteria

Each scenario must:


Step 6 — Commands

Add a manual recipe (not in just ci):

chaos scenario="nats_partition_mesh":
  CHAOS_SCENARIO={{scenario}} python tests/chaos/run_chaos.py
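
On the Python side, run_chaos.py would need to read the CHAOS_SCENARIO environment variable and reject unknown names. A minimal sketch follows, assuming the scenario names match the module names under tests/chaos/scenarios/; `resolve_scenario` is a hypothetical helper.

```python
import os
from typing import Mapping

# Assumed to mirror the files under tests/chaos/scenarios/.
SCENARIOS = (
    "nats_partition_mesh",
    "opa_restart",
    "openobserve_stall",
    "postgres_restart_outbox",
)


def resolve_scenario(env: Mapping[str, str] = os.environ) -> str:
    """Return the requested scenario name, or exit with a clear error."""
    name = env.get("CHAOS_SCENARIO", "").strip()
    if name not in SCENARIOS:
        raise SystemExit(
            f"unknown CHAOS_SCENARIO {name!r}; "
            f"expected one of: {', '.join(SCENARIOS)}"
        )
    return name
```

Failing fast on an empty or misspelled name gives a non-zero exit code before any container is touched.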

Acceptance criteria (from plan)