Observability Characterization
1. Can you reconstruct a single request end-to-end?
- Answer: Yes
- Evidence:
- `sea-mq-worker` explicitly reads the `Correlation-Id` and `Causation-Id` headers from NATS messages and propagates them into `tracing::span!` (in `apps/sea-mq-worker/src/inbox_consumer.rs`).
- These values are persisted in the `outbox_events` and `inbox_messages` tables, allowing SQL-based reconstruction of async flows.
- Python services use `FastAPIInstrumentor`, ensuring W3C trace context compatibility for the HTTP leg.
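The SQL-based reconstruction claimed above can be sketched as follows. This is a minimal illustration, assuming the persisted header columns are named `correlation_id` and `causation_id` (hypothetical names inferred from the headers; the real schema may differ), using an in-memory SQLite stand-in for Postgres.

```python
import sqlite3

# Stand-in schema: only the columns needed for flow reconstruction.
# Table names come from the assessment; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox_events  (id TEXT, correlation_id TEXT, causation_id TEXT, event_type TEXT);
CREATE TABLE inbox_messages (id TEXT, correlation_id TEXT, causation_id TEXT, event_type TEXT);
""")
conn.executemany("INSERT INTO outbox_events VALUES (?,?,?,?)", [
    ("e1", "c-42", None, "OrderPlaced"),
    ("e2", "c-42", "e1", "PaymentRequested"),
])
conn.executemany("INSERT INTO inbox_messages VALUES (?,?,?,?)", [
    ("e1", "c-42", None, "OrderPlaced"),
])

def reconstruct_flow(conn, correlation_id):
    """Return every persisted hop for one request, across both tables."""
    return conn.execute("""
        SELECT 'outbox' AS side, id, causation_id, event_type
          FROM outbox_events  WHERE correlation_id = ?
        UNION ALL
        SELECT 'inbox', id, causation_id, event_type
          FROM inbox_messages WHERE correlation_id = ?
    """, (correlation_id, correlation_id)).fetchall()

flow = reconstruct_flow(conn, "c-42")
print(len(flow))  # 3 persisted hops for correlation c-42
```

Walking `causation_id` links within the result then yields the causal chain for the request.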
2. Can you distinguish cause vs symptom during failure?
- Answer: Yes
- Evidence:
- Worker Latency vs Queue Lag: the `handler_latency_seconds` histogram distinguishes slow downstream services (cause) from high queue depth (symptom).
- Transient vs Persistent: the `retry_distribution` metrics in `metrics.rs` allow distinguishing between a massive outage (all retries failing) and flaky dependencies (some retries succeeding).
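The transient-vs-persistent distinction can be sketched as a simple classifier over retry outcomes. This is illustrative only: the input shape and the "any success means flaky" rule are assumptions, not the worker's actual logic.

```python
def classify_failure(retry_outcomes):
    """Classify a failure wave from retry outcomes.

    retry_outcomes: list of bools, True = the retry eventually succeeded.
    (Hypothetical input shape; the real data lives in retry_distribution.)
    """
    if not retry_outcomes:
        return "healthy"
    success_rate = sum(retry_outcomes) / len(retry_outcomes)
    if success_rate == 0.0:
        return "persistent outage"   # all retries failing -> dependency is down hard
    if success_rate < 1.0:
        return "flaky dependency"    # some retries succeed -> transient faults
    return "healthy"

print(classify_failure([False, False, False]))  # persistent outage
print(classify_failure([True, False, True]))    # flaky dependency
```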
3. Do you have RED or USE metrics for all critical components?
- Answer: Partial (Missing Postgres Pool)
- Evidence:
- ✅ Gateway/Services: Covered by `FastAPIInstrumentor` (Rate, Errors, Duration).
- ✅ Worker: Covered by `metrics.rs` (`inbox_processed_total`, `inbox_dlq_total`, `handler_latency_seconds`).
- ✅ NATS: Queue depth is observable via `BacklogStats` (an application-side view of lag).
- ❌ Postgres: Application logic monitors table rows (backlog), but connection pool saturation (active/idle connections) is NOT currently exported by `sea-mq-worker`. This is a blind spot for "pool exhaustion" scenarios.
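Closing the pool blind spot would mean exporting three gauges. The sketch below shows the shape of that export; the `size()` / `num_idle()` pool interface and the metric names are assumptions (modeled loosely on sqlx-style pool APIs), not the worker's actual code.

```python
class PoolGauges:
    """Gauges for connection-pool saturation (the identified gap).

    `pool` is any object exposing size() and num_idle(); these method
    names are hypothetical stand-ins for the real pool API.
    """
    def __init__(self, pool):
        self.pool = pool

    def collect(self):
        size, idle = self.pool.size(), self.pool.num_idle()
        return {
            "pg_pool_connections_total": size,
            "pg_pool_connections_idle": idle,
            "pg_pool_connections_active": size - idle,
        }

class FakePool:
    """Test double standing in for a real Postgres pool."""
    def size(self): return 10
    def num_idle(self): return 3

print(PoolGauges(FakePool()).collect()["pg_pool_connections_active"])  # 7
```

Under a connection-limit chaos experiment, `active` pinned at `total` with `idle` at 0 would be direct evidence of pool exhaustion, rather than the proxy signals noted in the Decision Summary.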
4. Can chaos experiments be conclusively validated?
- Answer: Yes
- Evidence:
- Latency Injection: Verifiable via a right-shift in `handler_latency_seconds` and an increase in `inbox_metrics -> oldest_at` (lag).
- Fault Injection: Verifiable via `inbox_dlq_total` and the `outbox_events -> publish_error` column.
- The `BacklogStats` struct specifically exposes `count` and `oldest_at`, which act as a direct "System Stability" gauge.
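The "right-shift" check can be made concrete by comparing latency samples from before and during the experiment. The 1.5x threshold and the sample data below are arbitrary assumptions for illustration.

```python
from statistics import median

def latency_shifted(baseline, under_chaos, factor=1.5):
    """Return True if the latency distribution shifted right under chaos.

    A coarse check: the chaos-period median must exceed the baseline
    median by `factor` (an illustrative threshold, not a standard).
    """
    return median(under_chaos) >= factor * median(baseline)

baseline    = [0.02, 0.03, 0.025, 0.04]   # seconds, pre-experiment
under_chaos = [0.20, 0.35, 0.18, 0.50]    # seconds, with injected delay
print(latency_shifted(baseline, under_chaos))  # True
```

In practice the same comparison would be run against `handler_latency_seconds` quantiles rather than raw samples.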
5. Is observability passive under stress?
- Answer: Yes
- Evidence:
- Counters/Histograms: `sea-mq-worker` uses the `prometheus` crate with atomic counters and static metric names.
- Cardinality Safety: `event_type` is used as a label, and it is bounded by the schema (low cardinality).
- Logging: `tracing-subscriber` is configured to `info` level by default; JSON encoding is efficient.
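The cardinality-safety argument rests on `event_type` being schema-bounded. A guard that enforces this at the metrics boundary can be sketched as below; the event-type set is illustrative, not the real schema.

```python
# Illustrative bounded set; the real values come from the event schema.
KNOWN_EVENT_TYPES = {"OrderPlaced", "PaymentRequested", "ShipmentCreated"}

def safe_label(event_type):
    """Collapse any out-of-schema value so it cannot mint a new
    Prometheus series (an unbounded label would explode cardinality)."""
    return event_type if event_type in KNOWN_EVENT_TYPES else "unknown"

print(safe_label("OrderPlaced"))  # OrderPlaced
print(safe_label("user-9f3a2b"))  # unknown  (unbounded value collapsed)
```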
6. Can you answer “did we recover, and how fast?”
- Answer: Yes
- Evidence:
- The `oldest_at` metric in both the Outbox and Inbox metrics endpoints provides a precise timestamp of the most-lagging message.
- Recovery is defined as `oldest_at` catching up to `now()` and `count` returning to 0.
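That recovery definition translates directly into a time-to-recover computation over backlog samples. A minimal sketch, with illustrative sample data:

```python
from datetime import datetime, timedelta

def time_to_recover(samples, fault_start):
    """Return how long after fault_start the backlog first returned to 0.

    samples: list of (timestamp, count) pairs, as read from the
    backlog metrics endpoint. Returns None if not yet recovered.
    """
    for ts, count in samples:
        if ts >= fault_start and count == 0:
            return ts - fault_start
    return None

t0 = datetime(2024, 1, 1, 12, 0, 0)
samples = [
    (t0 + timedelta(seconds=10), 140),  # backlog peaks after the fault
    (t0 + timedelta(seconds=40), 35),   # draining
    (t0 + timedelta(seconds=70), 0),    # count back to 0 -> recovered
]
print(time_to_recover(samples, t0))  # 0:01:10
```

The same scan over `oldest_at` (instead of `count`) confirms the lag side of the definition.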
Decision Summary
- Status: Ready for Chaos Testing
- Gap Identified: Postgres Connection Pool metrics.
- Mitigation: During chaos testing of Postgres (e.g., connection limits), the application may hang without a specific "pool exhausted" metric, but readiness probe failures or timeout errors in logs should provide sufficient proxy evidence for now.
- Action: Proceed with tool selection.