Observability Characterization
1. Can you reconstruct a single request end-to-end?
- Answer: Yes
- Evidence:
- `sea-mq-worker` explicitly reads the `Correlation-Id` and `Causation-Id` headers from NATS messages and propagates them into `tracing::span!` (in `apps/sea-mq-worker/src/inbox_consumer.rs`).
- These values are persisted in the `outbox_events` and `inbox_messages` tables, allowing SQL-based reconstruction of async flows.
- Python services use `FastAPIInstrumentor`, ensuring W3C trace context compatibility for the HTTP leg.
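The SQL-based reconstruction claimed above can be sketched as follows. This is a minimal illustration, assuming the persisted header columns are named `correlation_id` and `causation_id` (hypothetical names inferred from the headers; the real schema may differ), using an in-memory SQLite stand-in for Postgres.

```python
import sqlite3

# Stand-in schema: only the columns needed for flow reconstruction.
# Table names come from the assessment; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox_events  (id TEXT, correlation_id TEXT, causation_id TEXT, event_type TEXT);
CREATE TABLE inbox_messages (id TEXT, correlation_id TEXT, causation_id TEXT, event_type TEXT);
""")
conn.executemany("INSERT INTO outbox_events VALUES (?,?,?,?)", [
    ("e1", "c-42", None, "OrderPlaced"),
    ("e2", "c-42", "e1", "PaymentRequested"),
])
conn.executemany("INSERT INTO inbox_messages VALUES (?,?,?,?)", [
    ("e1", "c-42", None, "OrderPlaced"),
])

def reconstruct_flow(conn, correlation_id):
    """Return every persisted hop for one request, across both tables."""
    return conn.execute("""
        SELECT 'outbox' AS side, id, causation_id, event_type
          FROM outbox_events  WHERE correlation_id = ?
        UNION ALL
        SELECT 'inbox', id, causation_id, event_type
          FROM inbox_messages WHERE correlation_id = ?
    """, (correlation_id, correlation_id)).fetchall()

flow = reconstruct_flow(conn, "c-42")
print(len(flow))  # 3 persisted hops for correlation c-42
```

Walking `causation_id` links within the result then yields the causal chain for the request.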
2. Can you distinguish cause vs symptom during failure?
- Answer: Yes
- Evidence:
- Worker Latency vs Queue Lag: the `handler_latency_seconds` histogram distinguishes slow downstream services (cause) from high queue depth (symptom).
- Transient vs Persistent: the `retry_distribution` metrics in `metrics.rs` allow distinguishing between a massive outage (all retries failing) and flaky dependencies (some retries succeeding).
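The transient-vs-persistent distinction can be sketched as a simple classifier over retry outcomes. This is illustrative only: the input shape and the "any success means flaky" rule are assumptions, not the worker's actual logic.

```python
def classify_failure(retry_outcomes):
    """Classify a failure wave from retry outcomes.

    retry_outcomes: list of bools, True = the retry eventually succeeded.
    (Hypothetical input shape; the real data lives in retry_distribution.)
    """
    if not retry_outcomes:
        return "healthy"
    success_rate = sum(retry_outcomes) / len(retry_outcomes)
    if success_rate == 0.0:
        return "persistent outage"   # all retries failing -> dependency is down hard
    if success_rate < 1.0:
        return "flaky dependency"    # some retries succeed -> transient faults
    return "healthy"

print(classify_failure([False, False, False]))  # persistent outage
print(classify_failure([True, False, True]))    # flaky dependency
```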
3. Do you have RED or USE metrics for all critical components?
- Answer: Partial (Missing Postgres Pool)
- Evidence:
- ✅ Gateway/Services: Covered by `FastAPIInstrumentor` (Rate, Errors, Duration).
- ✅ Worker: Covered by `metrics.rs` (`inbox_processed_total`, `inbox_dlq_total`, `handler_latency_seconds`).
- ✅ NATS: Queue depth is observable via `BacklogStats` (an application-side view of lag).
- ❌ Postgres: Application logic monitors table rows (backlog), but connection pool saturation (active/idle connections) is NOT currently exported by `sea-mq-worker`. This is a blind spot for "pool exhaustion" scenarios.
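Closing the pool blind spot would mean exporting three gauges. The sketch below shows the shape of that export; the `size()` / `num_idle()` pool interface and the metric names are assumptions (modeled loosely on sqlx-style pool APIs), not the worker's actual code.

```python
class PoolGauges:
    """Gauges for connection-pool saturation (the identified gap).

    `pool` is any object exposing size() and num_idle(); these method
    names are hypothetical stand-ins for the real pool API.
    """
    def __init__(self, pool):
        self.pool = pool

    def collect(self):
        size, idle = self.pool.size(), self.pool.num_idle()
        return {
            "pg_pool_connections_total": size,
            "pg_pool_connections_idle": idle,
            "pg_pool_connections_active": size - idle,
        }

class FakePool:
    """Test double standing in for a real Postgres pool."""
    def size(self): return 10
    def num_idle(self): return 3

print(PoolGauges(FakePool()).collect()["pg_pool_connections_active"])  # 7
```

Under a connection-limit chaos experiment, `active` pinned at `total` with `idle` at 0 would be direct evidence of pool exhaustion, rather than the proxy signals noted in the Decision Summary.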
4. Can chaos experiments be conclusively validated?
- Answer: Yes
- Evidence:
- Latency Injection: Verifiable via a right-shift in `handler_latency_seconds` and an increase in `inbox_metrics -> oldest_at` (lag).
- Fault Injection: Verifiable via `inbox_dlq_total` and the `outbox_events -> publish_error` column.
- The `BacklogStats` struct specifically exposes `count` and `oldest_at`, which act as a direct "System Stability" gauge.
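The "right-shift" check can be made concrete by comparing latency samples from before and during the experiment. The 1.5x threshold and the sample data below are arbitrary assumptions for illustration.

```python
from statistics import median

def latency_shifted(baseline, under_chaos, factor=1.5):
    """Return True if the latency distribution shifted right under chaos.

    A coarse check: the chaos-period median must exceed the baseline
    median by `factor` (an illustrative threshold, not a standard).
    """
    return median(under_chaos) >= factor * median(baseline)

baseline    = [0.02, 0.03, 0.025, 0.04]   # seconds, pre-experiment
under_chaos = [0.20, 0.35, 0.18, 0.50]    # seconds, with injected delay
print(latency_shifted(baseline, under_chaos))  # True
```

In practice the same comparison would be run against `handler_latency_seconds` quantiles rather than raw samples.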
5. Is observability passive under stress?
- Answer: Yes
- Evidence:
- Counters/Histograms: `sea-mq-worker` uses the `prometheus` crate with atomic counters and static metric names.
- Cardinality Safety: `event_type` is used as a label, and it is bounded by the schema (low cardinality).
- Logging: `tracing-subscriber` is configured to `info` level by default; JSON encoding is efficient.
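The cardinality-safety argument rests on `event_type` being schema-bounded. A guard that enforces this at the metrics boundary can be sketched as below; the event-type set is illustrative, not the real schema.

```python
# Illustrative bounded set; the real values come from the event schema.
KNOWN_EVENT_TYPES = {"OrderPlaced", "PaymentRequested", "ShipmentCreated"}

def safe_label(event_type):
    """Collapse any out-of-schema value so it cannot mint a new
    Prometheus series (an unbounded label would explode cardinality)."""
    return event_type if event_type in KNOWN_EVENT_TYPES else "unknown"

print(safe_label("OrderPlaced"))  # OrderPlaced
print(safe_label("user-9f3a2b"))  # unknown  (unbounded value collapsed)
```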
6. Can you answer “did we recover, and how fast?”
- Answer: Yes
- Evidence:
- The `oldest_at` metric in both the Outbox and Inbox metrics endpoints provides a precise timestamp of the most-lagging message.
- Recovery is defined as `oldest_at` catching up to `now()` and `count` returning to 0.
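That recovery definition translates directly into a time-to-recover computation over backlog samples. A minimal sketch, with illustrative sample data:

```python
from datetime import datetime, timedelta

def time_to_recover(samples, fault_start):
    """Return how long after fault_start the backlog first returned to 0.

    samples: list of (timestamp, count) pairs, as read from the
    backlog metrics endpoint. Returns None if not yet recovered.
    """
    for ts, count in samples:
        if ts >= fault_start and count == 0:
            return ts - fault_start
    return None

t0 = datetime(2024, 1, 1, 12, 0, 0)
samples = [
    (t0 + timedelta(seconds=10), 140),  # backlog peaks after the fault
    (t0 + timedelta(seconds=40), 35),   # draining
    (t0 + timedelta(seconds=70), 0),    # count back to 0 -> recovered
]
print(time_to_recover(samples, t0))  # 0:01:10
```

The same scan over `oldest_at` (instead of `count`) confirms the lag side of the definition.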
Decision Summary
- Status: Ready for Chaos Testing
- Gap Identified: Postgres Connection Pool metrics.
- Mitigation: During chaos testing of Postgres (e.g., connection limits), the application may hang without a specific "pool exhausted" metric, but readiness probe failures or timeout errors in logs should provide sufficient proxy evidence for now.
- Action: Proceed with tool selection.