System Characterization Report
1. Runtime & Deployment Model
- Primary runtime(s):
- Rust (Async/Tokio): Used for the sea-mq-worker (a robust event dispatcher and reliability layer).
- Python (3.11+ / FastAPI): Used for domain services (knowledge-graph, policy-gateway, llm-provider, embedding, a2a).
- Node.js (Vite): Used for the workbench frontend application.
- Deployment targets inferred: Docker (containers defined in infra/docker/docker-compose.dev.yml and Dockerfile in apps/services).
- Process model:
- Worker Pool: sea-mq-worker runs as a scalable consumer group processing NATS events.
- Service Mesh/Webhooks: Python services operate as independent HTTP servers (FastAPI) receiving work via HTTP calls from the worker (push-based dispatch) and direct synchronous requests.
2. Request & Workload Profile
- Entry points:
- HTTP (Synchronous): policy-gateway acts as the primary API gateway for user requests.
- NATS (Asynchronous): sea-mq-worker consumes events from JetStream.
- Internal Webhooks: sea-mq-worker dispatches events to services via plain HTTP POSTs (e.g., http://localhost:8091/api/events/...).
- Expected concurrency model:
- Services: Async I/O (Python async/await with uvicorn/FastAPI).
- Worker: Actor-like async tasks (Tokio) handling Inbox, Outbox, and DLQ streams concurrently.
- Latency sensitivity:
- Gateway: High (User interactive chat/response).
- Worker Dispatch: Medium (Eventual consistency allowed).
- Throughput sensitivity: Medium/High (Architecture is designed for high-volume event processing via NATS).
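The push-based dispatch above can be sketched as a plain HTTP POST of an event envelope to a service endpoint. This is a minimal, self-contained illustration, not the real implementation: the real worker is Rust/Tokio, the endpoint path (/api/events/demo) and envelope fields (id, type, payload) are assumptions, and Python's stdlib http.server stands in for a FastAPI service.

```python
"""Sketch of the worker's push-based webhook dispatch (illustrative only)."""
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # events the stand-in "service" has accepted


class EventHandler(BaseHTTPRequestHandler):
    """Stand-in for a domain service's event endpoint."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        received.append(json.loads(body))
        self.send_response(200)  # a 2xx tells the worker the dispatch succeeded
        self.end_headers()
        self.wfile.write(b"{}")

    def log_message(self, *args):  # silence default request logging
        pass


def dispatch(url: str, event: dict) -> int:
    """What the worker does per event: POST the JSON envelope, return status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


server = HTTPServer(("127.0.0.1", 0), EventHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/api/events/demo"
status = dispatch(url, {"id": "evt-1", "type": "demo.created", "payload": {}})
server.shutdown()
```

A non-2xx status (or a timeout) is what would feed the retry path described in the failure-surface and resilience sections.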
3. Steady-State Definition (Inferred)
- Critical user paths:
- User -> workbench -> policy-gateway -> llm-provider (LLM Inference).
- Background: Service -> DB (Outbox) -> sea-mq-worker -> NATS -> sea-mq-worker -> Service (Webhook) -> DB/KG.
- Read vs write dominance:
- Write-Heavy: Event streams, audit logs, and message passing.
- Read-Heavy: Knowledge Graph (SPARQL) queries and policy checks.
- Expected steady-state behaviors:
- Low queue depth in NATS JetStream.
- Sub-second processing time for the sea-mq-worker dispatch loop.
- Healthy HTTP 200 responses from internal service webhooks.
- Implicit SLO assumptions:
- Event processing retries are capped (default 3 retries) before moving to DLQ.
- Database connection pool is limited (max 10 connections/worker).
4. State & Data Characteristics
- Stateful vs stateless components:
- Stateful: postgres (Relational Data, Outbox/Inbox tables), redis (Cache), nats (JetStream Persistence).
- Stateless: Python Services and Rust Worker (Application logic is stateless; state is externalized).
- Persistence layers:
- PostgreSQL: Primary application database.
- NATS JetStream: Durable message log.
- Oxigraph: Embedded Knowledge Graph storage (likely file-backed or in-memory within the knowledge-graph service).
- Idempotency guarantees:
- Outbox Pattern: Ensures “At-least-once” delivery from DB to NATS.
- Inbox Pattern: De-duplicates messages, using the event id as the Nats-Msg-Id.
- Transaction boundaries:
- Strong consistency within a single service’s DB write (Atomic “State Change + Outbox Event” insert).
- Eventual consistency across services via NATS.
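The transaction boundary and idempotency guarantees above can be sketched together: the Outbox write shares a transaction with the state change, and the Inbox relies on a unique constraint to absorb at-least-once redelivery. sqlite3 stands in for PostgreSQL here, and the table and column names (orders, outbox_events, inbox_events) are illustrative, not the project's actual schema.

```python
"""Sketch of the transactional Outbox + Inbox patterns (illustrative schema)."""
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox_events (id TEXT PRIMARY KEY, payload TEXT);
    CREATE TABLE inbox_events (msg_id TEXT PRIMARY KEY);  -- dedup table
""")


def create_order(order_id: str) -> None:
    """Atomic 'state change + outbox event': both rows commit or neither."""
    with db:  # one transaction
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "new"))
        db.execute(
            "INSERT INTO outbox_events VALUES (?, ?)",
            (f"evt-{order_id}", json.dumps({"order": order_id})),
        )


def handle_once(msg_id: str) -> bool:
    """Inbox pattern: process only if this Nats-Msg-Id has not been seen."""
    try:
        with db:
            db.execute("INSERT INTO inbox_events VALUES (?)", (msg_id,))
    except sqlite3.IntegrityError:
        return False  # duplicate delivery (at-least-once) safely ignored
    return True


create_order("o1")
first = handle_once("evt-o1")   # unseen id: processed
second = handle_once("evt-o1")  # redelivery: deduplicated
```

The key property is that a relay crashing between the DB commit and the NATS publish can only cause duplicate publishes, never lost events, and the Inbox side makes those duplicates harmless.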
5. Failure Surface Mapping
- Network dependencies:
- Internal HTTP: sea-mq-worker -> Services. Failure causes event processing retries.
- NATS Connection: Critical. Failure stops all async communication.
- External APIs:
- LLM Providers (OpenAI, Anthropic): High impact. Failure blocks core “Chat” functionality.
- Mitigation: Likely handled via llm-provider retry logic (inferred).
- Message brokers / queues:
- NATS JetStream: If storage fills up or a consumer lags, backpressure occurs; sea-mq-worker handles this with an explicit ack_wait (30s).
- Storage systems:
- PostgreSQL: Single Point of Failure (SPOF) for strict consistency.
- Blast Radius: Full system outage if DB is unreachable (Worker panic on startup/connection loss).
6. Resilience & Safety Mechanisms
- Timeouts:
- NATS Ack Wait: Explicit (30s) in apps/sea-mq-worker.
- DB Connection: Explicit (Connect timeout).
- Retries:
- Worker Dispatch: Explicit (INBOX_MAX_RETRIES env var, default 3).
- DLQ: Explicit (DLQ_MAX_ATTEMPTS, default 6).
- Circuit breakers: Not implemented.
- Bulkheads:
- Postgres Pool: Explicit (max_connections(10)).
- Worker Batch Size: Explicit (BATCH_SIZE, default 100).
- Backpressure handling:
- NATS Consumer: Pull-based consumer (batches) prevents overwhelming the worker.
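The capped-retry-then-DLQ behavior described above can be sketched in a few lines. The real worker is async Rust/Tokio; this synchronous Python version only illustrates the control flow, and the function names and DLQ record shape are assumptions (only the INBOX_MAX_RETRIES variable and its default of 3 come from the report).

```python
"""Sketch of the capped-retry-then-DLQ flow (control flow only)."""
import os

# Matches the report's INBOX_MAX_RETRIES env var, default 3
INBOX_MAX_RETRIES = int(os.environ.get("INBOX_MAX_RETRIES", "3"))

dlq: list[dict] = []  # stand-in for the DLQ stream


def process_with_retries(event: dict, handler) -> bool:
    """Try the handler up to the cap; park the event in the DLQ on exhaustion."""
    last_error = ""
    for attempt in range(1, INBOX_MAX_RETRIES + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            last_error = str(exc)  # keep the most recent failure for triage
    dlq.append({
        "event": event,
        "attempts": INBOX_MAX_RETRIES,
        "last_error": last_error,
    })
    return False


def always_fails(event):
    raise RuntimeError("downstream webhook returned 500")


ok = process_with_retries({"id": "evt-1"}, always_fails)
# ok is False; the event now sits in the DLQ with its attempt count recorded
```

Capping retries before the DLQ hand-off is what keeps a poison message from blocking the dispatch loop indefinitely.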
7. Observability Maturity
- Metrics present?: Yes. sea-mq-worker runs a metrics server (Prometheus format) on port 9090; sea-otel-collector runs on 8888.
- Logging structure: Structured JSON logging (tracing-subscriber in Rust, logging in Python).
- Tracing support: Partial. FastAPIInstrumentor (Python) and otel-collector (infra) are present, and sea-mq-worker has correlation_id and causation_id columns in SQL, supporting distributed tracing.
- Health checks: Deep usage. docker-compose.yml defines specific health checks (pg_isready, curl, redis-cli) for all containers.
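Structured JSON logging that threads the correlation_id/causation_id pair through every hop is the glue between the log and trace signals above. A minimal sketch using Python's stdlib logging: the field names mirror the worker's SQL columns, but the formatter itself is illustrative, not the project's actual logging setup.

```python
"""Sketch of JSON log lines carrying correlation/causation ids (illustrative)."""
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including tracing ids if present."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            # ids threaded through every hop allow a flow to be reconstructed
            "correlation_id": getattr(record, "correlation_id", None),
            "causation_id": getattr(record, "causation_id", None),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("sea-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

# extra= attaches the ids to the LogRecord, as request middleware would
log.info("event dispatched",
         extra={"correlation_id": "c-123", "causation_id": "evt-1"})

# Formatting a record directly shows the emitted JSON shape
record = log.makeRecord("sea-demo", logging.INFO, __file__, 0,
                        "event dispatched", (), None,
                        extra={"correlation_id": "c-123",
                               "causation_id": "evt-1"})
line = JsonFormatter().format(record)
```

Because every line is a flat JSON object, the otel-collector (or any log pipeline) can index on correlation_id without regex parsing.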
8. Risk Classification
- Classification: Medium
- Justification:
- The architecture is sophisticated (Event Sourcing / CQRS-lite / Outbox pattern), which introduces complexity in debugging distributed state issues.
- Dependencies on NATS and Postgres are critical.
- The split between Rust (Transport) and Python (Logic) creates a “Distributed Monolith” risk if interfaces drift, but the explicit handler_registry mitigates this.
- Strong resilience patterns (DLQ, Retries, Transactional Outbox) reduce the risk of data loss significantly.
- Mandatory tool capabilities:
- Protocol Support: Must support NATS JetStream monitoring/injection and PostgreSQL inspection.
- Tracing: Must integrate with OpenTelemetry (OTLP).
- Async/Event-driven: Tools must handle “Fire and Forget” or “Request/Reply” via queues, not just simple HTTP stress testing.
- Disallowed tool characteristics:
- Tools that require modifying the compiled Rust binary are difficult to use; sidecar/agent-based instrumentation is preferred.
- Non-negotiable safety requirements:
- Chaos engineering must respect outbox_events integrity; manually deleting rows from the DB will break the consistency guarantees.
- CI/CD integration needs:
- Docker-compose friendly tools for easy local validation.