ADR-029: Observability Stack Architecture

Status: Accepted
Version: 1.0
Date: 2025-12-27
Supersedes: N/A
Related ADRs: ADR-028, ADR-022
Related PRDs: PRD-010


Context

Enterprise AI systems require comprehensive observability across metrics, logs, and traces. Traditional monitoring stacks (Prometheus + Grafana + Jaeger/ELK) create operational overhead through:

  1. Multiple data stores — Prometheus for metrics, Elasticsearch for logs, Jaeger for traces
  2. Configuration fragmentation — Separate pipelines for each signal type
  3. Compliance gaps — Manual auditing and evidence collection
  4. Cost complexity — Multiple SaaS subscriptions or self-hosted infrastructure

SEA-Forge™ requires:

Decision

Adopt a unified observability stack based on OpenTelemetry as the instrumentation standard:

Stack Components

| Component | Role | Replaces |
|---|---|---|
| OpenTelemetry | Instrumentation SDK and semantic conventions | Custom instrumentation |
| OTel Collector | Telemetry pipeline (receive, process, export) | Multiple pipelines |
| OpenObserve | Unified metrics, logs, traces backend and visualization | Prometheus + Grafana |
| Vanta | Continuous compliance automation (SOC2, ISO 27001, HIPAA) | Manual audit trails |
| Logfire | Python-native structured logging with trace correlation | ELK/Loki |

Reference Architecture

┌─────────────────────────────────────────────────────────────────┐
│  SEA-Forge™ Observability Architecture                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────┐   ┌──────────────────┐                    │
│  │ SEA™ Services     │   │ Policy Gateway   │                    │
│  │ (Python/Rust)    │   │ (Sidecar)        │                    │
│  └────────┬─────────┘   └────────┬─────────┘                    │
│           │                      │                               │
│           ▼                      ▼                               │
│  ┌──────────────────────────────────────────┐                   │
│  │     OpenTelemetry SDK (Auto + Manual)    │                   │
│  │     • Traces: HTTP, gRPC spans           │                   │
│  │     • Metrics: Counters, Gauges          │                   │
│  │     • Logs: Structured via Logfire       │                   │
│  └────────┬─────────────────────────────────┘                   │
│           │ OTLP (gRPC/HTTP)                                    │
│           ▼                                                     │
│  ┌──────────────────────────────────────────┐                   │
│  │     OTel Collector                       │                   │
│  │     • Receivers: OTLP, Prometheus        │                   │
│  │     • Processors: PII scrubbing, batch   │                   │
│  │     • Exporters: Multi-destination       │                   │
│  └────────┬──────────┬──────────┬───────────┘                   │
│           │          │          │                               │
│           ▼          ▼          ▼                               │
│  ┌────────────┐ ┌─────────┐ ┌─────────┐                         │
│  │ OpenObserve│ │  Vanta  │ │(Archive)│                         │
│  │ Metrics    │ │ Evidence│ │ S3/Blob │                         │
│  │ Logs       │ │ Auto    │ └─────────┘                         │
│  │ Traces     │ └─────────┘                                     │
│  │ Dashboards │                                                 │
│  └────────────┘                                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
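
The first hop in the diagram (SDK to Collector over OTLP) is usually configured through the environment variables defined by the OpenTelemetry specification. The sketch below builds such an environment for a service; the helper name `otlp_env` and the collector hostname are hypothetical, and port 4317 is the standard OTLP gRPC port.

```python
def otlp_env(service_name: str, collector_host: str,
             resource_attributes: dict[str, str]) -> dict[str, str]:
    """Build the standard OTel SDK environment variables for OTLP/gRPC export.

    `otlp_env` and `collector_host` are illustrative names, not part of any SDK.
    Values containing commas or '=' would need percent-encoding; this sketch
    assumes plain values.
    """
    attrs = ",".join(f"{k}={v}" for k, v in sorted(resource_attributes.items()))
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{collector_host}:4317",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        "OTEL_RESOURCE_ATTRIBUTES": attrs,
    }


env = otlp_env("sea-forge", "otel-collector", {"sea.platform": "sea-forge"})
```

Setting these variables in the process environment lets the SDK's auto-instrumentation pick up the exporter target without code changes.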

Rationale

  1. OpenTelemetry is a CNCF project and the de facto open standard for telemetry instrumentation, which keeps instrumentation vendor-neutral
  2. OTel Collector provides processing pipelines (PII scrubbing, semantic enrichment) before export
  3. OpenObserve unifies all three signals into one backend, reducing operational complexity
  4. Vanta automates compliance evidence collection directly from telemetry
  5. Logfire provides Python developers with structured logging that auto-correlates with traces

Why Not Traditional Stack?

| Concern | Prometheus/Grafana/Jaeger | OpenTelemetry + OpenObserve + Vanta |
|---|---|---|
| Signal unification | 3 separate systems | 1 unified backend |
| Compliance evidence | Manual export | Automated via Vanta integration |
| Python DX | Generic client libraries | Logfire native experience |
| Cost | $$ per component (or self-host) | OpenObserve open-source + Vanta SaaS |
| Semantic context | Labels only | OTel semantic conventions + SEA™ envelope |

Constraints (MUST/MUST NOT)

Critical for generator choices. These constraints flow directly into manifests and SEA-DSL.

Isomorphic Guarantees

Defines structure-preserving mappings from this ADR to implementation.

| Spec Concept | Implementation Target | Mapping Rule |
|---|---|---|
| Metric definition | OTel Meter + Instrument | 1:1; metric name == spec metric |
| Trace span | OTel Tracer + Span | 1:1; span name == operation name |
| Structured log | Logfire logger | 1:1; log fields == spec fields |
| Semantic context | OTel Resource Attributes | 1:1; sea.* attribute prefix |
| Governance metric | OTel Gauge/Counter | 1:1; semantic envelope preserved |
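
The mapping rules above lend themselves to a mechanical check. The following is a minimal sketch, assuming spec names and emitted telemetry names are available as plain sets of strings; the function names `to_resource_attributes` and `mapping_violations` are hypothetical and not part of OpenTelemetry or SEA-DSL.

```python
SEA_PREFIX = "sea."  # prefix taken from the "Semantic context" mapping rule


def to_resource_attributes(context: dict[str, str]) -> dict[str, str]:
    """Prefix every semantic-context key with `sea.`; already-prefixed keys are kept."""
    return {
        key if key.startswith(SEA_PREFIX) else SEA_PREFIX + key: value
        for key, value in context.items()
    }


def mapping_violations(spec_names: set[str], emitted_names: set[str]) -> list[str]:
    """Report names that break the 1:1 rule between the spec and emitted telemetry."""
    missing = [f"missing in telemetry: {n}" for n in sorted(spec_names - emitted_names)]
    extra = [f"not in spec: {n}" for n in sorted(emitted_names - spec_names)]
    return missing + extra
```

An empty list from `mapping_violations` means every spec metric (or span) name has exactly one counterpart in the emitted telemetry and nothing is emitted outside the spec.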

System Invariants

Non-negotiable truths that must hold across the system.

| INV-ID | Invariant | Type | Enforcement |
|---|---|---|---|
| INV-OBS-01 | All telemetry must use OTLP protocol | System | OTel SDK configuration |
| INV-OBS-02 | PII must be scrubbed before export | Security | Collector processor |
| INV-OBS-03 | Semantic context must be preserved in all signals | System | Resource attribute schema |
| INV-OBS-04 | Compliance evidence must flow to Vanta automatically | Process | Collector exporter config |
| INV-OBS-05 | Logs must correlate with traces via trace_id/span_id | System | Logfire auto-correlation |
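
Two of these invariants (INV-OBS-02 and INV-OBS-04) are enforced through Collector configuration, so they can be linted before deployment. The sketch below assumes the Collector config has already been parsed into a Python dict, that the scrubbing processor is named `attributes`, and that the compliance exporter is named `otlp/vanta`; `check_pipelines` is a hypothetical helper.

```python
PII_PROCESSOR = "attributes"   # assumed name of the PII-scrubbing processor
VANTA_EXPORTER = "otlp/vanta"  # assumed name of the compliance exporter


def check_pipelines(config: dict) -> list[str]:
    """Return invariant violations found in a parsed Collector config."""
    pipelines = config.get("service", {}).get("pipelines", {})
    problems = []
    # INV-OBS-02: every pipeline must scrub PII before it exports anything.
    for name, pipeline in pipelines.items():
        if PII_PROCESSOR not in pipeline.get("processors", []):
            problems.append(f"INV-OBS-02: pipeline '{name}' exports without PII scrubbing")
    # INV-OBS-04: at least one pipeline must deliver evidence to Vanta.
    if not any(VANTA_EXPORTER in p.get("exporters", []) for p in pipelines.values()):
        problems.append("INV-OBS-04: no pipeline exports to Vanta")
    return problems
```

Running such a check in CI would turn the "Enforcement" column into a failing build rather than a review comment.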

Configuration Example

OTel Collector Pipeline

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  # PII Scrubbing
  attributes:
    actions:
      - key: user.email
        action: hash
      - key: user.ip_address
        action: delete

  # Semantic Context Enrichment
  resource:
    attributes:
      - key: sea.platform
        value: sea-forge
        action: insert

exporters:
  otlp/openobserve:
    endpoint: https://api.openobserve.ai
    headers:
      Authorization: ${OPENOBSERVE_TOKEN}

  otlp/vanta:
    endpoint: https://evidence.vanta.com
    headers:
      Authorization: ${VANTA_API_KEY}

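service:
  pipelines:
    # Scrub and enrich first; batch last so only processed telemetry is batched for export
    traces:
      receivers: [otlp]
      processors: [attributes, resource, batch]
      exporters: [otlp/openobserve]

    # Metrics also pass through PII scrubbing (INV-OBS-02) before reaching any exporter
    metrics:
      receivers: [otlp]
      processors: [attributes, resource, batch]
      exporters: [otlp/openobserve, otlp/vanta]

    logs:
      receivers: [otlp]
      processors: [attributes, resource, batch]
      exporters: [otlp/openobserve]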
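
To make the effect of the two PII actions concrete, here is a small Python model of what the `hash` and `delete` actions do to an attribute map. It is an illustration only, not the Collector's implementation: `scrub` is a hypothetical function, and the use of SHA-256 for hashing is an assumption about the processor's behavior.

```python
import hashlib

# Same rules as the `attributes` processor in the config above.
ACTIONS = [
    {"key": "user.email", "action": "hash"},
    {"key": "user.ip_address", "action": "delete"},
]


def scrub(attributes: dict[str, str], actions: list[dict[str, str]]) -> dict[str, str]:
    """Apply hash/delete actions to a copy of the attributes; other keys pass through."""
    scrubbed = dict(attributes)
    for rule in actions:
        key, action = rule["key"], rule["action"]
        if key not in scrubbed:
            continue  # nothing to do when the attribute is absent
        if action == "delete":
            del scrubbed[key]
        elif action == "hash":
            # Assumption: SHA-256 hex digest stands in for the processor's hash.
            scrubbed[key] = hashlib.sha256(scrubbed[key].encode("utf-8")).hexdigest()
        else:
            raise ValueError(f"unsupported action: {action}")
    return scrubbed
```

Hashing keeps the attribute usable as a join key across signals without exposing the raw value, while deletion removes it entirely.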

Logfire Python Integration

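# observability.py
import os

import logfire

# Configure Logfire; its log records are correlated with the active OpenTelemetry trace
logfire.configure(
    # Read the token from the environment; Python does not expand a "${...}" string literal
    token=os.environ["LOGFIRE_TOKEN"],
    service_name="sea-forge",
    send_to_logfire=True,
    console=logfire.ConsoleOptions(verbose=True),
)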

# Structured logging with semantic context
logfire.info(
    "Policy evaluated",
    sea_domain="governance",
    sea_concept="PolicyEvaluation",
    decision="allowed",
    latency_ms=12.5,
)
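
INV-OBS-05 can be spot-checked in tests by asserting that exported log records carry well-formed identifiers. The sketch below assumes a log record is available as a dict with `trace_id` and `span_id` fields in lowercase hex (32 and 16 characters, as in W3C Trace Context); the record shape and the helper `is_correlated` are hypothetical and not part of Logfire's API.

```python
import re

_TRACE_ID = re.compile(r"[0-9a-f]{32}")  # 16 bytes, lowercase hex
_SPAN_ID = re.compile(r"[0-9a-f]{16}")   # 8 bytes, lowercase hex


def is_correlated(record: dict) -> bool:
    """True when the record carries valid, non-zero trace_id and span_id values."""
    trace_id = str(record.get("trace_id", ""))
    span_id = str(record.get("span_id", ""))
    return (
        _TRACE_ID.fullmatch(trace_id) is not None
        and _SPAN_ID.fullmatch(span_id) is not None
        and set(trace_id) != {"0"}  # all-zero ids are invalid
        and set(span_id) != {"0"}
    )
```

A log emitted outside any span would fail this check, which is a useful signal that a code path is missing instrumentation.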

Bounded Contexts Impacted

Consequences

Benefits

Trade-offs