Observability Handbook Epic
User Journey
The Observability bounded context provides comprehensive system visibility through OpenTelemetry traces, metrics, and logs. It enables semantic envelope propagation, collector configuration, and runbooks for performance investigation, log analysis, incident response, and trace debugging across all SEA™ services. See the Metrics job below for instrumentation and collection requirements.
Jobs to be Done & EARS Requirements
Job: Metrics
User Story: As a Site Reliability Engineer, I want consistent metrics instrumentation and export, so that I can monitor system health and set actionable alerts.
EARS Requirement:
- While instrumenting services, when metrics are configured, the observability context shall:
- Counters:
- Emit counters for request_count, error_count, and retry_count
- Require labels: service.name, endpoint, status_code, and sea.domain
- Gauges:
- Emit gauges for queue_depth, memory_usage_mb, and cpu_percent
- Require labels: service.name and environment
- Histograms:
- Emit histograms for request_latency_ms and dependency_latency_ms
- Use standard buckets and include exemplars when available
- Export & Scrape:
- Export metrics via OTLP to the collector
- Default export interval 15s (configurable)
- Retention & Aggregation:
- Retain raw metrics for 7 days, downsampled aggregates for 30/90 days
- Require aggregation rollups by service and endpoint
- Alerting Thresholds:
- Define alerts for error_rate > 1% over 5 minutes and p95 latency > target
- Require alerts to include service, endpoint, and runbook link
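The alerting thresholds above can be sketched as a small evaluation function. The function names and signatures are illustrative, not part of the spec; the window counts are assumed to come from the 5-minute rollups required above.

```python
def error_rate(error_count: int, request_count: int) -> float:
    """Fraction of failed requests over the evaluation window."""
    return error_count / request_count if request_count else 0.0


def should_alert(
    error_count: int,
    request_count: int,
    p95_latency_ms: float,
    p95_target_ms: float,
    error_rate_threshold: float = 0.01,  # the >1% rule above
) -> bool:
    """Fire when the windowed error rate exceeds 1% or p95 latency misses target."""
    return (
        error_rate(error_count, request_count) > error_rate_threshold
        or p95_latency_ms > p95_target_ms
    )
```

A rule that fires would then attach the required service, endpoint, and runbook-link labels to the notification.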
User Story: As a developer, I want to instrument my Python service with OpenTelemetry tracing, so that I can visualize request flows and diagnose performance issues.
EARS Requirement:
- While instrumenting services, when the OpenTelemetry tracer is configured, the observability context shall:
- Configure TracerProvider:
- Create Resource with service.name and sea.platform attributes
- Add BatchSpanProcessor with OTLPSpanExporter
- Set exporter endpoint to localhost:4317 (or configured OTLP endpoint)
- Register provider as global tracer
- Create Spans:
- Get tracer from trace.get_tracer(name)
- Start spans with operation names and attributes
- Record exceptions in spans
- Set span status (OK, ERROR)
- Propagate Context:
- Inject trace context into outbound requests
- Extract trace context from inbound requests
- Use W3C Trace Context format
- Emit OTLP: Send spans to OTel Collector at configured endpoint
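A minimal configuration sketch of the steps above, using the OpenTelemetry Python SDK. The service name "checkout" and the `ValueError` handler are illustrative; the endpoint matches the collector's OTLP gRPC port from the collector job below.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes ride along on every span.
resource = Resource.create({"service.name": "checkout", "sea.platform": "sea"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)  # register as the global tracer provider

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("sea.domain", "orders")
    try:
        ...  # business logic
        span.set_status(trace.StatusCode.OK)
    except ValueError as exc:
        span.record_exception(exc)
        span.set_status(trace.StatusCode.ERROR)
```

For context propagation, the SDK's `opentelemetry.propagate.inject` and `extract` helpers default to the W3C Trace Context format, satisfying the propagation requirement without custom header handling.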
User Story: As a developer, I want to use Logfire for structured logging with semantic attributes, so that logs are queryable and correlated with traces.
EARS Requirement:
- While configuring logging, when Logfire is initialized, the observability context shall:
- Configure Logfire with:
service_name: Service identifier
send_to_logfire: False for local mode, True for cloud
console: Console output options with verbose level
- Emit Structured Logs:
- Log with semantic attributes (sea_domain, sea_concept, latency_ms)
- Include trace_id and span_id for correlation
- Support log levels (info, warning, error, debug)
- Query Logs:
- Filter by semantic attributes
- Correlate with traces via trace_id
- Aggregate metrics from log data
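A configuration sketch matching the requirement above, per the Logfire Python API; the service name "checkout" and the log attributes are illustrative.

```python
import logfire

logfire.configure(
    service_name="checkout",
    send_to_logfire=False,  # local mode; set True to export to Logfire cloud
    console=logfire.ConsoleOptions(verbose=True),
)

# Structured log with semantic attributes; Logfire attaches trace_id/span_id
# automatically when called inside an active span.
logfire.info(
    "order processed",
    sea_domain="orders",
    sea_concept="Order",
    latency_ms=42,
)
```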
Job: Start OTel Collector
User Story: As a platform engineer, I want to start the OTel Collector to receive telemetry from all services, so that I have centralized observability data.
EARS Requirement:
- While starting the collector, when the OTel Collector is launched via docker-compose, the observability context shall:
- Load configuration from infra/otel/otel-collector-config.yaml
- Expose receiver endpoints:
- OTLP gRPC: port 4317
- OTLP HTTP: port 4318
- Health check: port 13133
- Configure Pipeline:
- Receive OTLP traces/metrics/logs
- Process with batch and memory limiter processors
- Export to console (local) or OpenObserve (production)
- Verify Health:
- Respond to GET http://localhost:13133/ with 200 OK
- Expose zpages at http://localhost:55679/debug/tracez
- Disable zpages in production or protect it with network policies/authentication (localhost-only or admin subnet)
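A sketch of what infra/otel/otel-collector-config.yaml might contain to satisfy the requirements above. Exporter and extension names assume a recent collector distribution (the `debug` exporter replaced the older `logging` exporter); a production config would swap in an OpenObserve exporter and restrict the zpages endpoint.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:       # must run first in the pipeline
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 5s

exporters:
  debug:                # console output for local development
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: localhost:55679   # localhost-only, per the zpages protection rule

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
```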
Job: Propagate Semantic Envelope
User Story: As a service developer, I want to propagate semantic context across service boundaries, so that traces are enriched with domain concepts.
EARS Requirement:
- While processing requests, when a semantic envelope is present, the observability context shall:
- Extract Semantic Attributes:
sea.domain: Bounded context name
sea.concept: Domain concept identifier
sea.entity: Entity instance ID
sea.trace_id: Semantic trace identifier
- Inject into Spans:
- Add semantic attributes as span attributes
- Include in carrier propagation (HTTP headers, message metadata)
- Correlate Across Services:
- Maintain semantic attributes throughout trace
- Enable filtering by domain/concept in trace queries
- Support semantic trace search
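The inject/extract cycle above can be sketched with plain dictionaries standing in for HTTP headers or message metadata. The `x-sea-*` header names are hypothetical; only the `sea.*` attribute keys come from the spec.

```python
# Hypothetical carrier header names mirroring the sea.* span attributes.
SEA_HEADERS = {
    "sea.domain": "x-sea-domain",
    "sea.concept": "x-sea-concept",
    "sea.entity": "x-sea-entity",
    "sea.trace_id": "x-sea-trace-id",
}


def inject_envelope(envelope: dict, carrier: dict) -> dict:
    """Copy semantic attributes into an outbound carrier (headers, metadata)."""
    for attr, header in SEA_HEADERS.items():
        if attr in envelope:
            carrier[header] = str(envelope[attr])
    return carrier


def extract_envelope(carrier: dict) -> dict:
    """Recover semantic attributes from an inbound carrier."""
    return {
        attr: carrier[header]
        for attr, header in SEA_HEADERS.items()
        if header in carrier
    }
```

The extracted envelope would then be set as span attributes on the receiving side, keeping the semantic context intact throughout the trace.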
User Story: As a Site Reliability Engineer, I want to investigate performance issues using trace data, so that I can identify bottlenecks and optimize latency.
EARS Requirement:
- While investigating performance, when trace analysis is performed, the observability context shall:
- Query Traces by:
- Trace ID for specific request
- Service name for all traces
- Time range for historical analysis
- Semantic attributes (domain, concept)
- Analyze Span Data:
- Identify slow spans (high duration)
- Find parent-child relationships
- Calculate service latency breakdown
- Identify Bottlenecks:
- Rank spans by duration
- Highlight spans with errors
- Show external service call latency
- Generate Report with:
- Total request duration
- Per-service latency breakdown
- Top 5 slowest operations
- Error rate and error distribution
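The report generation above can be sketched over a list of span records. The span-dict shape (`span_id`, `parent_id`, `service`, `duration_ms`, `status`) is an assumption for illustration, not the SDK's wire format.

```python
def latency_report(spans: list[dict], top_n: int = 5) -> dict:
    """Summarize a trace: total duration, per-service breakdown, slowest ops."""
    root = next(s for s in spans if s.get("parent_id") is None)

    by_service: dict[str, float] = {}
    for s in spans:
        by_service[s["service"]] = by_service.get(s["service"], 0) + s["duration_ms"]

    slowest = sorted(spans, key=lambda s: s["duration_ms"], reverse=True)[:top_n]
    errors = [s for s in spans if s.get("status") == "ERROR"]

    return {
        "total_duration_ms": root["duration_ms"],
        "service_breakdown_ms": by_service,
        "slowest_operations": [(s["name"], s["duration_ms"]) for s in slowest],
        "error_rate": len(errors) / len(spans),
    }
```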
Job: Analyze Logs for Troubleshooting
User Story: As a developer, I want to analyze logs to diagnose issues, so that I can understand what happened in the system.
EARS Requirement:
- While analyzing logs, when a log query is performed, the observability context shall:
- Query Logs by:
- Service name and time range
- Log level (error, warning, info)
- Semantic attributes (sea_domain, sea_concept)
- Text search in log messages
- Correlate with Traces:
- Join logs with traces via trace_id
- Show log sequence within trace timeline
- Highlight errors in trace context
- Aggregate Metrics:
- Count logs by level and service
- Calculate error rate
- Identify patterns (spikes, repeated errors)
- Return Results with:
- Log entries with timestamp, level, message, attributes
- Associated trace IDs
- Aggregate statistics
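The aggregation step above can be sketched as follows; the log-entry dict shape (`level`, `service`, `trace_id`) is an assumption for illustration.

```python
from collections import Counter


def analyze_logs(logs: list[dict]) -> dict:
    """Aggregate log entries: counts by level/service, error rate, trace links."""
    by_level = Counter(log["level"] for log in logs)
    by_service = Counter(log["service"] for log in logs)
    # Only entries carrying a trace_id can be joined against trace data.
    correlated = [log["trace_id"] for log in logs if log.get("trace_id")]

    return {
        "counts_by_level": dict(by_level),
        "counts_by_service": dict(by_service),
        "error_rate": by_level.get("error", 0) / len(logs) if logs else 0.0,
        "trace_ids": sorted(set(correlated)),
    }
```

The returned `trace_ids` list drives the correlation requirement: each ID can be looked up to place the log entries on the trace timeline.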
Job: Respond to Incidents with Trace Data
User Story: As an incident responder, I want to use trace data to understand incident scope and impact, so that I can respond quickly and effectively.
EARS Requirement:
- While responding to an incident, when incident traces are queried, the observability context shall:
- Identify Incident Traces:
- Query traces by error message or status
- Filter by time range of incident
- Group by affected services
- Analyze Impact:
- Count failed requests per service
- Calculate error rate during incident
- Identify error patterns and root causes
- Trace Root Cause:
- Follow trace from entry point to failure
- Identify service where error originated
- Show error details and stack traces
- Generate Incident Report with:
- Timeline of incident
- Affected services and users
- Root cause analysis
- Recommended remediation
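The impact-analysis step above can be sketched over trace summaries filtered to the incident window. The trace-record shape (`service`, `status`, `start_ms`) is assumed for illustration.

```python
from collections import defaultdict


def incident_impact(traces: list[dict], start_ms: int, end_ms: int) -> dict:
    """Scope an incident: failed requests per service, error rate in the window."""
    in_window = [t for t in traces if start_ms <= t["start_ms"] <= end_ms]

    failures_by_service: dict[str, int] = defaultdict(int)
    for t in in_window:
        if t["status"] == "ERROR":
            failures_by_service[t["service"]] += 1

    failed = sum(failures_by_service.values())
    return {
        "affected_services": sorted(failures_by_service),
        "failed_requests": dict(failures_by_service),
        "error_rate": failed / len(in_window) if in_window else 0.0,
    }
```

Root-cause tracing would then start from the earliest ERROR trace in an affected service and follow its span tree to the originating failure.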
Job: Debug Traces for Request Flow
User Story: As a developer, I want to debug a specific request by following its trace, so that I can understand the complete request lifecycle.
EARS Requirement:
- While debugging traces, when a trace is queried by ID, the observability context shall:
- Retrieve complete trace by trace_id
- Display Span Tree:
- Show root span and all child spans
- Display parent-child relationships
- Indicate span duration on timeline
- Show Span Details:
- Span name, kind, status
- Start time and duration
- Attributes (including semantic attributes)
- Events and links
- Parent span ID
- Highlight Issues:
- Mark error spans in red
- Highlight slow spans in yellow
- Show exception details
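The span-tree display above can be sketched as a text rendering (colors in a real UI, `[ERROR]`/`[SLOW]` markers here). The span-dict shape and the 100 ms slow threshold are assumptions for illustration.

```python
def render_span_tree(spans: list[dict], slow_threshold_ms: float = 100.0) -> list[str]:
    """Render a trace as an indented span tree, flagging error and slow spans."""
    children: dict = {}
    for s in spans:
        children.setdefault(s.get("parent_id"), []).append(s)

    lines: list[str] = []

    def walk(parent_id, depth: int) -> None:
        # Siblings in timeline order, per the timeline display requirement.
        for s in sorted(children.get(parent_id, []), key=lambda s: s["start_ms"]):
            flags = []
            if s.get("status") == "ERROR":
                flags.append("ERROR")       # rendered red in a UI
            elif s["duration_ms"] >= slow_threshold_ms:
                flags.append("SLOW")        # rendered yellow in a UI
            suffix = f" [{', '.join(flags)}]" if flags else ""
            lines.append(f"{'  ' * depth}{s['name']} ({s['duration_ms']}ms){suffix}")
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent_id
    return lines
```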
Domain Entities Summary
Root Aggregates
- Trace: Distributed trace with trace_id, spans, duration, and root span
- Span: Single operation within trace with span_id, parent_id, name, kind, status, attributes, and events
- LogEntry: Structured log with timestamp, level, message, attributes, and trace_id correlation
- Metric: Time-series measurement with name, value, labels, and timestamp
Value Objects
- SemanticEnvelope: Semantic context with sea_domain, sea_concept, sea_entity, and sea_trace_id
- SpanAttribute: Key-value pair attached to span (string, number, boolean, array)
- LogAttribute: Key-value pair attached to log entry for filtering and correlation
- TraceQuery: Query filter by trace_id, service, time range, and semantic attributes
Policy Rules
- TraceIdContinuity: Trace ID must propagate across all service boundaries
- SemanticEnrichment: All spans must include semantic attributes when available
- StructuredLogging: All logs must use structured format with consistent attributes
- RetentionPeriod: Trace data retained for 7 days, logs for 30 days, metrics for 90 days (configurable per environment)
Integration Points
- OTel Collector: Central telemetry receiver and processor
- OpenObserve: Production observability backend for traces, metrics, logs
- Python Services: OpenTelemetry Python SDK for auto-instrumentation
- Logfire: Structured logging with semantic attributes
- Zpages: Debug UI for local trace inspection (http://localhost:55679/debug/tracez) with production access restricted
Success Metrics
- Trace Coverage: 100% of requests have complete traces
- Log Correlation: >95% of logs correlated with traces via trace_id
- Query Performance: <2 seconds for typical trace queries
- Incident MTTR: <30 minutes mean time to resolution using trace data
Non-Functional Requirements
- NFR-001: Collector handles 10,000 spans/second without data loss
- NFR-002: Queryability latency <5 seconds from span creation to availability (alerts governed by NFR-008)
- NFR-003: Semantic envelope propagates across all SEA™ service boundaries
- NFR-004: Collector health check responds in <100ms
- NFR-005: Authentication/authorization required for OTLP receivers, health check, and zpages in non-local environments
- NFR-006: Data privacy controls detect and scrub PII in traces, logs, and metrics before export
- NFR-007: Storage capacity planning enforces retention limits with alerting at 80% utilization
- NFR-008: Alerting latency <1 second from rule breach to notification