Observability Handbook Epic
User Journey
The Observability bounded context provides comprehensive system visibility through OpenTelemetry traces, metrics, and logs. It enables semantic envelope propagation, collector configuration, and runbooks for performance investigation, log analysis, incident response, and trace debugging across all SEA™ services. See the Metrics job below for instrumentation and collection requirements.
Jobs to be Done & EARS Requirements
Job: Metrics
User Story: As a Site Reliability Engineer, I want consistent metrics instrumentation and export, so that I can monitor system health and set actionable alerts.
EARS Requirement:
- While instrumenting services, when metrics are configured, the observability context shall:
- Counters:
- Emit counters for request_count, error_count, and retry_count
- Require labels: service.name, endpoint, status_code, and sea.domain
- Gauges:
- Emit gauges for queue_depth, memory_usage_mb, and cpu_percent
- Require labels: service.name and environment
- Histograms:
- Emit histograms for request_latency_ms and dependency_latency_ms
- Use standard buckets and include exemplars when available
- Export & Scrape:
- Export metrics via OTLP to the collector
- Default export interval 15s (configurable)
- Retention & Aggregation:
- Retain raw metrics for 7 days, downsampled aggregates for 30/90 days
- Require aggregation rollups by service and endpoint
- Alerting Thresholds:
- Define alerts for error_rate > 1% over 5 minutes and p95 latency > target
- Require alerts to include service, endpoint, and runbook link
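The alerting thresholds above can be sketched as a small evaluation function. The function names and signatures are illustrative, not part of the spec; the window counts are assumed to come from the 5-minute rollups required above.

```python
def error_rate(error_count: int, request_count: int) -> float:
    """Fraction of failed requests over the evaluation window."""
    return error_count / request_count if request_count else 0.0


def should_alert(
    error_count: int,
    request_count: int,
    p95_latency_ms: float,
    p95_target_ms: float,
    error_rate_threshold: float = 0.01,  # the >1% rule above
) -> bool:
    """Fire when the windowed error rate exceeds 1% or p95 latency misses target."""
    return (
        error_rate(error_count, request_count) > error_rate_threshold
        or p95_latency_ms > p95_target_ms
    )
```

A rule that fires would then attach the required service, endpoint, and runbook-link labels to the notification.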
User Story: As a developer, I want to instrument my Python service with OpenTelemetry tracing, so that I can visualize request flows and diagnose performance issues.
EARS Requirement:
- While instrumenting services, when the OpenTelemetry tracer is configured, the observability context shall:
- Configure TracerProvider:
- Create Resource with service.name and sea.platform attributes
- Add BatchSpanProcessor with OTLPSpanExporter
- Set exporter endpoint to localhost:4317 (or configured OTLP endpoint)
- Register provider as global tracer
- Create Spans:
- Get tracer from trace.get_tracer(name)
- Start spans with operation names and attributes
- Record exceptions in spans
- Set span status (OK, ERROR)
- Propagate Context:
- Inject trace context into outbound requests
- Extract trace context from inbound requests
- Use W3C Trace Context format
- Emit OTLP: Send spans to OTel Collector at configured endpoint
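A minimal configuration sketch of the steps above, using the OpenTelemetry Python SDK. The service name "checkout" and the `ValueError` handler are illustrative; the endpoint matches the collector's OTLP gRPC port from the collector job below.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes ride along on every span.
resource = Resource.create({"service.name": "checkout", "sea.platform": "sea"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)  # register as the global tracer provider

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("sea.domain", "orders")
    try:
        ...  # business logic
        span.set_status(trace.StatusCode.OK)
    except ValueError as exc:
        span.record_exception(exc)
        span.set_status(trace.StatusCode.ERROR)
```

For context propagation, the SDK's `opentelemetry.propagate.inject` and `extract` helpers default to the W3C Trace Context format, satisfying the propagation requirement without custom header handling.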
User Story: As a developer, I want to use Logfire for structured logging with semantic attributes, so that logs are queryable and correlated with traces.
EARS Requirement:
- While configuring logging, when Logfire is initialized, the observability context shall:
- Configure Logfire with:
service_name: Service identifier
send_to_logfire: False for local mode, True for cloud
console: Console output options with verbose level
- Emit Structured Logs:
- Log with semantic attributes (sea_domain, sea_concept, latency_ms)
- Include trace_id and span_id for correlation
- Support log levels (info, warning, error, debug)
- Query Logs:
- Filter by semantic attributes
- Correlate with traces via trace_id
- Aggregate metrics from log data
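A configuration sketch matching the requirement above, per the Logfire Python API; the service name "checkout" and the log attributes are illustrative.

```python
import logfire

logfire.configure(
    service_name="checkout",
    send_to_logfire=False,  # local mode; set True to export to Logfire cloud
    console=logfire.ConsoleOptions(verbose=True),
)

# Structured log with semantic attributes; Logfire attaches trace_id/span_id
# automatically when called inside an active span.
logfire.info(
    "order processed",
    sea_domain="orders",
    sea_concept="Order",
    latency_ms=42,
)
```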
Job: Start OTel Collector
User Story: As a platform engineer, I want to start the OTel Collector to receive telemetry from all services, so that I have centralized observability data.
EARS Requirement:
- While starting the collector, when the OTel Collector is launched via docker-compose, the observability context shall:
- Load configuration from infra/otel/otel-collector-config.yaml
- Expose receiver endpoints:
- OTLP gRPC: port 4317
- OTLP HTTP: port 4318
- Health check: port 13133
- Configure Pipeline:
- Receive OTLP traces/metrics/logs
- Process with batch and memory limiter processors
- Export to console (local) or OpenObserve (production)
- Verify Health:
- Respond to GET http://localhost:13133/ with 200 OK
- Expose zpages at http://localhost:55679/debug/tracez
- Disable zpages in production or protect it with network policies/authentication (localhost-only or admin subnet)
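A sketch of what infra/otel/otel-collector-config.yaml might contain to satisfy the requirements above. Exporter and extension names assume a recent collector distribution (the `debug` exporter replaced the older `logging` exporter); a production config would swap in an OpenObserve exporter and restrict the zpages endpoint.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:       # must run first in the pipeline
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 5s

exporters:
  debug:                # console output for local development
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: localhost:55679   # localhost-only, per the zpages protection rule

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
```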
Job: Propagate Semantic Envelope
User Story: As a service developer, I want to propagate semantic context across service boundaries, so that traces are enriched with domain concepts.
EARS Requirement:
- While processing requests, when a semantic envelope is present, the observability context shall:
- Extract Semantic Attributes:
sea.domain: Bounded context name
sea.concept: Domain concept identifier
sea.entity: Entity instance ID
sea.trace_id: Semantic trace identifier
- Inject into Spans:
- Add semantic attributes as span attributes
- Include in carrier propagation (HTTP headers, message metadata)
- Correlate Across Services:
- Maintain semantic attributes throughout trace
- Enable filtering by domain/concept in trace queries
- Support semantic trace search
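The inject/extract cycle above can be sketched with plain dictionaries standing in for HTTP headers or message metadata. The `x-sea-*` header names are hypothetical; only the `sea.*` attribute keys come from the spec.

```python
# Hypothetical carrier header names mirroring the sea.* span attributes.
SEA_HEADERS = {
    "sea.domain": "x-sea-domain",
    "sea.concept": "x-sea-concept",
    "sea.entity": "x-sea-entity",
    "sea.trace_id": "x-sea-trace-id",
}


def inject_envelope(envelope: dict, carrier: dict) -> dict:
    """Copy semantic attributes into an outbound carrier (headers, metadata)."""
    for attr, header in SEA_HEADERS.items():
        if attr in envelope:
            carrier[header] = str(envelope[attr])
    return carrier


def extract_envelope(carrier: dict) -> dict:
    """Recover semantic attributes from an inbound carrier."""
    return {
        attr: carrier[header]
        for attr, header in SEA_HEADERS.items()
        if header in carrier
    }
```

The extracted envelope would then be set as span attributes on the receiving side, keeping the semantic context intact throughout the trace.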
User Story: As a Site Reliability Engineer, I want to investigate performance issues using trace data, so that I can identify bottlenecks and optimize latency.
EARS Requirement:
- While investigating performance, when trace analysis is performed, the observability context shall:
- Query Traces by:
- Trace ID for specific request
- Service name for all traces
- Time range for historical analysis
- Semantic attributes (domain, concept)
- Analyze Span Data:
- Identify slow spans (high duration)
- Find parent-child relationships
- Calculate service latency breakdown
- Identify Bottlenecks:
- Rank spans by duration
- Highlight spans with errors
- Show external service call latency
- Generate Report with:
- Total request duration
- Per-service latency breakdown
- Top 5 slowest operations
- Error rate and error distribution
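The report generation above can be sketched over a list of span records. The span-dict shape (`span_id`, `parent_id`, `service`, `duration_ms`, `status`) is an assumption for illustration, not the SDK's wire format.

```python
def latency_report(spans: list[dict], top_n: int = 5) -> dict:
    """Summarize a trace: total duration, per-service breakdown, slowest ops."""
    root = next(s for s in spans if s.get("parent_id") is None)

    by_service: dict[str, float] = {}
    for s in spans:
        by_service[s["service"]] = by_service.get(s["service"], 0) + s["duration_ms"]

    slowest = sorted(spans, key=lambda s: s["duration_ms"], reverse=True)[:top_n]
    errors = [s for s in spans if s.get("status") == "ERROR"]

    return {
        "total_duration_ms": root["duration_ms"],
        "service_breakdown_ms": by_service,
        "slowest_operations": [(s["name"], s["duration_ms"]) for s in slowest],
        "error_rate": len(errors) / len(spans),
    }
```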
Job: Analyze Logs for Troubleshooting
User Story: As a developer, I want to analyze logs to diagnose issues, so that I can understand what happened in the system.
EARS Requirement:
- While analyzing logs, when a log query is performed, the observability context shall:
- Query Logs by:
- Service name and time range
- Log level (error, warning, info)
- Semantic attributes (sea_domain, sea_concept)
- Text search in log messages
- Correlate with Traces:
- Join logs with traces via trace_id
- Show log sequence within trace timeline
- Highlight errors in trace context
- Aggregate Metrics:
- Count logs by level and service
- Calculate error rate
- Identify patterns (spikes, repeated errors)
- Return Results with:
- Log entries with timestamp, level, message, attributes
- Associated trace IDs
- Aggregate statistics
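The aggregation step above can be sketched as follows; the log-entry dict shape (`level`, `service`, `trace_id`) is an assumption for illustration.

```python
from collections import Counter


def analyze_logs(logs: list[dict]) -> dict:
    """Aggregate log entries: counts by level/service, error rate, trace links."""
    by_level = Counter(log["level"] for log in logs)
    by_service = Counter(log["service"] for log in logs)
    # Only entries carrying a trace_id can be joined against trace data.
    correlated = [log["trace_id"] for log in logs if log.get("trace_id")]

    return {
        "counts_by_level": dict(by_level),
        "counts_by_service": dict(by_service),
        "error_rate": by_level.get("error", 0) / len(logs) if logs else 0.0,
        "trace_ids": sorted(set(correlated)),
    }
```

The returned `trace_ids` list drives the correlation requirement: each ID can be looked up to place the log entries on the trace timeline.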
Job: Respond to Incidents with Trace Data
User Story: As an incident responder, I want to use trace data to understand incident scope and impact, so that I can respond quickly and effectively.
EARS Requirement:
- While responding to an incident, when incident traces are queried, the observability context shall:
- Identify Incident Traces:
- Query traces by error message or status
- Filter by time range of incident
- Group by affected services
- Analyze Impact:
- Count failed requests per service
- Calculate error rate during incident
- Identify error patterns and root causes
- Trace Root Cause:
- Follow trace from entry point to failure
- Identify service where error originated
- Show error details and stack traces
- Generate Incident Report with:
- Timeline of incident
- Affected services and users
- Root cause analysis
- Recommended remediation
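The impact-analysis step above can be sketched over trace summaries filtered to the incident window. The trace-record shape (`service`, `status`, `start_ms`) is assumed for illustration.

```python
from collections import defaultdict


def incident_impact(traces: list[dict], start_ms: int, end_ms: int) -> dict:
    """Scope an incident: failed requests per service, error rate in the window."""
    in_window = [t for t in traces if start_ms <= t["start_ms"] <= end_ms]

    failures_by_service: dict[str, int] = defaultdict(int)
    for t in in_window:
        if t["status"] == "ERROR":
            failures_by_service[t["service"]] += 1

    failed = sum(failures_by_service.values())
    return {
        "affected_services": sorted(failures_by_service),
        "failed_requests": dict(failures_by_service),
        "error_rate": failed / len(in_window) if in_window else 0.0,
    }
```

Root-cause tracing would then start from the earliest ERROR trace in an affected service and follow its span tree to the originating failure.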
Job: Debug Traces for Request Flow
User Story: As a developer, I want to debug a specific request by following its trace, so that I can understand the complete request lifecycle.
EARS Requirement:
- While debugging traces, when a trace is queried by ID, the observability context shall:
- Retrieve complete trace by trace_id
- Display Span Tree:
- Show root span and all child spans
- Display parent-child relationships
- Indicate span duration on timeline
- Show Span Details:
- Span name, kind, status
- Start time and duration
- Attributes (including semantic attributes)
- Events and links
- Parent span ID
- Highlight Issues:
- Mark error spans in red
- Highlight slow spans in yellow
- Show exception details
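The span-tree display above can be sketched as a text rendering (colors in a real UI, `[ERROR]`/`[SLOW]` markers here). The span-dict shape and the 100 ms slow threshold are assumptions for illustration.

```python
def render_span_tree(spans: list[dict], slow_threshold_ms: float = 100.0) -> list[str]:
    """Render a trace as an indented span tree, flagging error and slow spans."""
    children: dict = {}
    for s in spans:
        children.setdefault(s.get("parent_id"), []).append(s)

    lines: list[str] = []

    def walk(parent_id, depth: int) -> None:
        # Siblings in timeline order, per the timeline display requirement.
        for s in sorted(children.get(parent_id, []), key=lambda s: s["start_ms"]):
            flags = []
            if s.get("status") == "ERROR":
                flags.append("ERROR")       # rendered red in a UI
            elif s["duration_ms"] >= slow_threshold_ms:
                flags.append("SLOW")        # rendered yellow in a UI
            suffix = f" [{', '.join(flags)}]" if flags else ""
            lines.append(f"{'  ' * depth}{s['name']} ({s['duration_ms']}ms){suffix}")
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent_id
    return lines
```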
Domain Entities Summary
Root Aggregates
- Trace: Distributed trace with trace_id, spans, duration, and root span
- Span: Single operation within trace with span_id, parent_id, name, kind, status, attributes, and events
- LogEntry: Structured log with timestamp, level, message, attributes, and trace_id correlation
- Metric: Time-series measurement with name, value, labels, and timestamp
Value Objects
- SemanticEnvelope: Semantic context with sea_domain, sea_concept, sea_entity, and sea_trace_id
- SpanAttribute: Key-value pair attached to span (string, number, boolean, array)
- LogAttribute: Key-value pair attached to log entry for filtering and correlation
- TraceQuery: Query filter by trace_id, service, time range, and semantic attributes
Policy Rules
- TraceIdContinuity: Trace ID must propagate across all service boundaries
- SemanticEnrichment: All spans must include semantic attributes when available
- StructuredLogging: All logs must use structured format with consistent attributes
- RetentionPeriod: Trace data retained for 7 days, logs for 30 days, metrics for 90 days (configurable per environment)
Integration Points
- OTel Collector: Central telemetry receiver and processor
- OpenObserve: Production observability backend for traces, metrics, logs
- Python Services: OpenTelemetry Python SDK for auto-instrumentation
- Logfire: Structured logging with semantic attributes
- Zpages: Debug UI for local trace inspection (http://localhost:55679/debug/tracez) with production access restricted
Success Metrics
- Trace Coverage: 100% of requests have complete traces
- Log Correlation: >95% of logs correlated with traces via trace_id
- Query Performance: <2 seconds for typical trace queries
- Incident MTTR: <30 minutes mean time to resolution using trace data
Non-Functional Requirements
- NFR-001: Collector handles 10,000 spans/second without data loss
- NFR-002: Queryability latency <5 seconds from span creation to availability (alerts governed by NFR-008)
- NFR-003: Semantic envelope propagates across all SEA™ service boundaries
- NFR-004: Collector health check responds in <100ms
- NFR-005: Authentication/authorization required for OTLP receivers, health check, and zpages in non-local environments
- NFR-006: Data privacy controls detect and scrub PII in traces, logs, and metrics before export
- NFR-007: Storage capacity planning enforces retention limits with alerting at 80% utilization
- NFR-008: Alerting latency <1 second from rule breach to notification