How To: Debug with OpenTelemetry

Use OpenTelemetry traces, metrics, and logs to debug production issues in SEA-Forge™ services.


Prerequisites


Common Scenarios

1. Slow API Request

Symptom: User reports slow page load

Steps:

# 1. Find trace ID from logs or HTTP response header
curl -v https://api.sea-forge.local/cases/123
# Look for: X-Trace-Id: abc123...

# 2. Query OpenObserve for trace
# UI: http://localhost:5080
# Search: trace_id="abc123..."

# 3. Identify slowest spans in waterfall view
# Look for: database queries, external APIs, long-running tasks

# 4. Check span attributes for details
# Example: sql.query, http.url, error messages
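The handoff between steps 1 and 2 can be sketched in Python. The `X-Trace-Id` header name and the OpenObserve search syntax come from the comments above; the function name is illustrative.

```python
def trace_query_from_headers(headers: dict) -> str:
    """Build an OpenObserve trace search query from HTTP response headers.

    Assumes the service returns the trace ID in the X-Trace-Id header,
    as shown in step 1 above.
    """
    trace_id = headers.get("X-Trace-Id")
    if trace_id is None:
        raise KeyError("response carried no X-Trace-Id header")
    # Step 2: paste this into the OpenObserve UI (http://localhost:5080)
    return f'trace_id="{trace_id}"'

print(trace_query_from_headers({"X-Trace-Id": "abc123"}))
# trace_id="abc123"
```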

Common Causes:


2. 500 Error Investigation

Symptom: Application returns 500 error

Steps:

# 1. Search logs for error
# OpenObserve UI: http://localhost:5080/logs
# Query: level='ERROR' AND timestamp > now() - 1h

# 2. Get trace ID from error context
# Log entry should include: trace_id, span_id

# 3. View full trace to see failure point
# Traces show: which service failed, exception details

# 4. Check span events for exception stack trace
# Span attributes include: exception.type, exception.message
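Steps 1 and 2 above boil down to filtering recent ERROR logs and collecting their trace IDs for the follow-up trace lookup. A minimal sketch, assuming each log entry is a dict with `level`, `timestamp`, and `trace_id` fields (the same fields the log query above relies on):

```python
from datetime import datetime, timedelta, timezone

def recent_error_trace_ids(log_entries, window=timedelta(hours=1)):
    """Return the distinct trace IDs of ERROR logs inside the window.

    Mirrors the query: level='ERROR' AND timestamp > now() - 1h.
    'timestamp' is assumed to be a timezone-aware datetime.
    """
    cutoff = datetime.now(timezone.utc) - window
    return sorted({
        entry["trace_id"]
        for entry in log_entries
        if entry["level"] == "ERROR" and entry["timestamp"] > cutoff
    })
```

Each returned trace ID can then be searched in OpenObserve to find the failing span and its `exception.*` attributes.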

3. Performance Regression

Symptom: Latency increased after deployment

Steps:

# 1. Compare metrics before/after deployment
# Query: http_request_duration_seconds_bucket{le="0.5"}
# Compare: time ranges before and after deploy

# 2. Identify services with increased latency
# Group by: service_name

# 3. Sample traces from both periods
# Before: Get 10 traces from previous hour
# After: Get 10 traces from current hour

# 4. Diff traces to find new spans or increased duration
# Look for: new database calls, changed algorithms
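Step 4's trace diff can be sketched as follows. Each trace is simplified to a mapping of span name to duration in milliseconds (real spans also carry IDs and attributes); the function reports spans that are new after the deploy and spans that got slower.

```python
def diff_traces(before, after):
    """Compare two traces span-by-span.

    'before' and 'after' map span name -> duration in milliseconds.
    Returns (new_spans, slower) where 'slower' maps span name to the
    duration increase in milliseconds.
    """
    new_spans = sorted(set(after) - set(before))
    slower = {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] > before[name]
    }
    return new_spans, slower

# A new cache-miss span appeared and the DB query got 200 ms slower:
print(diff_traces({"db.query": 100}, {"db.query": 300, "cache.miss": 50}))
# (['cache.miss'], {'db.query': 200})
```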

Query Patterns

Find Recent Errors

-- OpenObserve SQL query
SELECT timestamp, service, message, trace_id
FROM logs
WHERE level = 'ERROR'
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC
LIMIT 50;

Top Slowest Endpoints

# Prometheus query
topk(10,
  histogram_quantile(0.95,
    rate(http_request_duration_seconds_bucket[5m])
  )
)

Error Rate by Service

sum by (service_name) (
  rate(http_requests_total{status=~"5.."}[5m])
) / sum by (service_name) (
  rate(http_requests_total[5m])
)
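The PromQL above divides each service's 5xx request rate by its total request rate. The same arithmetic applied to raw counter increments looks like this (the label shapes are illustrative):

```python
def error_rate(requests):
    """Compute the per-service error ratio from request counts.

    'requests' maps (service_name, status) -> request count over some
    window, matching the labels in the PromQL query above. A status
    starting with '5' counts as an error.
    """
    totals, errors = {}, {}
    for (service, status), count in requests.items():
        totals[service] = totals.get(service, 0) + count
        if status.startswith("5"):
            errors[service] = errors.get(service, 0) + count
    return {service: errors.get(service, 0) / total
            for service, total in totals.items()}

print(error_rate({("api", "200"): 90, ("api", "500"): 10}))
# {'api': 0.1}
```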

Best Practices

  1. Always start with traces - They show the full request flow
  2. Correlate with metrics - Confirm patterns across multiple requests
  3. Check logs for context - Exception details, business logic errors
  4. Use semantic attributes - Filter by sea.caseId, sea.conceptId
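Practice 4 amounts to filtering spans on a semantic attribute such as `sea.caseId`. A minimal sketch, assuming spans are returned as dicts with an `attributes` mapping (roughly the shape trace backends expose):

```python
def spans_for_case(spans, case_id):
    """Narrow a span search to one business entity by filtering on the
    sea.caseId semantic attribute from the list above.
    """
    return [
        span for span in spans
        if span.get("attributes", {}).get("sea.caseId") == case_id
    ]

spans = [
    {"name": "GET /cases/123", "attributes": {"sea.caseId": "123"}},
    {"name": "GET /cases/456", "attributes": {"sea.caseId": "456"}},
]
print([s["name"] for s in spans_for_case(spans, "123")])
# ['GET /cases/123']
```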