How To: Debug with OpenTelemetry

Use OpenTelemetry traces, metrics, and logs to debug production issues in SEA-Forge™ services.


Prerequisites


Common Scenarios

1. Slow API Request

Symptom: User reports slow page load

Steps:

# 1. Find trace ID from logs or HTTP response header
curl -v https://api.sea-forge.local/cases/123
# Look for: X-Trace-Id: abc123...

# 2. Query OpenObserve for trace
# UI: http://localhost:5080
# Search: trace_id="abc123..."

# 3. Identify slowest spans in waterfall view
# Look for: database queries, external APIs, long-running tasks

# 4. Check span attributes for details
# Example: sql.query, http.url, error messages
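The handoff between steps 1 and 2 can be sketched in Python. The `X-Trace-Id` header name and the OpenObserve search syntax come from the comments above; the function name is illustrative.

```python
def trace_query_from_headers(headers: dict) -> str:
    """Build an OpenObserve trace search query from HTTP response headers.

    Assumes the service returns the trace ID in the X-Trace-Id header,
    as shown in step 1 above.
    """
    trace_id = headers.get("X-Trace-Id")
    if trace_id is None:
        raise KeyError("response carried no X-Trace-Id header")
    # Step 2: paste this into the OpenObserve UI (http://localhost:5080)
    return f'trace_id="{trace_id}"'

print(trace_query_from_headers({"X-Trace-Id": "abc123"}))
# trace_id="abc123"
```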

Common Causes:


2. 500 Error Investigation

Symptom: Application returns 500 error

Steps:

# 1. Search logs for error
# OpenObserve UI: http://localhost:5080/logs
# Query: level='ERROR' AND timestamp > now() - 1h

# 2. Get trace ID from error context
# Log entry should include: trace_id, span_id

# 3. View full trace to see failure point
# Traces show: which service failed, exception details

# 4. Check span events for exception stack trace
# Span attributes include: exception.type, exception.message
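Steps 1 and 2 above boil down to filtering recent ERROR logs and collecting their trace IDs for the follow-up trace lookup. A minimal sketch, assuming each log entry is a dict with `level`, `timestamp`, and `trace_id` fields (the same fields the log query above relies on):

```python
from datetime import datetime, timedelta, timezone

def recent_error_trace_ids(log_entries, window=timedelta(hours=1)):
    """Return the distinct trace IDs of ERROR logs inside the window.

    Mirrors the query: level='ERROR' AND timestamp > now() - 1h.
    'timestamp' is assumed to be a timezone-aware datetime.
    """
    cutoff = datetime.now(timezone.utc) - window
    return sorted({
        entry["trace_id"]
        for entry in log_entries
        if entry["level"] == "ERROR" and entry["timestamp"] > cutoff
    })
```

Each returned trace ID can then be searched in OpenObserve to find the failing span and its `exception.*` attributes.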

3. Performance Regression

Symptom: Latency increased after deployment

Steps:

# 1. Compare metrics before/after deployment
# Query: http_request_duration_seconds_bucket{le="0.5"}
# Compare: time ranges before and after deploy

# 2. Identify services with increased latency
# Group by: service_name

# 3. Sample traces from both periods
# Before: Get 10 traces from previous hour
# After: Get 10 traces from current hour

# 4. Diff traces to find new spans or increased duration
# Look for: new database calls, changed algorithms
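Step 4's trace diff can be sketched as follows. Each trace is simplified to a mapping of span name to duration in milliseconds (real spans also carry IDs and attributes); the function reports spans that are new after the deploy and spans that got slower.

```python
def diff_traces(before, after):
    """Compare two traces span-by-span.

    'before' and 'after' map span name -> duration in milliseconds.
    Returns (new_spans, slower) where 'slower' maps span name to the
    duration increase in milliseconds.
    """
    new_spans = sorted(set(after) - set(before))
    slower = {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] > before[name]
    }
    return new_spans, slower

# A new cache-miss span appeared and the DB query got 200 ms slower:
print(diff_traces({"db.query": 100}, {"db.query": 300, "cache.miss": 50}))
# (['cache.miss'], {'db.query': 200})
```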

Query Patterns

Find Recent Errors

-- OpenObserve SQL query
SELECT timestamp, service, message, trace_id
FROM logs
WHERE level = 'ERROR'
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC
LIMIT 50;

Top Slowest Endpoints

# Prometheus query
topk(10,
  histogram_quantile(0.95,
    rate(http_request_duration_seconds_bucket[5m])
  )
)

Error Rate by Service

sum by (service_name) (
  rate(http_requests_total{status=~"5.."}[5m])
) / sum by (service_name) (
  rate(http_requests_total[5m])
)
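The PromQL above divides each service's 5xx request rate by its total request rate. The same arithmetic applied to raw counter increments looks like this (the label shapes are illustrative):

```python
def error_rate(requests):
    """Compute the per-service error ratio from request counts.

    'requests' maps (service_name, status) -> request count over some
    window, matching the labels in the PromQL query above. A status
    starting with '5' counts as an error.
    """
    totals, errors = {}, {}
    for (service, status), count in requests.items():
        totals[service] = totals.get(service, 0) + count
        if status.startswith("5"):
            errors[service] = errors.get(service, 0) + count
    return {service: errors.get(service, 0) / total
            for service, total in totals.items()}

print(error_rate({("api", "200"): 90, ("api", "500"): 10}))
# {'api': 0.1}
```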

Best Practices

  1. Always start with traces - They show the full request flow
  2. Correlate with metrics - Confirm patterns across multiple requests
  3. Check logs for context - Exception details, business logic errors
  4. Use semantic attributes - Filter by sea.caseId, sea.conceptId
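Practice 4 amounts to filtering spans on a semantic attribute such as `sea.caseId`. A minimal sketch, assuming spans are returned as dicts with an `attributes` mapping (roughly the shape trace backends expose):

```python
def spans_for_case(spans, case_id):
    """Narrow a span search to one business entity by filtering on the
    sea.caseId semantic attribute from the list above.
    """
    return [
        span for span in spans
        if span.get("attributes", {}).get("sea.caseId") == case_id
    ]

spans = [
    {"name": "GET /cases/123", "attributes": {"sea.caseId": "123"}},
    {"name": "GET /cases/456", "attributes": {"sea.caseId": "456"}},
]
print([s["name"] for s in spans_for_case(spans, "123")])
# ['GET /cases/123']
```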