LLM Provider Epic
User Journey
The LLM Provider bounded context provides a unified abstraction layer for accessing multiple Large Language Model providers (OpenAI, Anthropic, Ollama, OpenRouter, and 100+ others via LiteLLM). It enables chat completions and embedding generation with provider-agnostic interfaces, intelligent routing, fallback chains, policy governance integration, and comprehensive observability through OpenTelemetry.
Jobs to be Done & EARS Requirements
Job: Execute CompleteChat
User Story: As an application or service, I want to send chat messages and receive LLM completions through a unified interface, so that I can leverage multiple LLM providers without provider-specific code.
EARS Requirement:
- While the system is operational, when a CompleteChat command (CMD-001) is received with ChatMessage[] containing role (system/user/assistant/function), content, and optional function_call, the llm-provider context shall:
  - Validate messages are non-empty per MessagesNonEmpty policy (POL-LLM-003)
  - Select appropriate provider and model based on LlmProvider configuration
  - Apply provider-specific transformations via LiteLLM abstraction layer
  - Route request through Policy Gateway in production (per GatewayBypassOnlyInDev policy POL-LLM-005) or bypass in development
  - Execute chat completion with configured timeout and retry logic from ProviderConfig
  - Create ChatCompletion entity containing:
    - Response content with finish_reason (stop/length/content_filter/function_call)
    - TokenUsage metrics (prompt_tokens, completion_tokens, total_tokens)
    - Latency tracking in milliseconds
  - Calculate cost based on provider-specific pricing and token usage
  - Update ProviderHealth with latency and error/success status
  - Emit ChatCompleted domain event to llm-provider.chat_completed.v1 NATS topic
  - Create OpenTelemetry span with model, latency, and token usage attributes
  - Handle fallback chain if primary provider fails (error or rate limit)
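The validation, fallback, and latency-tracking steps above can be sketched as follows. This is an illustrative outline, not the real implementation: `call_provider` is a hypothetical injected callable standing in for the LiteLLM/gateway call, and the `ChatCompletion` fields are reduced to those named in this requirement.

```python
import time
from dataclasses import dataclass


@dataclass
class ChatCompletion:
    content: str
    finish_reason: str          # stop/length/content_filter/function_call
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def complete_chat(messages, fallback_chain, call_provider):
    """Walk the fallback chain of provider IDs until one succeeds."""
    if not messages:
        raise ValueError("MessagesNonEmpty (POL-LLM-003): messages must be non-empty")
    last_error = None
    for provider_id in fallback_chain:
        start = time.monotonic()
        try:
            content, usage, reason = call_provider(provider_id, messages)
        except Exception as exc:    # provider error or rate limit triggers fallback
            last_error = exc
            continue
        latency_ms = (time.monotonic() - start) * 1000
        return ChatCompletion(content, reason,
                              usage["prompt"], usage["completion"], latency_ms)
    raise RuntimeError(f"all providers in fallback chain failed: {last_error}")
```

A real implementation would also emit the ChatCompleted event and OpenTelemetry span after the successful return path.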
Job: Execute GenerateEmbedding
User Story: As an application or service, I want to generate vector embeddings for text input, so that I can perform semantic search and similarity operations.
EARS Requirement:
- While the system is operational, when a GenerateEmbedding command (CMD-002) is received with input text and model specification, the llm-provider context shall:
  - Validate input is non-empty per EmbeddingInputNonEmpty policy (POL-LLM-004)
  - Generate idempotency key as hash(input + model) per flow configuration
  - Check cache for existing embedding with same idempotency key
  - If cached, return existing Embedding entity
  - If not cached, select appropriate provider and embedding model from ModelSpec
  - Route through Policy Gateway in production (per GatewayBypassOnlyInDev policy POL-LLM-005)
  - Execute embedding generation via LiteLLM abstraction layer
  - Create Embedding entity containing:
    - Vector array with dimension count
    - Input text reference
    - Model identifier used
  - Validate vector dimensions match expected ModelSpec dimensions
  - Cache embedding with idempotency key
  - Update ProviderHealth with latency and status
  - Emit EmbeddingGenerated domain event to llm-provider.embedding_generated.v1 NATS topic
  - Create OpenTelemetry span with model and dimension attributes
  - Support batch embedding for multiple inputs
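The idempotency-key, cache-check, and dimension-validation steps can be sketched as below. The SHA-256 keying and in-process dict cache are assumptions for illustration; the epic only specifies "hash(input + model)" and a cache layer, not the hash function or cache backend.

```python
import hashlib

# Stand-in for the real cache layer (assumption: a simple in-process dict).
_embedding_cache: dict[str, list[float]] = {}


def idempotency_key(input_text: str, model: str) -> str:
    # hash(input + model), per the flow configuration above
    return hashlib.sha256(f"{input_text}|{model}".encode()).hexdigest()


def generate_embedding(input_text, model, embed_fn, expected_dims):
    """embed_fn is a hypothetical injected provider call (e.g. via LiteLLM)."""
    if not input_text:
        raise ValueError("EmbeddingInputNonEmpty (POL-LLM-004): input must be non-empty")
    key = idempotency_key(input_text, model)
    if key in _embedding_cache:                 # cache hit: return existing entity
        return _embedding_cache[key]
    vector = embed_fn(input_text, model)
    if len(vector) != expected_dims:            # must match ModelSpec dimensions
        raise ValueError(f"expected {expected_dims} dims, got {len(vector)}")
    _embedding_cache[key] = vector
    return vector
```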
Job: Retrieve ListAvailableModels
User Story: As an application or UI component, I want to query available models and their capabilities, so that I can display model options and validate compatibility.
EARS Requirement:
- While the system is operational, when a ListAvailableModels query (QRY-001) is received with an optional provider filter, the llm-provider context shall:
  - Query ProviderConfiguration read model (strong consistency)
  - Return array of ModelSpec entries containing:
    - model_name: Model identifier
    - provider_id: Associated provider reference
    - context_window: Maximum input context length
    - max_tokens: Maximum output tokens (must be positive per ModelHasPositiveMaxTokens policy POL-LLM-002)
    - cost_metrics: Input/output token pricing
    - feature_flags: Supported capabilities (chat, embedding, function_calling, streaming)
    - is_available: Current availability status
  - Filter by provider_id if specified
  - Return OpenAI-compatible model list format for external integration
  - Include provider name and type for each model
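A minimal sketch of the query shape, assuming the OpenAI `/v1/models` list envelope (`{"object": "list", "data": [...]}`); the ModelSpec fields are taken from the list above, with cost_metrics omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    model_name: str
    provider_id: str
    context_window: int
    max_tokens: int
    feature_flags: tuple        # e.g. ("chat", "streaming")
    is_available: bool = True


def list_available_models(specs, provider_id=None):
    """Optionally filter by provider, then emit OpenAI-compatible list format."""
    rows = [s for s in specs if provider_id is None or s.provider_id == provider_id]
    return {
        "object": "list",
        "data": [{"id": s.model_name, "object": "model", "owned_by": s.provider_id}
                 for s in rows],
    }
```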
Job: Retrieve GetProviderHealth
User Story: As a monitoring system or load balancer, I want to check provider health status and performance metrics, so that I can make routing decisions and detect issues.
EARS Requirement:
- While the system is operational, when a GetProviderHealth query (QRY-002) is received with a provider identifier, the llm-provider context shall:
  - Query HealthCache read model (eventual consistency)
  - Return ProviderHealth entity containing:
    - provider_id: Provider identifier
    - is_healthy: Boolean health status
    - error_count: Recent error count for circuit breaker
    - average_latency: Average response time in milliseconds
    - last_error: Most recent error message (if any)
    - last_success_timestamp: Last successful request time
    - rate_limit_status: Current rate limit state (OK/THROTTLED/BLOCKED)
  - Return aggregate health if no specific provider requested
  - Include fallback chain status if configured
  - Calculate health score based on error rate and latency thresholds
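The epic requires a health score from error rate and latency but does not fix a formula; one plausible sketch, with the 0.7/0.3 weights and the latency budget as pure assumptions:

```python
def health_score(error_count: int, request_count: int,
                 average_latency_ms: float,
                 latency_budget_ms: float = 2000.0) -> float:
    """Blend error rate and latency (vs. a budget) into a 0.0-1.0 score.

    Weights (0.7 errors / 0.3 latency) and the 2s budget are illustrative
    defaults, not values specified by this epic.
    """
    if request_count == 0:
        return 1.0                                   # no traffic: assume healthy
    error_rate = error_count / request_count
    latency_penalty = min(average_latency_ms / latency_budget_ms, 1.0)
    return max(0.0, 1.0 - 0.7 * error_rate - 0.3 * latency_penalty)
```

A load balancer could then mark a provider unhealthy below some threshold (e.g. 0.5) and trip the circuit breaker on repeated errors.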
Domain Entities Summary
Root Aggregates
- LlmProvider: Represents an LLM provider configuration with id, name, provider_type (OpenAI/Anthropic/Ollama/OpenRouter), endpoint, api_key_ref, and is_active status
- ChatCompletion: Result of chat completion requests with response, TokenUsage metrics, finish_reason, and latency tracking
- Embedding: Vector embeddings for text with vector array, dimensions, input text reference, and model identifier
Value Objects
- ProviderConfig: Runtime configuration per provider including timeout, retry count, and rate limits (RPM/TPM)
- ModelSpec: Defines capabilities and constraints of specific models (model_name, provider_id, context_window, max_tokens, cost metrics, feature flags)
- ChatMessage: Individual messages in conversation with role (system/user/assistant/function), content, and optional name/function_call
- TokenUsage: Tracks token consumption and its cost (prompt_tokens, completion_tokens, total_tokens, calculated cost)
- FallbackChain: Ordered list of provider IDs for backup with error and rate limit fallback triggers
- ProviderHealth: Health monitoring with is_healthy status, error_count, average_latency, and last_error
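The TokenUsage value object and its cost calculation can be sketched as below; per-1k-token pricing is an assumed convention (common across providers), and the actual rates would come from the provider-specific cost_metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def cost(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        # Rates are caller-supplied from the provider's cost_metrics;
        # per-1k-token pricing is an assumption of this sketch.
        return (self.prompt_tokens / 1000) * input_price_per_1k \
             + (self.completion_tokens / 1000) * output_price_per_1k
```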
Read Models
- ProviderConfiguration: Materialized view of provider/model configurations for query optimization with strong consistency guarantees
- HealthCache: Eventual-consistency read model for provider health monitoring with TTL and read-only usage
Policy Rules
- ProviderHasName (POL-LLM-001): ProviderConfig must have a name
- ModelHasPositiveMaxTokens (POL-LLM-002): ModelSpec must define a positive max_tokens
- MessagesNonEmpty (POL-LLM-003): ChatCompletion messages must be non-empty
- EmbeddingInputNonEmpty (POL-LLM-004): Embedding input must be non-empty
- GatewayBypassOnlyInDev (POL-LLM-005): Policy Gateway bypass only allowed in development
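The policy rules above are simple invariants; a sketch of checking them as guard clauses (the function signature is illustrative, and POL-LLM-004 is enforced in the embedding flow rather than here):

```python
def enforce_policies(provider_name: str, max_tokens: int,
                     messages: list, environment: str,
                     bypass_gateway: bool) -> None:
    """Raise ValueError on the first violated POL-LLM invariant."""
    if not provider_name:
        raise ValueError("POL-LLM-001: ProviderConfig must have a name")
    if max_tokens <= 0:
        raise ValueError("POL-LLM-002: ModelSpec max_tokens must be positive")
    if not messages:
        raise ValueError("POL-LLM-003: chat messages must be non-empty")
    # POL-LLM-004 (embedding input non-empty) is checked in the embedding flow.
    if bypass_gateway and environment != "development":
        raise ValueError("POL-LLM-005: gateway bypass only allowed in development")
```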
Integration Points
- Policy Gateway: Routes all production requests through policy enforcement with circuit breaker pattern
- LiteLLM Abstraction: Unified interface for 100+ LLM providers (OpenAI, Anthropic, Ollama, OpenRouter, etc.)
- OpenTelemetry: Comprehensive observability with spans for model, latency, token usage, and exceptions
- Cache Layer: Idempotency key lookup and embedding cache (hit/miss handling for GenerateEmbedding)
- NATS Messaging: Event publishing to topics (llm-provider.chat_completed.v1, llm-provider.embedding_generated.v1)
- Query Context: Provides chat completion and embedding services to query pipeline
- Memory Context: Supplies embedding generation for vector indexing
- Ingest Context: Uses embeddings for document chunking and indexing
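The epic names the NATS topics but does not fix an event schema; a minimal sketch of building a chat_completed.v1 payload, where the envelope fields (event_id, occurred_at) and the nested data shape are assumptions of this sketch:

```python
import json
import uuid
from datetime import datetime, timezone


def chat_completed_event(completion_id: str, model: str,
                         total_tokens: int, latency_ms: float) -> bytes:
    """Serialize a payload for the llm-provider.chat_completed.v1 topic."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),          # assumed envelope field
        "type": "llm-provider.chat_completed.v1",
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "data": {
            "completion_id": completion_id,
            "model": model,
            "total_tokens": total_tokens,
            "latency_ms": latency_ms,
        },
    }).encode()
```

The publisher would hand these bytes to the NATS client (e.g. `nc.publish("llm-provider.chat_completed.v1", payload)` with nats-py).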