LLM Provider Epic
User Journey
The LLM Provider bounded context provides a unified abstraction layer for accessing multiple Large Language Model providers (OpenAI, Anthropic, Ollama, OpenRouter, and 100+ others via LiteLLM). It enables chat completions and embedding generation with provider-agnostic interfaces, intelligent routing, fallback chains, policy governance integration, and comprehensive observability through OpenTelemetry.
Jobs to be Done & EARS Requirements
Job: Execute CompleteChat
User Story: As an application or service, I want to send chat messages and receive LLM completions through a unified interface, so that I can leverage multiple LLM providers without provider-specific code.
EARS Requirement:
- While the system is operational, when a CompleteChat command (CMD-001) is received with ChatMessage[] containing role (system/user/assistant/function), content, and optional function_call, the llm-provider context shall:
  - Validate messages are non-empty per MessagesNonEmpty policy (POL-LLM-003)
  - Select appropriate provider and model based on LlmProvider configuration
  - Apply provider-specific transformations via LiteLLM abstraction layer
  - Route request through Policy Gateway in production (per GatewayBypassOnlyInDev policy POL-LLM-005) or bypass in development
  - Execute chat completion with configured timeout and retry logic from ProviderConfig
  - Create ChatCompletion entity containing:
    - Response content with finish_reason (stop/length/content_filter/function_call)
    - TokenUsage metrics (prompt_tokens, completion_tokens, total_tokens)
    - Latency tracking in milliseconds
  - Calculate cost based on provider-specific pricing and token usage
  - Update ProviderHealth with latency and error/success status
  - Emit ChatCompleted domain event to llm-provider.chat_completed.v1 NATS topic
  - Create OpenTelemetry span with model, latency, and token usage attributes
  - Handle fallback chain if primary provider fails (error or rate limit)
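The validation, fallback, and latency-tracking steps above can be sketched as follows. This is an illustrative outline, not the real implementation: `call_provider` is a hypothetical injected callable standing in for the LiteLLM/gateway call, and the `ChatCompletion` fields are reduced to those named in this requirement.

```python
import time
from dataclasses import dataclass


@dataclass
class ChatCompletion:
    content: str
    finish_reason: str          # stop/length/content_filter/function_call
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def complete_chat(messages, fallback_chain, call_provider):
    """Walk the fallback chain of provider IDs until one succeeds."""
    if not messages:
        raise ValueError("MessagesNonEmpty (POL-LLM-003): messages must be non-empty")
    last_error = None
    for provider_id in fallback_chain:
        start = time.monotonic()
        try:
            content, usage, reason = call_provider(provider_id, messages)
        except Exception as exc:    # provider error or rate limit triggers fallback
            last_error = exc
            continue
        latency_ms = (time.monotonic() - start) * 1000
        return ChatCompletion(content, reason,
                              usage["prompt"], usage["completion"], latency_ms)
    raise RuntimeError(f"all providers in fallback chain failed: {last_error}")
```

A real implementation would also emit the ChatCompleted event and OpenTelemetry span after the successful return path.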
Job: Execute GenerateEmbedding
User Story: As an application or service, I want to generate vector embeddings for text input, so that I can perform semantic search and similarity operations.
EARS Requirement:
- While the system is operational, when a GenerateEmbedding command (CMD-002) is received with input text and model specification, the llm-provider context shall:
  - Validate input is non-empty per EmbeddingInputNonEmpty policy (POL-LLM-004)
  - Generate idempotency key as hash(input + model) per flow configuration
  - Check cache for existing embedding with same idempotency key
  - If cached, return existing Embedding entity
  - If not cached, select appropriate provider and embedding model from ModelSpec
  - Route through Policy Gateway in production (per GatewayBypassOnlyInDev policy POL-LLM-005)
  - Execute embedding generation via LiteLLM abstraction layer
  - Create Embedding entity containing:
    - Vector array with dimension count
    - Input text reference
    - Model identifier used
  - Validate vector dimensions match expected ModelSpec dimensions
  - Cache embedding with idempotency key
  - Update ProviderHealth with latency and status
  - Emit EmbeddingGenerated domain event to llm-provider.embedding_generated.v1 NATS topic
  - Create OpenTelemetry span with model and dimension attributes
  - Support batch embedding for multiple inputs
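The idempotency-key, cache-check, and dimension-validation steps can be sketched as below. The SHA-256 keying and in-process dict cache are assumptions for illustration; the epic only specifies "hash(input + model)" and a cache layer, not the hash function or cache backend.

```python
import hashlib

# Stand-in for the real cache layer (assumption: a simple in-process dict).
_embedding_cache: dict[str, list[float]] = {}


def idempotency_key(input_text: str, model: str) -> str:
    # hash(input + model), per the flow configuration above
    return hashlib.sha256(f"{input_text}|{model}".encode()).hexdigest()


def generate_embedding(input_text, model, embed_fn, expected_dims):
    """embed_fn is a hypothetical injected provider call (e.g. via LiteLLM)."""
    if not input_text:
        raise ValueError("EmbeddingInputNonEmpty (POL-LLM-004): input must be non-empty")
    key = idempotency_key(input_text, model)
    if key in _embedding_cache:                 # cache hit: return existing entity
        return _embedding_cache[key]
    vector = embed_fn(input_text, model)
    if len(vector) != expected_dims:            # must match ModelSpec dimensions
        raise ValueError(f"expected {expected_dims} dims, got {len(vector)}")
    _embedding_cache[key] = vector
    return vector
```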
Job: Retrieve ListAvailableModels
User Story: As an application or UI component, I want to query available models and their capabilities, so that I can display model options and validate compatibility.
EARS Requirement:
- While the system is operational, when a ListAvailableModels query (QRY-001) is received with an optional provider filter, the llm-provider context shall:
  - Query ProviderConfiguration read model (strong consistency)
  - Return array of ModelSpec entries containing:
    - model_name: Model identifier
    - provider_id: Associated provider reference
    - context_window: Maximum input context length
    - max_tokens: Maximum output tokens (must be positive per ModelHasPositiveMaxTokens policy POL-LLM-002)
    - cost_metrics: Input/output token pricing
    - feature_flags: Supported capabilities (chat, embedding, function_calling, streaming)
    - is_available: Current availability status
  - Filter by provider_id if specified
  - Return OpenAI-compatible model list format for external integration
  - Include provider name and type for each model
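A minimal sketch of the query shape, assuming the OpenAI `/v1/models` list envelope (`{"object": "list", "data": [...]}`); the ModelSpec fields are taken from the list above, with cost_metrics omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    model_name: str
    provider_id: str
    context_window: int
    max_tokens: int
    feature_flags: tuple        # e.g. ("chat", "streaming")
    is_available: bool = True


def list_available_models(specs, provider_id=None):
    """Optionally filter by provider, then emit OpenAI-compatible list format."""
    rows = [s for s in specs if provider_id is None or s.provider_id == provider_id]
    return {
        "object": "list",
        "data": [{"id": s.model_name, "object": "model", "owned_by": s.provider_id}
                 for s in rows],
    }
```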
Job: Retrieve GetProviderHealth
User Story: As a monitoring system or load balancer, I want to check provider health status and performance metrics, so that I can make routing decisions and detect issues.
EARS Requirement:
- While the system is operational, when a GetProviderHealth query (QRY-002) is received with a provider identifier, the llm-provider context shall:
  - Query HealthCache read model (eventual consistency)
  - Return ProviderHealth entity containing:
    - provider_id: Provider identifier
    - is_healthy: Boolean health status
    - error_count: Recent error count for circuit breaker
    - average_latency: Average response time in milliseconds
    - last_error: Most recent error message (if any)
    - last_success_timestamp: Last successful request time
    - rate_limit_status: Current rate limit state (OK/THROTTLED/BLOCKED)
  - Return aggregate health if no specific provider requested
  - Include fallback chain status if configured
  - Calculate health score based on error rate and latency thresholds
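The epic requires a health score from error rate and latency but does not fix a formula; one plausible sketch, with the 0.7/0.3 weights and the latency budget as pure assumptions:

```python
def health_score(error_count: int, request_count: int,
                 average_latency_ms: float,
                 latency_budget_ms: float = 2000.0) -> float:
    """Blend error rate and latency (vs. a budget) into a 0.0-1.0 score.

    Weights (0.7 errors / 0.3 latency) and the 2s budget are illustrative
    defaults, not values specified by this epic.
    """
    if request_count == 0:
        return 1.0                                   # no traffic: assume healthy
    error_rate = error_count / request_count
    latency_penalty = min(average_latency_ms / latency_budget_ms, 1.0)
    return max(0.0, 1.0 - 0.7 * error_rate - 0.3 * latency_penalty)
```

A load balancer could then mark a provider unhealthy below some threshold (e.g. 0.5) and trip the circuit breaker on repeated errors.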
Domain Entities Summary
Root Aggregates
- LlmProvider: Represents an LLM provider configuration with id, name, provider_type (OpenAI/Anthropic/Ollama/OpenRouter), endpoint, api_key_ref, and is_active status
- ChatCompletion: Result of chat completion requests with response, TokenUsage metrics, finish_reason, and latency tracking
- Embedding: Vector embeddings for text with vector array, dimensions, input text reference, and model identifier
Value Objects
- ProviderConfig: Runtime configuration per provider including timeout, retry count, and rate limits (RPM/TPM)
- ModelSpec: Defines capabilities and constraints of specific models (model_name, provider_id, context_window, max_tokens, cost metrics, feature flags)
- ChatMessage: Individual messages in conversation with role (system/user/assistant/function), content, and optional name/function_call
- TokenUsage: Tracks token consumption and its cost (prompt_tokens, completion_tokens, total_tokens, calculated cost)
- FallbackChain: Ordered list of provider IDs for backup with error and rate limit fallback triggers
- ProviderHealth: Health monitoring with is_healthy status, error_count, average_latency, and last_error
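The TokenUsage value object and its cost calculation can be sketched as below; per-1k-token pricing is an assumed convention (common across providers), and the actual rates would come from the provider-specific cost_metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def cost(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        # Rates are caller-supplied from the provider's cost_metrics;
        # per-1k-token pricing is an assumption of this sketch.
        return (self.prompt_tokens / 1000) * input_price_per_1k \
             + (self.completion_tokens / 1000) * output_price_per_1k
```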
Read Models
- ProviderConfiguration: Materialized view of provider/model configurations for query optimization with strong consistency guarantees
- HealthCache: Eventual-consistency read model for provider health monitoring with TTL and read-only usage
Policy Rules
- ProviderHasName (POL-LLM-001): ProviderConfig must have a name
- ModelHasPositiveMaxTokens (POL-LLM-002): ModelSpec must define a positive max_tokens
- MessagesNonEmpty (POL-LLM-003): ChatCompletion messages must be non-empty
- EmbeddingInputNonEmpty (POL-LLM-004): Embedding input must be non-empty
- GatewayBypassOnlyInDev (POL-LLM-005): Policy Gateway bypass only allowed in development
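The policy rules above are simple invariants; a sketch of checking them as guard clauses (the function signature is illustrative, and POL-LLM-004 is enforced in the embedding flow rather than here):

```python
def enforce_policies(provider_name: str, max_tokens: int,
                     messages: list, environment: str,
                     bypass_gateway: bool) -> None:
    """Raise ValueError on the first violated POL-LLM invariant."""
    if not provider_name:
        raise ValueError("POL-LLM-001: ProviderConfig must have a name")
    if max_tokens <= 0:
        raise ValueError("POL-LLM-002: ModelSpec max_tokens must be positive")
    if not messages:
        raise ValueError("POL-LLM-003: chat messages must be non-empty")
    # POL-LLM-004 (embedding input non-empty) is checked in the embedding flow.
    if bypass_gateway and environment != "development":
        raise ValueError("POL-LLM-005: gateway bypass only allowed in development")
```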
Integration Points
- Policy Gateway: Routes all production requests through policy enforcement with circuit breaker pattern
- LiteLLM Abstraction: Unified interface for 100+ LLM providers (OpenAI, Anthropic, Ollama, OpenRouter, etc.)
- OpenTelemetry: Comprehensive observability with spans for model, latency, token usage, and exceptions
- Cache Layer: Idempotency key lookup and embedding cache (hit/miss handling for GenerateEmbedding)
- NATS Messaging: Event publishing to topics (llm-provider.chat_completed.v1, llm-provider.embedding_generated.v1)
- Query Context: Provides chat completion and embedding services to query pipeline
- Memory Context: Supplies embedding generation for vector indexing
- Ingest Context: Uses embeddings for document chunking and indexing
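The epic names the NATS topics but does not fix an event schema; a minimal sketch of building a chat_completed.v1 payload, where the envelope fields (event_id, occurred_at) and the nested data shape are assumptions of this sketch:

```python
import json
import uuid
from datetime import datetime, timezone


def chat_completed_event(completion_id: str, model: str,
                         total_tokens: int, latency_ms: float) -> bytes:
    """Serialize a payload for the llm-provider.chat_completed.v1 topic."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),          # assumed envelope field
        "type": "llm-provider.chat_completed.v1",
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "data": {
            "completion_id": completion_id,
            "model": model,
            "total_tokens": total_tokens,
            "latency_ms": latency_ms,
        },
    }).encode()
```

The publisher would hand these bytes to the NATS client (e.g. `nc.publish("llm-provider.chat_completed.v1", payload)` with nats-py).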