ADR-009: Walking Skeleton RAG Query Orchestration
Status: Accepted
Version: 1.0
Date: 2026-01-01
Supersedes: N/A
Related ADRs: ADR-006 (Ingest), ADR-007 (Memory), ADR-008 (Governance)
Related PRDs: PRD-QUERY-001
Context
The Walking Skeleton requires end-to-end query orchestration to complete the golden thread. After ingesting policies (S1A), enabling retrieval (S1B), and enforcing governance (S1C), we need an orchestration layer that:
- Accepts natural language queries
- Retrieves relevant policies via semantic search
- Enforces governance checks
- Synthesizes answers using RAG
Per P001-SKELETON, this is the final component: “Ask ‘What is this policy?’ → Retrieve → Synthesize answer”
Decision
Implement a minimal RAG query service for Cycle S1D with:
- Orchestration: Semantic Kernel (SK) framework
- Query Flow: NL query → embeddings → similarity search → governance → synthesis
- LLM: Local model via llama.cpp (Gemma-2B or Phi-3)
- Response: Structured answer with sources and confidence
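The query flow above can be sketched as a minimal pipeline. This is an illustrative sketch, not SK's actual API: `embed`, `governance_allows`, and the synthesis step are hypothetical stubs standing in for the local embedding model, the OPA check, and the llama.cpp-backed LLM.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    id: str
    text: str
    embedding: list[float]

def embed(text: str) -> list[float]:
    # Toy stand-in: a real implementation would call the local embedding model.
    return [sum(ord(c) for c in text) % 97 / 97.0,
            len(text) / 100.0,
            text.count(" ") / 10.0]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], policies: list[Policy], k: int = 5) -> list[Policy]:
    # Similarity search: rank stored policies by cosine similarity, keep top-k.
    return sorted(policies, key=lambda p: cosine(query_vec, p.embedding), reverse=True)[:k]

def governance_allows(policy: Policy) -> bool:
    # Stub: a real implementation would query OPA for an allow/deny decision.
    return True

def answer_query(query: str, policies: list[Policy]) -> dict:
    # NL query -> embedding -> similarity search -> governance -> synthesis.
    qvec = embed(query)
    hits = [p for p in top_k(qvec, policies) if governance_allows(p)]
    # Stub synthesis: a real implementation would prompt the local LLM
    # with the retrieved policies as RAG context.
    answer = f"Synthesized from {len(hits)} retrieved policies."
    return {"answer": answer, "sources": [p.id for p in hits], "confidence": 0.5}
```

Each stage is a seam where the real component (embedding model, vector store, OPA, LLM) plugs in.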
Rationale
Semantic Kernel as Orchestrator
- Lightweight: Minimal dependencies, C#/Python/TS support
- Pluggable: Swappable LLMs, memory stores, planners
- Local-first: Supports llama.cpp backend
- Structured: Built-in prompt templates, function calling
Alternatives Considered
| Alternative | Rejected Because |
| --- | --- |
| LangChain | Heavier dependency footprint, more complex abstractions |
| LlamaIndex | More opinionated, steeper learning curve |
| Custom orchestration | Reinventing the wheel, harder to maintain |
| OpenAI API | Non-local, non-deterministic, requires API keys |
Consequences
Positive
- Zero external API dependencies
- Structured orchestration (prompts, planners)
- Local LLM inference (privacy, no cost)
- Foundation for complex multi-step queries
- Pluggable architecture (easy to swap components)
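The pluggability benefit can be illustrated with a minimal backend interface. The names here (`CompletionBackend`, `synthesize`) are hypothetical, not Semantic Kernel's actual API; the point is that the LLM behind synthesis is a swappable dependency.

```python
from typing import Protocol

class CompletionBackend(Protocol):
    """Anything that turns a prompt into a completion."""
    def complete(self, prompt: str) -> str: ...

class EchoBackend:
    # Toy backend for testing; a real one would wrap llama.cpp
    # (e.g. Gemma-2B-Instruct or Phi-3).
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"

def synthesize(backend: CompletionBackend, question: str, context: list[str]) -> str:
    # RAG synthesis: ground the answer in retrieved policy text only.
    prompt = ("Answer using only this context:\n"
              + "\n".join(context)
              + f"\nQ: {question}")
    return backend.complete(prompt)
```

Swapping Gemma-2B for Phi-3, or a local model for a hosted one, then touches only the backend, not the pipeline.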
Negative
- Semantic Kernel learning curve
- Limited LLM quality (small local models)
- Inference latency on CPU
- Memory constraints for large models
Implementation Notes
- SK Framework: C# or Python runtime
- LLM Backend: llama.cpp with Gemma-2B-Instruct
- Query pipeline:
- Parse natural language query
- Generate query embedding
- Semantic search (top-5 policies)
- Governance check (OPA)
- LLM synthesis with RAG context
- Return structured answer
- Response format: JSON with answer, sources, confidence
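An illustrative response payload for the format above; the field values and the shape of the `sources` entries are assumptions, not a fixed schema:

```json
{
  "answer": "This policy defines ...",
  "sources": [
    { "policy_id": "POL-001", "score": 0.87 }
  ],
  "confidence": 0.82
}
```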
Success Criteria
Next Steps:
- Define PRD-QUERY-001 (requirements)
- Design SDS-QUERY-010 (service architecture)
- Implement Cycle S1D (RAG orchestration)
- Create end-to-end integration test