ADR-006: Walking Skeleton Ingest Pipeline
Status: Accepted
Version: 1.0
Date: 2026-01-01
Supersedes: N/A
Related ADRs: ADR-004 (Semantic Core Formalization)
Related PRDs: PRD-INGEST-001
Context
The Walking Skeleton requires a minimal end-to-end flow to validate system architecture and component connectivity. The first critical capability is ingesting SEA-DSL policy files, parsing them into structured representations, and indexing them for later retrieval and governance checks.
Per the Walking Skeleton plan (P001-SKELETON), we need to demonstrate:
- Parse .sea files using tree-sitter
- Store RDF triples in Oxigraph
- Store embeddings in pgvector
- Enable downstream query and policy enforcement
Decision
Implement a minimal ingest pipeline for Cycle S1A with the following components:
- Parser: Use tree-sitter-based SEA-DSL parser to produce AST
- Triple Generator: Convert AST to RDF triples (Oxigraph format)
- Embedding Generator: Generate embeddings using EmbeddingGemma (llama.cpp)
- Dual Storage: Store triples in Oxigraph, embeddings in pgvector
- Idempotency: Support re-ingestion without duplication
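The idempotency requirement above can be sketched as a content-hash check in front of a replace-style write. This is a minimal illustration under stated assumptions, not the planned implementation: Python is used only for brevity, `InMemoryStores` is a hypothetical stand-in for Oxigraph and pgvector, and `ingest`, `toy_parse`, and `toy_embed` are invented names standing in for the tree-sitter parser, triple generator, and EmbeddingGemma.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InMemoryStores:
    """Hypothetical stand-ins for Oxigraph (triples) and pgvector (embeddings)."""
    triples: dict = field(default_factory=dict)     # source path -> set of triples
    embeddings: dict = field(default_factory=dict)  # source path -> vector
    digests: dict = field(default_factory=dict)     # source path -> content hash


def ingest(stores, path, source, parse, embed):
    """Ingest one .sea source; return True if stores changed, False if skipped."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    if stores.digests.get(path) == digest:
        return False  # unchanged content: re-ingestion is a no-op

    # Do all fallible work before the first write, so a parse error
    # leaves both stores untouched.
    triples = set(parse(source))
    vector = embed(source)

    # Replace rather than append, so a changed file never leaves stale rows.
    stores.triples[path] = triples
    stores.embeddings[path] = vector
    stores.digests[path] = digest
    return True


def toy_parse(source):
    """Placeholder for the parser and triple generator (assumption)."""
    return {
        ("urn:sea:doc", "urn:sea:hasLine", line.strip())
        for line in source.splitlines()
        if line.strip()
    }


def toy_embed(source):
    """Placeholder for the embedding model (assumption)."""
    return [float(len(source))]


stores = InMemoryStores()
first = ingest(stores, "a.sea", "policy X\n", toy_parse, toy_embed)
second = ingest(stores, "a.sea", "policy X\n", toy_parse, toy_embed)
```

Here the second call is skipped because the content hash is unchanged. Keying on the source path and replacing the previous rows is one possible design; a real implementation would also need to coordinate the two stores so that a failure between the writes cannot leave them inconsistent.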
Rationale
Local-First Stack
- tree-sitter: Zero-dependency parsing, deterministic AST generation
- Oxigraph: Embedded RDF store, no external service dependencies
- pgvector: PostgreSQL extension, leverages existing infrastructure
- EmbeddingGemma: Local embedding model via llama.cpp, no API calls
Alternatives Considered
| Alternative | Rejected Because |
| Python parser (lark/PLY) | Slower, harder to integrate with Rust/TS ecosystem |
| Remote embedding API (OpenAI) | Non-deterministic, requires internet, cost |
| In-memory triple store | Data loss on restart, no persistence |
| MongoDB for vectors | Requires additional service, less mature vector search |
Consequences
Positive
- Zero external API dependencies
- Deterministic parsing and embedding
- Fast local development cycle
- Foundation for future governance checks
Negative
- Embedding quality limited by EmbeddingGemma model size
- Initial setup complexity (llama.cpp, Oxigraph bindings)
- pgvector performance limits at scale (acceptable for skeleton)
Implementation Notes
- Parser output: AST JSON → RDF triples (N-Triples format)
- Embedding dimension: 768 (EmbeddingGemma default; Matryoshka truncation to 512, 256, or 128 is supported)
- Storage transaction: Atomic write to both Oxigraph and pgvector
- Error handling: Fail fast on parse errors, log indexing failures
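The first and last notes above (AST JSON to N-Triples, fail fast on parse errors) can be sketched together. The AST shape used here (`type`, optional `text`, optional `children`) and the `https://example.org/sea#` namespace are assumptions for illustration; the real parser output and vocabulary are defined elsewhere. The sketch treats a node of type `ERROR`, which is how tree-sitter marks unparseable input, as a fatal parse error. Python is used only for brevity.

```python
import json

NS = "https://example.org/sea#"  # hypothetical namespace, not the project's


class ParseError(ValueError):
    """Raised when the AST contains a tree-sitter ERROR node."""


def escape_literal(value):
    """Escape the characters N-Triples most commonly requires in literals."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def ast_to_ntriples(ast_json, doc_id):
    """Convert an AST JSON string into a list of N-Triples lines.

    Assumes doc_id is already safe to embed in an IRI.
    """
    root = json.loads(ast_json)
    lines = []
    counter = 0

    def visit(node, parent_iri):
        nonlocal counter
        if node["type"] == "ERROR":
            # Fail fast: nothing is returned, so nothing gets indexed.
            raise ParseError(f"parse error in {doc_id}")
        iri = f"<{NS}{doc_id}/node/{counter}>"
        counter += 1
        lines.append(f'{iri} <{NS}nodeType> "{escape_literal(node["type"])}" .')
        if "text" in node:
            lines.append(f'{iri} <{NS}text> "{escape_literal(node["text"])}" .')
        if parent_iri is not None:
            lines.append(f"{parent_iri} <{NS}hasChild> {iri} .")
        for child in node.get("children", []):
            visit(child, iri)

    visit(root, None)
    return lines


example_ast = json.dumps(
    {"type": "policy", "children": [{"type": "name", "text": 'say "hi"'}]}
)
ntriples = ast_to_ntriples(example_ast, "doc1")
```

Numbering nodes in traversal order keeps the output deterministic for a given AST, which matters for idempotent re-ingestion. Because the function raises before returning any lines, a caller that writes only the returned list never indexes a partially converted document.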
Success Criteria
Next Steps:
- Define PRD-INGEST-001 (requirements)
- Design SDS-INGEST-010 (service architecture)
- Implement Cycle S1A (parse + index)