Ingest Epic
User Journey
The Ingest bounded context enables the ingestion, parsing, validation, and indexing of SEA-DSL policy files for runtime governance enforcement. It processes .sea files through a multi-stage pipeline: parsing to AST via tree-sitter-sea grammar, RDF triple generation for semantic storage in Oxigraph, and vector embedding generation via EmbeddingGemma for similarity search in pgvector.
Jobs to be Done & EARS Requirements
Job: Ingest SEA-DSL Policy File
User Story: As a policy author, I want to ingest a SEA-DSL policy file into the system, so that the policy can be queried, enforced, and governed during runtime.
EARS Requirement:
- While the system is operational, when a PolicyFile is provided via IngestPolicy command, the ingest context shall:
- Parse SEA-DSL File (AC-001.1):
- Use tree-sitter-sea grammar to parse the .sea file
- Produce valid AST JSON representation with syntax node hierarchy
- Return clear parse error with line/column information if invalid (AC-001.7)
- Generate RDF Triples (AC-001.2):
- Process AST JSON through triple generator
- Produce RDF triples in N-Triples format representing policy semantics
- Ensure deterministic parsing: same input produces same AST (NFR-001.1)
- Store Triples in Oxigraph (AC-001.3):
- Write RDF triples to Oxigraph RDF triple store
- Ensure triples are queryable via SPARQL interface
- Participate in saga coordination for cross-store consistency (NFR-001.3)
- Generate Embeddings (AC-001.4):
- Extract policy text content from AST
- Process through EmbeddingGemma model (llama.cpp)
- Produce 384-dimensional vector representation
- Ensure local-first processing with no external API dependencies (NFR-001.2)
- Store Embeddings in pgvector (AC-001.5):
- Write embedding vector to PostgreSQL with pgvector extension
- Make the embedding searchable via cosine similarity operations
- Complete saga step with compensating action on failure (NFR-001.3)
- Handle Idempotent Re-Ingestion (AC-001.6):
- Check for existing PolicyDocument matching content hash
- Confirm hash match with byte-for-byte content equality to detect collisions
- If content matches but metadata differs, append filename to filenames[]/aliases
- Update ingestion_timestamp and source metadata; preserve original document_id and canonical_filename
- Apply versioning policy: maintain version_number + changelog on re-ingestion
- Emit Observability Signals (NFR-001.4):
- Emit structured logs with ingestion status
- Create OpenTelemetry trace with spans for each pipeline stage
- Record metrics: parse success rate, ingestion latency, storage write success
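The idempotent re-ingestion step (AC-001.6) can be sketched as follows. This is a minimal illustration, not the actual implementation: the in-memory dict standing in for the document store, the choice of SHA-256, the document_id format, and the decision to bump version_number on every re-ingestion are all assumptions; only the field names come from the requirement text.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PolicyDocument:
    # Field names follow the requirement text; the exact schema is assumed.
    document_id: str
    canonical_filename: str
    content: bytes
    content_hash: str
    ingestion_timestamp: datetime
    filenames: list = field(default_factory=list)
    version_number: int = 1
    changelog: list = field(default_factory=list)


def content_hash(content: bytes) -> str:
    # SHA-256 is an assumed choice; the source only says "content hash".
    return hashlib.sha256(content).hexdigest()


def ingest(store: dict, filename: str, content: bytes) -> PolicyDocument:
    """Return the existing document on re-ingestion, or create a new one."""
    digest = content_hash(content)
    now = datetime.now(timezone.utc)
    existing = store.get(digest)
    if existing is not None:
        # Confirm the hash match byte-for-byte to rule out a collision.
        if existing.content != content:
            raise ValueError("hash collision: same digest, different content")
        # Same content under a new name: record the alias only once.
        if filename not in existing.filenames:
            existing.filenames.append(filename)
        # document_id and canonical_filename are deliberately left untouched.
        existing.ingestion_timestamp = now
        existing.version_number += 1  # assumed versioning policy
        existing.changelog.append(f"re-ingested as {filename}")
        return existing
    doc = PolicyDocument(
        document_id=f"doc-{digest[:12]}",  # hypothetical ID scheme
        canonical_filename=filename,
        content=content,
        content_hash=digest,
        ingestion_timestamp=now,
        filenames=[filename],
    )
    store[digest] = doc
    return doc
```

Calling ingest twice with identical bytes under two filenames yields one stored record whose filenames list holds both names, which is the behavior the Idempotency Check metric measures.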
Error Handling Strategy
- Retries: transient failures are retried for up to 3 attempts with exponential backoff (base 250ms, max 5s).
- Retriable failures: network timeouts, connection resets, temporary DB/HTTP 5xx.
- Terminal failures: schema validation errors, invalid DSL, or repeated hash mismatch.
- Compensation: if Oxigraph writes succeed but pgvector write fails, delete/rollback Oxigraph triples or mark the ingestion as failed and schedule cleanup.
- Circuit breaker: trip on repeated store failures; fail fast with retry-after.
- Observability failures must not block ingestion success (best-effort logs/metrics with warnings).
- Client error response: structured error with code, message, stage, retryable flag, and policy IDs.
- AC-001.8: Partial failures are recovered via retry/compensation with testable outcomes per stage.
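The retry and compensation rules above could look roughly like this. It is a sketch under stated assumptions: "3 attempts" is read as three total tries, the backoff doubles on each retry (the source only gives the base and the cap), no jitter is applied, and TransientError, with_retry, and ingest_saga are hypothetical names. The store operations are passed in as plain callables rather than real Oxigraph or pgvector clients.

```python
import time


class TransientError(Exception):
    """Retriable failure, e.g. a network timeout or a temporary 5xx."""


def backoff_delays(attempts=3, base=0.25, cap=5.0):
    # Wait before each retry: base * 2**n seconds, capped at `cap`.
    # Three attempts means at most two waits.
    return [min(cap, base * 2 ** n) for n in range(attempts - 1)]


def with_retry(operation, attempts=3, base=0.25, cap=5.0, sleep=time.sleep):
    """Run `operation`, retrying only on TransientError."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the failure
            sleep(delays[attempt])
    # Any other exception type is treated as terminal and propagates at once.


def ingest_saga(write_triples, delete_triples, write_embedding, sleep=time.sleep):
    """Write to both stores; undo the first write if the second fails."""
    with_retry(write_triples, sleep=sleep)
    try:
        with_retry(write_embedding, sleep=sleep)
    except Exception:
        # Compensating action: remove the triples already written.
        delete_triples()
        raise
```

Passing sleep as a parameter keeps the per-stage outcomes testable (AC-001.8) without real delays: a test can inject a no-op and assert on the order of calls.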
Job: Query Ingested Policy
User Story: As a governance service or query system, I want to retrieve ingested policy documents, so that I can enforce policies at runtime.
EARS Requirement:
- While the system is operational, when a GetPolicy query is received with policy identifier, the ingest context shall:
- Retrieve PolicyDocument aggregate matching the identifier
- Return policy metadata including:
- Document ID and content hash
- Original filename and ingestion timestamp
- AST JSON representation
- Oxigraph SPARQL endpoint reference
- pgvector embedding reference
- Return null or 404 if policy not found
Job: Search Policies by Similarity
User Story: As a query system, I want to find semantically similar policies using vector search, so that I can retrieve related governance rules.
EARS Requirement:
- While the system is operational, when a SearchPolicies query is received with query text and similarity threshold, the ingest context shall:
- Generate embedding vector for query text using EmbeddingGemma
- Query pgvector for cosine similarity against stored policy embeddings
- Return array of PolicyDocument entries exceeding similarity threshold
- Include similarity score and policy metadata in results
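The threshold filter in the steps above can be illustrated in plain Python. This is only a sketch of the semantics: in the real system the comparison would run inside pgvector, whose cosine operator (<=>) returns a distance, so similarity is 1 minus that value. The function names and the dict of stored vectors are hypothetical, and "exceeding" is read as a strict greater-than.

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); defined as 0.0 when either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, stored, threshold):
    """Return (document_id, score) pairs above the threshold, best first."""
    scored = (
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in stored.items()
    )
    hits = [(doc_id, score) for doc_id, score in scored if score > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Each returned pair carries the score alongside the identifier, so the caller can attach policy metadata to it as the last requirement asks.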
Job: Query Policy via SPARQL
User Story: As a semantic query system, I want to execute SPARQL queries against policy triples, so that I can perform complex semantic reasoning.
EARS Requirement:
- While the system is operational, when a QueryTriples request is received with SPARQL query string, the ingest context shall:
- Route SPARQL query to Oxigraph triple store
- Execute query and retrieve matching RDF triples
- Return results in requested format (JSON, XML, N-Triples)
- Handle query errors with clear error messages
Domain Entities Summary
Root Aggregates
- PolicyDocument: Represents an ingested SEA-DSL policy with document ID, content hash, original filename, ingestion timestamp, AST JSON, Oxigraph reference, and pgvector embedding reference
- PolicyFile: Input entity representing the SEA-DSL file being ingested with file path and content
Value Objects
- AST Representation: JSON abstract syntax tree from tree-sitter-sea parsing
- RDF Triples: N-Triples format semantic representation for Oxigraph storage
- Embedding Vector: 384-dimensional vector from EmbeddingGemma for pgvector similarity search
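To make the RDF Triples value object concrete, here is a small sketch of serializing triples to N-Triples lines. The input tuple shape, the function names, and the example.org IRIs in the usage note are hypothetical; the source does not describe the triple generator's interface. Sorting the output lines is an added assumption that makes the result independent of input order.

```python
def escape_literal(value: str) -> str:
    # Escape the characters that must not appear raw in an N-Triples literal.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(triples) -> str:
    """Serialize (subject_iri, predicate_iri, object, object_is_iri) tuples."""
    lines = []
    for subject, predicate, obj, object_is_iri in triples:
        # Objects are either IRIs in angle brackets or quoted plain literals.
        rendered = f"<{obj}>" if object_is_iri else f'"{escape_literal(obj)}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    # One statement per line, each terminated by " ." and a newline.
    return "".join(line + "\n" for line in sorted(lines))
```

For example, to_ntriples([("http://example.org/policy/p1", "http://example.org/vocab#name", "Access policy", False)]) produces a single line with the subject and predicate in angle brackets and the name as a quoted literal.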
Policy Rules
- IngestionRule: Validates that at least one entity is present in ingestion requests
Integration Points
- Semantic Core Context: Provides vocabulary alignment and semantic grounding for policy content
- Memory Context: Stores embedding vectors for similarity search operations
- Governance Context: Consumes ingested policies for runtime enforcement
- Oxigraph: RDF triple store for SPARQL-queryable semantic policy storage
- PostgreSQL + pgvector: Vector storage for embedding-based similarity search
- llama.cpp + EmbeddingGemma: Local embedding generation (no external API dependencies)
Success Metrics
- Parse Success Rate: >99% for valid .sea files
- Ingestion Latency: <500ms for typical policy file (p95)
- Saga Completion: 99.9% of ingestions complete without manual intervention
- Idempotency Check: 100% of re-ingestions of identical content create no duplicate records
Non-Functional Requirements
- NFR-001.1: Deterministic parsing (same input → same AST)
- NFR-001.2: Local-first (no external API dependencies)
- NFR-001.3: Saga-based consistency with compensating actions and retries across Oxigraph + pgvector
- NFR-001.4: Observability (emit structured logs + OpenTelemetry traces)