Ingest Epic

User Journey

The Ingest bounded context ingests, parses, validates, and indexes SEA-DSL policy files for runtime governance enforcement. It processes .sea files through a multi-stage pipeline: parsing to an AST with the tree-sitter-sea grammar, generating RDF triples for semantic storage in Oxigraph, and generating vector embeddings with EmbeddingGemma for similarity search in pgvector.

Jobs to be Done & EARS Requirements

Job: Ingest SEA-DSL Policy File

User Story: As a policy author, I want to ingest a SEA-DSL policy file into the system, so that the policy can be queried, enforced, and governed during runtime.

EARS Requirement:

While the system is operational, when a PolicyFile is provided via IngestPolicy command, the ingest context shall:
1. Parse SEA-DSL File (AC-001.1):
  - Use tree-sitter-sea grammar to parse the .sea file
  - Produce valid AST JSON representation with syntax node hierarchy
  - Return clear parse error with line/column information if invalid (AC-001.7)
2. Generate RDF Triples (AC-001.2):
  - Process AST JSON through triple generator
  - Produce RDF triples in N-Triples format representing policy semantics
  - Ensure deterministic output: the same input produces the same AST, and therefore the same triples (NFR-001.1)
3. Store Triples in Oxigraph (AC-001.3):
  - Write RDF triples to Oxigraph RDF triple store
  - Ensure triples are queryable via SPARQL interface
  - Participate in saga coordination for cross-store consistency (NFR-001.3)
4. Generate Embeddings (AC-001.4):
  - Extract policy text content from AST
  - Process through EmbeddingGemma model (llama.cpp)
  - Produce 384-dimensional vector representation
  - Ensure local-first processing with no external API dependencies (NFR-001.2)
5. Store Embeddings in pgvector (AC-001.5):
  - Write embedding vector to PostgreSQL with pgvector extension
  - Make the embedding searchable via cosine similarity operations
  - Complete saga step with compensating action on failure (NFR-001.3)
6. Handle Idempotent Re-Ingestion (AC-001.6):
  - Check for existing PolicyDocument matching content hash
  - Confirm hash match with byte-for-byte content equality to detect collisions
  - If content matches but metadata differs, append filename to filenames[]/aliases
  - Update ingestion_timestamp and source metadata; preserve original document_id and canonical_filename
  - Apply versioning policy: maintain version_number + changelog on re-ingestion
7. Emit Observability Signals (NFR-001.4):
  - Emit structured logs with ingestion status
  - Create OpenTelemetry trace with spans for each pipeline stage
  - Record metrics: parse success rate, ingestion latency, storage write success
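
The idempotent re-ingestion rules in step 6 can be sketched as follows. This is a minimal illustration, not the actual implementation: the in-memory `store` dict, the `ingest` function, the `doc-N` ID scheme, the choice of SHA-256, and the "bump `version_number` on every re-ingestion" policy are all assumptions made for the example. Only the field names come from the requirement above.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class PolicyDocument:
    document_id: str
    content: bytes
    content_hash: str
    canonical_filename: str
    filenames: List[str] = field(default_factory=list)
    version_number: int = 1
    changelog: List[str] = field(default_factory=list)
    ingestion_timestamp: Optional[datetime] = None


class HashCollisionError(Exception):
    """Raised when two different byte strings share one content hash."""


def ingest(store: Dict[str, PolicyDocument], filename: str, content: bytes) -> PolicyDocument:
    # Hypothetical helper: `store` maps content hash -> document.
    digest = hashlib.sha256(content).hexdigest()  # hash algorithm is an assumption
    now = datetime.now(timezone.utc)
    existing = store.get(digest)

    if existing is None:
        doc = PolicyDocument(
            document_id=f"doc-{len(store) + 1}",  # illustrative ID scheme
            content=content,
            content_hash=digest,
            canonical_filename=filename,
            filenames=[filename],
            ingestion_timestamp=now,
        )
        doc.changelog.append(f"v1: initial ingestion of {filename}")
        store[digest] = doc
        return doc

    # Confirm the hash match byte-for-byte to rule out a collision.
    if existing.content != content:
        raise HashCollisionError(digest)

    # Same content under a new name: record the alias, keep the identity.
    if filename not in existing.filenames:
        existing.filenames.append(filename)

    # Assumed versioning policy: every re-ingestion bumps the version.
    existing.version_number += 1
    existing.changelog.append(f"v{existing.version_number}: re-ingested as {filename}")
    existing.ingestion_timestamp = now
    return existing
```

Ingesting the same bytes twice under different filenames returns the same object with its original `document_id` and `canonical_filename`, and the second filename appended to `filenames`.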

Error Handling Strategy

Retries: transient failures are retried, up to 3 attempts, with exponential backoff (base 250ms, max 5s).
Retriable failures: network timeouts, connection resets, temporary DB/HTTP 5xx.
Terminal failures: schema validation errors, invalid DSL, or repeated hash mismatch.
Compensation: if Oxigraph writes succeed but pgvector write fails, delete/rollback Oxigraph triples or mark the ingestion as failed and schedule cleanup.
Circuit breaker: trip on repeated store failures; fail fast and return a retry-after hint.
Observability failures must not block ingestion success (best-effort logs/metrics with warnings).
Client error response: structured error with code, message, stage, retryable flag, and policy IDs.
AC-001.8: Partial failures are recovered via retry/compensation with testable outcomes per stage.
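
The retry and compensation rules above can be sketched as follows, under stated assumptions: "up to 3 attempts" is read as three total attempts (so two backoff sleeps), the delay doubles on each retry, and the three callables passed to `ingest_saga` are hypothetical stand-ins for the real Oxigraph and pgvector writers. The exception classes and function names are illustrative, not part of the specification.

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Retriable: network timeout, connection reset, temporary 5xx."""


class TerminalError(Exception):
    """Not retriable: invalid DSL, schema validation error, and similar."""


def backoff_delays(max_attempts: int = 3, base: float = 0.25, cap: float = 5.0) -> List[float]:
    # Delays slept between attempts: base * 2**n, capped at `cap` seconds.
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def with_retry(op: Callable[[], object], max_attempts: int = 3,
               base: float = 0.25, cap: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> object:
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; the caller decides on compensation
            sleep(delays[attempt])
    # TerminalError is never caught here, so it propagates immediately.


def ingest_saga(write_triples: Callable[[], object],
                write_embedding: Callable[[], object],
                delete_triples: Callable[[], object],
                sleep: Callable[[float], None] = time.sleep) -> str:
    # Step 1: Oxigraph write. If it fails there is nothing to compensate,
    # so the exception simply propagates.
    with_retry(write_triples, sleep=sleep)
    # Step 2: pgvector write. On failure, undo step 1.
    try:
        with_retry(write_embedding, sleep=sleep)
    except (TransientError, TerminalError):
        delete_triples()  # compensating action
        return "failed"
    return "completed"
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting, which lines up with the "testable outcomes per stage" wording in AC-001.8.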

Job: Query Ingested Policy

User Story: As a governance service or query system, I want to retrieve ingested policy documents, so that I can enforce policies at runtime.

EARS Requirement:

While the system is operational, when a GetPolicy query is received with policy identifier, the ingest context shall:
1. Retrieve PolicyDocument aggregate matching the identifier
2. Return policy metadata including:
  - Document ID and content hash
  - Original filename and ingestion timestamp
  - AST JSON representation
  - Oxigraph SPARQL endpoint reference
  - pgvector embedding reference
3. Return null or 404 if policy not found

Job: Search Policies by Similarity

User Story: As a query system, I want to find semantically similar policies using vector search, so that I can retrieve related governance rules.

EARS Requirement:

While the system is operational, when a SearchPolicies query is received with query text and similarity threshold, the ingest context shall:
1. Generate embedding vector for query text using EmbeddingGemma
2. Query pgvector for cosine similarity against stored policy embeddings
3. Return array of PolicyDocument entries exceeding similarity threshold
4. Include similarity score and policy metadata in results
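
As an illustration of steps 2 to 4, here is a pure-Python stand-in for the threshold search. In the real system the comparison would run inside pgvector, where the `<=>` operator returns cosine distance, so similarity is `1 - (a <=> b)`. The table and column names in `EXAMPLE_SQL` are hypothetical, and the SQL string is shown for orientation only; the sketch does not execute it.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical schema; `<=>` is pgvector's cosine distance operator.
EXAMPLE_SQL = """
SELECT document_id, 1 - (embedding <=> %(q)s) AS similarity
FROM policy_embeddings
WHERE 1 - (embedding <=> %(q)s) > %(threshold)s
ORDER BY embedding <=> %(q)s
"""


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def search_policies(query_vec: Sequence[float],
                    stored: Dict[str, Sequence[float]],
                    threshold: float) -> List[Tuple[str, float]]:
    # Score every stored embedding, keep those above the threshold,
    # and return (document_id, similarity) pairs, best match first.
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in stored.items()]
    hits = [hit for hit in scored if hit[1] > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The sketch uses a strict `>` comparison because the requirement says "exceeding"; whether a score exactly equal to the threshold should match is left open by the text above.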

Job: Query Policy via SPARQL

User Story: As a semantic query system, I want to execute SPARQL queries against policy triples, so that I can perform complex semantic reasoning.

EARS Requirement:

While the system is operational, when a QueryTriples request is received with SPARQL query string, the ingest context shall:
1. Route SPARQL query to Oxigraph triple store
2. Execute query and retrieve matching RDF triples
3. Return results in requested format (JSON, XML, N-Triples)
4. Handle query errors with clear error messages
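
Steps 3 and 4 might be sketched as below. Under the SPARQL 1.1 Protocol, the available result formats depend on the query form: SELECT and ASK produce bindings (SPARQL JSON or XML results), while CONSTRUCT and DESCRIBE produce graphs (for example N-Triples or RDF/XML). The function names, the `"bindings"` and `"graph"` labels, and the restriction to these four combinations are assumptions of this sketch; the request is built but not sent.

```python
import urllib.request

# (result kind, requested format) -> standard media type
RESULT_MEDIA_TYPES = {
    ("bindings", "json"): "application/sparql-results+json",
    ("bindings", "xml"): "application/sparql-results+xml",
    ("graph", "n-triples"): "application/n-triples",
    ("graph", "xml"): "application/rdf+xml",
}


class UnsupportedFormatError(ValueError):
    """The requested format does not apply to this kind of query result."""


def media_type_for(result_kind: str, fmt: str) -> str:
    key = (result_kind, fmt.lower())
    try:
        return RESULT_MEDIA_TYPES[key]
    except KeyError:
        supported = sorted(f for kind, f in RESULT_MEDIA_TYPES if kind == result_kind)
        raise UnsupportedFormatError(
            f"format {fmt!r} is not available for {result_kind} results; "
            f"supported: {supported}"
        ) from None


def build_query_request(endpoint: str, sparql: str, media_type: str) -> urllib.request.Request:
    # "POST directly" form of the SPARQL 1.1 Protocol: the query string is
    # the request body and the Accept header selects the result format.
    return urllib.request.Request(
        endpoint,
        data=sparql.encode("utf-8"),
        headers={"Content-Type": "application/sparql-query", "Accept": media_type},
        method="POST",
    )
```

For example, `build_query_request("http://localhost:7878/query", query, media_type_for("bindings", "json"))` prepares a SELECT request; the endpoint URL here is an assumed local Oxigraph server address and should be replaced with the deployment's actual endpoint.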

Domain Entities Summary

Root Aggregates

PolicyDocument: Represents an ingested SEA-DSL policy with document ID, content hash, original filename, ingestion timestamp, AST JSON, Oxigraph reference, and pgvector embedding reference
PolicyFile: Input entity representing the SEA-DSL file being ingested with file path and content

Value Objects

AST Representation: JSON abstract syntax tree from tree-sitter-sea parsing
RDF Triples: N-Triples format semantic representation for Oxigraph storage
Embedding Vector: 384-dimensional vector from EmbeddingGemma for pgvector similarity search
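
To make two of these value objects concrete, the following sketch shows one possible shape for them. The class names, the restriction to plain string literals as objects, and the line-sorting used to get a stable serialization are assumptions of the example; the 384-dimension check mirrors the figure stated above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

EMBEDDING_DIM = 384  # dimension stated in this document


@dataclass(frozen=True)
class Triple:
    subject: str    # IRI
    predicate: str  # IRI
    obj: str        # plain string literal (simplification for this sketch)

    def to_ntriples(self) -> str:
        # Escape the characters N-Triples requires in string literals.
        escaped = (
            self.obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        return f'<{self.subject}> <{self.predicate}> "{escaped}" .'


def serialize(triples: Iterable[Triple]) -> str:
    # Sorting the lines makes the output independent of generation order,
    # which helps keep repeated ingestions byte-identical.
    lines = sorted(t.to_ntriples() for t in triples)
    return "\n".join(lines) + "\n"


@dataclass(frozen=True)
class EmbeddingVector:
    values: Tuple[float, ...]

    def __post_init__(self) -> None:
        if len(self.values) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} dimensions, got {len(self.values)}"
            )
```

Frozen dataclasses fit the value-object role: instances are immutable and compare by content rather than by identity.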

Policy Rules

IngestionRule: Validates that at least one entity is present in ingestion requests

Integration Points

Semantic Core Context: Provides vocabulary alignment and semantic grounding for policy content
Memory Context: Stores embedding vectors for similarity search operations
Governance Context: Consumes ingested policies for runtime enforcement
Oxigraph: RDF triple store for SPARQL-queryable semantic policy storage
PostgreSQL + pgvector: Vector storage for embedding-based similarity search
llama.cpp + EmbeddingGemma: Local embedding generation (no external API dependencies)

Success Metrics

Parse Success Rate: >99% for valid .sea files
Ingestion Latency: p95 under 500ms for a typical policy file
Saga Completion: 99.9% of ingestions complete without manual intervention
Idempotency Check: 100% of re-ingestions of identical content create no duplicate records

Non-Functional Requirements

NFR-001.1: Deterministic parsing (same input → same AST)
NFR-001.2: Local-first (no external API dependencies)
NFR-001.3: Saga-based consistency with compensating actions and retries across Oxigraph + pgvector
NFR-001.4: Observability (emit structured logs + OpenTelemetry traces)